Architecture
Technical reference for the local-first AI chat platform.
Overview
Local Chat is a self-hosted AI assistant that runs entirely on your machine. The browser loads a single-page app from Flask, which proxies requests to Ollama for LLM inference and optional neural speech services for voice interaction.
- Flask: Python web server with modular blueprints.
- Ollama: Local LLM inference at localhost:11434.
- Whisper: Neural speech-to-text (optional).
- Coqui TTS: Neural text-to-speech (optional).
- SQLite: Analytics database.
- Caddy: TLS reverse proxy for HTTPS.
High-Level Architecture
The system follows a three-tier architecture: presentation (browser SPA), application (Flask API), and data/inference (Ollama + SQLite + neural speech models).
Request Flow
How a user message travels through the system:
- User input: The composer captures text (or transcribed speech) and builds a message array with mode and model selection.
- API call: The browser streams `POST /api/stream` with messages, mode (fast/normal/deep), and optional model override.
- Language detection: The backend analyzes the latest user message and tags it with an ISO language code (en/es/fr).
- Prompt assembly: Flask injects a localized system prompt based on mode + language, plus a language guard to ensure reply consistency.
- LLM inference: The request proxies to Ollama's streaming endpoint; tokens arrive chunk by chunk.
- Response cleanup: Hidden `<think>` reasoning blocks are stripped in real time before reaching the client.
- Final steps: The UI renders streaming tokens, captures metrics (TTFT, tok/s), and persists the session to disk.
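The client side of the flow above can be sketched as a small payload builder. The field names (`messages`, `mode`, `model`) follow this document's description of the `/api/stream` request, but the exact schema is an assumption, not a verified contract:

```python
import json

def build_stream_payload(messages, mode="normal", model=None):
    """Assemble the JSON body a client would send to POST /api/stream.

    Field names are taken from this document's request-flow description;
    treat them as illustrative, not as the project's actual schema.
    """
    if mode not in ("fast", "normal", "deep"):
        raise ValueError(f"unknown mode: {mode}")
    payload = {"messages": messages, "mode": mode}
    if model is not None:
        payload["model"] = model  # optional per-request model override
    return payload

body = build_stream_payload(
    [{"role": "user", "content": "Hola, ¿qué es Flask?"}],
    mode="fast",
)
print(json.dumps(body, ensure_ascii=False))
```

Language detection then runs server-side on the latest user message, so the client never has to send a language field itself.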
Core Components
Browser SPA — Single-page application served from static/. Handles streaming chat, session sidebar, settings panel, and speech I/O. State persisted in localStorage.
Flask API — Lightweight Python server with modular blueprints. Handles auth, streaming proxy, session CRUD, analytics, and speech endpoints.
Ollama LLM — Local model inference server running on localhost. Streams tokens via /api/chat with curated model allowlist.
Speech-to-Text — Whisper-based neural transcription. Supports browser Web Speech API or server-side Whisper with configurable model/device.
Text-to-Speech — Coqui TTS neural synthesis with quality presets (normal/better/best). Streams chunked audio as NDJSON WAV.
Storage Layer — JSON files for chat sessions, SQLite for analytics, GeoIP database for country lookup, and structured log files.
Edge Proxy — Caddy reverse proxy handles TLS termination, HTTPS certificates, and forwards traffic to Flask on localhost.
Benchmark Subsystem
The benchmark subsystem is a separate execution path from normal chat requests. It runs controlled tasks across models/datasets, persists run state, and exposes read APIs used by docs pages.
- Runner: Executes benchmark scope (models, datasets, tasks), records per-task outcomes, and updates run lifecycle status.
- Run state storage: Persists benchmark telemetry and task status for safe resume/reset behavior.
- Read APIs: Exposes benchmark status/dataset/last-task endpoints consumed by the monitor UI.
- Report layer: `benchmark_guided.html` and `benchmark_autonomous_claude.html` provide post-run analysis views over collected results.
- Live monitor: `benchmark_monitor.html` polls run state and metrics to visualize progress in near real time.
Operational split: reports are static analysis snapshots; monitor is runtime observability for in-progress runs.
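The "safe resume/reset" behavior of the run-state storage hinges on atomic writes. The sketch below illustrates the technique; the file layout, field names, and `RunState` class are hypothetical, not the project's actual implementation:

```python
import json
import os

class RunState:
    """Illustrative run-state store for resumable benchmark runs."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return {"status": "idle", "completed": []}
        with open(self.path) as f:
            return json.load(f)

    def save(self, state):
        # Write to a temp file and rename: os.replace is atomic, so a
        # crash mid-write never leaves a half-written state file. This
        # is what makes resume after interruption safe.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.path)

    def mark_done(self, task_id):
        state = self.load()
        if task_id not in state["completed"]:
            state["completed"].append(task_id)
        state["status"] = "running"
        self.save(state)
```

On restart, the runner can skip any task already listed in `completed`, which turns a crash into a pause rather than a lost run.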
Live Benchmark Monitor
Page: /docs/benchmark_monitor.html. This is a read-only runtime observability console. It is intentionally more complex than the dashboard because it must track in-flight execution, not only aggregated history.
Data channels and cadence
- Run/state channel: `/api/benchmark/status` is polled at high frequency (a 200ms UI loop with request throttling) and is the source of truth for run identity, lifecycle, scope, and rolling metrics.
- Dataset/task channel: `/api/benchmark/datasets` is polled every ~3s for per-dataset progress and task-level pills.
- Last-task channel: `/api/benchmark/last_task` is polled every ~3s for detailed prompt/response/evaluation context, with "keep last known task" behavior to avoid UI flicker.
- Fallback channels: `/api/gpu` (~5s) and `/api/temperature` (~10s) are used when those samples are missing from benchmark state payloads.
State model and reconciliation
- Run identity gate: the monitor keys state by `run_id` plus start timestamp and resets charts/pills when a new run or restart is detected.
- Progress guards: protects against regressions from transient payloads (e.g., non-monotonic percentages), while still handling explicit reset transitions.
- Staleness detection: running state is downgraded to `NOT RUNNING` if the update age exceeds a threshold, preventing a false "active" UI when backend polling stalls.
- Workflow state machine: derives `cooling -> thinking -> streaming -> evaluating -> done` from the server task lifecycle and cooldown metadata, including live timers.
- Cross-source merge: combines the status payload, dataset payload, and last-task payload into one coherent presentation layer (cards, pills, streaming panel, charts).
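Two of these guards are easy to make concrete. The sketch below shows staleness downgrading and the monotonic-progress guard as pure functions; the 10-second threshold and function names are assumptions chosen for illustration:

```python
import time

STALE_AFTER_S = 10.0  # illustrative threshold; the real value is an assumption

def effective_status(reported_status, last_update_ts, now=None):
    """Downgrade RUNNING to NOT RUNNING when the payload has gone stale,
    so a stalled backend poll cannot leave the UI looking active."""
    now = time.time() if now is None else now
    if reported_status == "RUNNING" and now - last_update_ts > STALE_AFTER_S:
        return "NOT RUNNING"
    return reported_status

def guarded_progress(previous_pct, incoming_pct, run_restarted=False):
    """Ignore non-monotonic percentages from transient payloads, but
    accept a drop when an explicit restart/reset was detected."""
    if run_restarted:
        return incoming_pct
    return max(previous_pct, incoming_pct)
```

Keeping these as pure functions of (previous state, incoming payload) makes the reconciliation logic testable without a live run.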
UI composition
- Realtime telemetry cards: tokens/sec, file descriptors, CPU, GPU, GPU temperature, disk I/O with bounded local history and adaptive chart scaling.
- Task-list layer: model overview pills, datasets overview pills, and current dataset task pills with status coloring and live-task highlighting.
- Streaming details layer: shows active request text, partial response, evaluation outcome, and transitional states (`WAITING`, `EVALUATING`, `DONE`).
- Operator controls: persisted local preferences for hiding/showing top cards, pill groups, and docs chrome, optimized for long monitoring sessions.
Operational intent: use this page during execution and incident analysis. Use guided/Claude reports for post-run interpretation and decision summaries.
Analytics Dashboard
Page: /dashboard (served by dashboard_routes.py). This is an admin-only analytics surface over recorded HTTP events.
- Access control: route and API are protected by admin allowlist checks; non-admin requests return `403`.
- Data API: `/api/dashboard/analytics/summary` with an optional `limit` parameter (default behavior in the UI: last 500 events).
- Fetch model: snapshot fetch on load and on user-triggered refresh/filter changes (no fixed auto-refresh interval).
- Derived analytics: unique users/IPs, active users (10m), country aggregation, path-group decomposition, per-user activity, and paginated event table.
- Filter composition: country, API group, and username filters can stack; optional "hide local/private IPs" filter rewrites all panel totals from the same filtered subset.
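Stacking filters can be sketched as a single pass over the recorded events. The event field names below mirror the analytics schema described later in this document; the filter logic itself is an illustration, not the dashboard's actual code:

```python
import ipaddress

def is_local_or_private(ip):
    """True for loopback and RFC 1918 addresses (the 'hide local' filter)."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False
    return addr.is_private or addr.is_loopback

def filter_events(events, country=None, group=None, username=None,
                  hide_local_ips=False):
    """All active filters stack; every panel recomputes from the result."""
    out = []
    for e in events:
        if country and e.get("country") != country:
            continue
        if group and e.get("group_label") != group:
            continue
        if username and e.get("username") != username:
            continue
        if hide_local_ips and is_local_or_private(e.get("ip", "")):
            continue
        out.append(e)
    return out
```

Because every panel derives from the same filtered subset, toggling one filter consistently rewrites all totals at once.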
Project Structure
local-chat/
├── app.py # Flask entry point, auth middleware, blueprint registration
├── src/
│ ├── api/ # Route blueprints
│ │ ├── chat_routes.py # Streaming chat, mode prompts, metrics
│ │ ├── session_routes.py # Session CRUD operations
│ │ ├── stt.py # Whisper speech-to-text endpoint
│ │ ├── tts.py # Coqui text-to-speech endpoint
│ │ ├── auth_routes.py # Login/logout, token management
│ │ ├── dashboard_routes.py # Dashboard analytics API
│ │ └── analytics_routes.py # Client action tracking
│ ├── core/ # Configuration and utilities
│ │ ├── config.py # Environment variables, model allowlist
│ │ ├── auth.py # Token storage and validation
│ │ ├── analytics.py # SQLite analytics helpers
│ │ └── logging.py # Structured logging setup
│ ├── services/ # Business logic layer
│ │ ├── session.py # Session storage and persistence
│ │ ├── metadata.py # AI title/summary generation, idle scheduler
│ │ ├── ollama.py # LLM client helpers
│ │ ├── geoip.py # Country lookup service
│ │ └── gpu.py # GPU utilization reading
│ └── audio/ # Speech processing
│ ├── common.py # Shared audio utilities
│ └── tts/ # TTS runtime, chunking, normalization
├── static/ # Frontend assets
│ ├── index.html # Main SPA shell
│ ├── dashboard.html # Analytics dashboard
│ ├── js/ # JavaScript modules
│ ├── css/ # Theme and UI styles
│ └── docs/ # Documentation pages
├── benchmark/ # Benchmark framework and assets
│ ├── run_benchmark.py # Benchmark runner entry point
│ ├── config*.yaml # Benchmark scope/config presets
│ ├── datasets/ # Task datasets (chat + shortform)
│ ├── magistral/ # Benchmark scripts and model-specific runs
│ ├── results/ # Generated benchmark outputs
│ └── archive/ # Historical benchmark docs/experiments
├── chats/ # Session storage (JSON files per user)
├── db/ # Analytics SQLite + GeoIP database
├── log/ # Server logs
└── deploy/ # Deployment scripts and Caddyfile
API Reference
All /api/* endpoints require Bearer token authentication (except /api/login). Streaming endpoints return plain text or NDJSON.
| Endpoint | Method | Purpose |
|---|---|---|
| /api/login | POST | Authenticate and receive token |
| /api/stream | POST | Stream chat completion |
| /api/chat | POST | Non-streaming chat |
| /api/stop | POST | Cancel active stream |
| /api/metrics | GET | Retrieve stream metrics (TTFT, tok/s, tokens) |
| /api/sessions | GET | List user sessions |
| /api/session | GET/DELETE | Load or delete session |
| /api/save | POST | Persist session to disk |
| /api/stt | POST | Transcribe audio |
| /api/tts/speak | POST | Synthesize speech (NDJSON audio stream) |
| /api/gpu | GET | GPU utilization |
| /api/benchmark/status | GET | Current benchmark run status and scope |
| /api/benchmark/datasets | GET | Dataset-level progress and task completion summary |
| /api/benchmark/last_task | GET | Last executed task details for monitor display |
| /api/dashboard/analytics/summary | GET | Admin analytics snapshot for dashboard panels (supports ?limit=) |
| /config | GET | Runtime configuration |
| /health | GET | Ollama availability check |
Inference Pipeline
The inference pipeline is deterministic and fully local. It assembles a prompt, streams tokens, sanitizes hidden reasoning, and renders rich UI output.
Step objectives:
- Language detection: Keep system prompt in the user's language for stable style.
- System prompt: Enforce mode constraints and block chain-of-thought output.
- Language guard: Prevent drift in multilingual conversations.
- Streaming proxy: Preserve real-time UX while isolating Ollama.
- Sanitization: Remove hidden reasoning before the UI sees it.
- UI render: Convert Markdown + LaTeX into formatted output with MathJax.
- Metrics: Capture TTFT, tokens/sec, and totals for diagnostics.
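The sanitization step above can be sketched as a stateful filter over the token stream. Because `<think>` tags can be split across chunk boundaries, the stripper holds back a small tail of un-emitted text; this class is an illustration of the technique, not the project's actual implementation:

```python
class ThinkStripper:
    """Strip hidden <think>...</think> blocks from a streamed response
    before it reaches the client, even when tags span chunk boundaries."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self):
        self.buf = ""
        self.inside = False

    def feed(self, chunk):
        self.buf += chunk
        out = []
        while True:
            if self.inside:
                i = self.buf.find(self.CLOSE)
                if i == -1:
                    # Discard hidden reasoning, but keep a tail that
                    # could be the start of a split closing tag.
                    self.buf = self.buf[-len(self.CLOSE):]
                    break
                self.buf = self.buf[i + len(self.CLOSE):]
                self.inside = False
            else:
                i = self.buf.find(self.OPEN)
                if i == -1:
                    # Emit everything except a tail that could begin a tag.
                    safe = len(self.buf) - (len(self.OPEN) - 1)
                    if safe > 0:
                        out.append(self.buf[:safe])
                        self.buf = self.buf[safe:]
                    break
                out.append(self.buf[:i])
                self.buf = self.buf[i + len(self.OPEN):]
                self.inside = True
        return "".join(out)

    def flush(self):
        out = "" if self.inside else self.buf
        self.buf = ""
        return out
```

Feeding each Ollama chunk through `feed()` and calling `flush()` at end-of-stream guarantees no reasoning text ever leaves the server, at the cost of delaying at most a few characters of visible output.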
Mode prompts: Each mode injects a localized system prompt that controls response length and format:
- Fast: "Answer in ≤2 plain sentences" — for quick factual queries.
- Normal: "Reply plainly in ≤8 sentences; no tables" — balanced responses.
- Deep: "Explain thoroughly with brief lists when useful" — detailed explanations.
Model routing: The backend enforces a model allowlist configured via environment variables. Users can select from curated models in the UI, but unauthorized models are rejected.
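Allowlist enforcement can be sketched in a few lines. The `MODEL_ALLOWLIST` variable name and comma-separated format are assumptions (the document says only "configured via environment variables"):

```python
import os

def allowed_models():
    """Parse the allowlist from the environment.

    MODEL_ALLOWLIST as a comma-separated variable is a hypothetical
    name for illustration; the project's actual variable may differ.
    """
    raw = os.environ.get("MODEL_ALLOWLIST", "gpt-oss:20b,qwen3:8b")
    return {m.strip() for m in raw.split(",") if m.strip()}

def resolve_model(requested, default="gpt-oss:20b"):
    """Return the model to use, rejecting anything off the allowlist."""
    model = requested or default
    if model not in allowed_models():
        raise PermissionError(f"model not allowed: {model}")
    return model
```

Checking on the server rather than trusting the UI's model picker is what makes the curation enforceable.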
Neural Speech
Whisper STT: The model converts audio into log-mel spectrograms, processes them through a Transformer encoder, then decodes tokens autoregressively. Configurable model size (tiny → large), CPU or GPU inference, beam search with temperature, and automatic language detection.
Coqui TTS: Text is normalized and chunked, then an acoustic model predicts mel spectrograms from character/phoneme embeddings, and a neural vocoder converts the spectrograms to raw audio waveforms. Quality presets (normal/better/best), multi-language voice selection, and NDJSON streaming for low latency.
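The "normalized and chunked" step is what enables low-latency streaming: synthesis can start on the first chunk while the rest of the reply is still arriving. A minimal chunker might pack sentences under a size cap; the regex and 200-character default here are assumptions, not the project's settings:

```python
import re

def chunk_text(text, max_chars=200):
    """Split text at sentence boundaries, then pack sentences into
    chunks no longer than max_chars for incremental TTS synthesis."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized and emitted as one NDJSON WAV record, so audible playback begins well before the full text is processed.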
| Mode | STT Path | TTS Path | Latency |
|---|---|---|---|
| Browser | Web Speech API | SpeechSynthesis API | ~50ms |
| Server | Whisper neural model | Coqui neural vocoder | ~500-2000ms |
Data Model
Session Object: Each chat session is stored as a JSON file:
{
"id": "abc123def456",
"owner": "username",
"title": "User-defined title",
"title_ai": "AI-generated title",
"summary_ai": "Brief conversation summary",
"pinned": false,
"updated_at": "2025-12-30T10:30:00+00:00",
"messages": [
{ "role": "user", "content": "Hello!", "language_ai": "en" },
{ "role": "assistant", "content": "Hi there! How can I help?" }
]
}
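Persistence over this session object can be sketched as a pair of helpers. The `chats/<owner>/<id>.json` layout is an assumption (the document says only "JSON files per user"):

```python
import json
import os
from datetime import datetime, timezone

def save_session(root, session):
    """Write one session object (shaped like the example above) to disk."""
    session["updated_at"] = datetime.now(timezone.utc).isoformat()
    user_dir = os.path.join(root, session["owner"])
    os.makedirs(user_dir, exist_ok=True)
    path = os.path.join(user_dir, f"{session['id']}.json")
    with open(path, "w") as f:
        json.dump(session, f, ensure_ascii=False, indent=2)
    return path

def load_session(root, owner, session_id):
    # Per-user directories give the chat isolation noted in the
    # security model: a user's reads are scoped to their own folder.
    path = os.path.join(root, owner, f"{session_id}.json")
    with open(path) as f:
        return json.load(f)
```

One file per session keeps saves cheap (rewrite a single small file) and makes backup or migration a plain directory copy.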
Analytics Schema: Request telemetry logged to SQLite:
CREATE TABLE analytics (
id INTEGER PRIMARY KEY,
ts TEXT, -- ISO timestamp
username TEXT, -- Authenticated user
method TEXT, -- HTTP method
path TEXT, -- Request path
ip TEXT, -- Client IP
country TEXT, -- GeoIP country code
user_agent TEXT, -- Raw UA string
ua_browser TEXT, -- Parsed browser
ua_os TEXT, -- Parsed OS
ua_device TEXT, -- Device type
group_label TEXT, -- Action category
subgroup_label TEXT -- Action detail
);
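The schema above can be exercised directly with an in-memory database. The `record_event` helper is illustrative, not one of the project's actual analytics helpers:

```python
import sqlite3

# Same columns as the schema above, condensed onto fewer lines.
SCHEMA = """CREATE TABLE analytics (
    id INTEGER PRIMARY KEY, ts TEXT, username TEXT, method TEXT, path TEXT,
    ip TEXT, country TEXT, user_agent TEXT, ua_browser TEXT, ua_os TEXT,
    ua_device TEXT, group_label TEXT, subgroup_label TEXT
)"""

def record_event(conn, **fields):
    """Insert one event row; absent columns stay NULL."""
    cols = ", ".join(fields)
    marks = ", ".join("?" for _ in fields)
    conn.execute(f"INSERT INTO analytics ({cols}) VALUES ({marks})",
                 tuple(fields.values()))

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
record_event(conn, ts="2025-12-30T10:30:00+00:00", username="alice",
             method="POST", path="/api/stream", country="US",
             group_label="chat")

# The dashboard's country aggregation is a plain GROUP BY over this table.
rows = conn.execute(
    "SELECT country, COUNT(*) FROM analytics GROUP BY country").fetchall()
```

Every dashboard panel (country breakdown, path groups, per-user activity) reduces to a similar aggregate query over this one table.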
Model Catalog
The UI exposes a curated selection of LLMs optimized for different use cases:
| Model | Parameters | Architecture | Owner | Best For |
|---|---|---|---|---|
| gemma3:4b | 4.3B | Dense | Google | Speed-critical interactive chat |
| deepseek-r1:8b | 8.2B | Dense | DeepSeek | Reasoning and analysis |
| qwen3:8b | 8.2B | Dense | Alibaba | Balanced general-purpose |
| gpt-oss:20b | 20.9B | MoE | OpenAI | Quality content generation |
| magistral:24b | 23.6B | Dense | Mistral AI | Multilingual specialist |
Security Model
- Authentication: Basic auth or Bearer tokens with configurable TTL. Tokens stored in-memory with automatic expiration. Guest mode available with rate limits.
- Authorization: Admin-only routes for analytics dashboard. Non-admin users restricted to allowed modes and models. Per-user chat isolation.
- Local Inference: Ollama binds to localhost only. No data leaves the machine unless explicitly configured. All models run on-device.
- TLS Termination: Caddy handles HTTPS with automatic certificate management. All external traffic encrypted in transit.
Configuration
Key environment variables for customizing deployment:
| Variable | Default | Description |
|---|---|---|
| OLLAMA_URL | http://127.0.0.1:11434 | Ollama server endpoint |
| MODEL | gpt-oss:20b | Default LLM model |
| SUMMARY_MODEL | gpt-oss:20b | Model for metadata generation |
| STT_MODE | browser | Speech-to-text mode (browser/whisper) |
| TTS_MODE | browser | Text-to-speech mode (browser/coqui) |
| WHISPER_MODEL | base | Whisper model size |
| WHISPER_DEVICE | cpu | Whisper inference device |
| TTS_QUALITY_DEFAULT | normal | TTS quality preset |
| AUTH_TOKEN_TTL | 86400 | Token lifetime in seconds |
| ECO_MODE | 0 | Enable resource-saving mode |
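A startup loader over these variables might look like the sketch below. The variable names and defaults come from the table; the function itself and its key names are illustrative:

```python
import os

def load_config():
    """Read deployment settings from the environment, falling back to
    the documented defaults. Sketch only; the project's config.py may
    structure this differently."""
    return {
        "ollama_url": os.environ.get("OLLAMA_URL", "http://127.0.0.1:11434"),
        "model": os.environ.get("MODEL", "gpt-oss:20b"),
        "summary_model": os.environ.get("SUMMARY_MODEL", "gpt-oss:20b"),
        "stt_mode": os.environ.get("STT_MODE", "browser"),
        "tts_mode": os.environ.get("TTS_MODE", "browser"),
        "whisper_model": os.environ.get("WHISPER_MODEL", "base"),
        "whisper_device": os.environ.get("WHISPER_DEVICE", "cpu"),
        "tts_quality": os.environ.get("TTS_QUALITY_DEFAULT", "normal"),
        "auth_token_ttl": int(os.environ.get("AUTH_TOKEN_TTL", "86400")),
        "eco_mode": os.environ.get("ECO_MODE", "0") == "1",
    }
```

Coercing types at the boundary (int for the TTL, bool for the flag) keeps the rest of the code free of string-parsing concerns.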
Dependencies
The stack is built on open-source components:
| Component | Owner | License | Purpose |
|---|---|---|---|
| Ollama | Ollama, Inc. | MIT | Local LLM inference server |
| Flask | Pallets Projects | BSD-3 | Python web framework |
| Whisper | OpenAI | MIT | Speech recognition model |
| Coqui TTS | Coqui AI | MPL-2.0 | Speech synthesis model |
| SQLite | SQLite Project | Public Domain | Embedded database |
| Caddy | Caddy Server | Apache-2.0 | TLS reverse proxy |
| ECharts | Apache Foundation | Apache-2.0 | Analytics visualization |
| MathJax | MathJax Consortium | Apache-2.0 | Math equation rendering |
Licenses may change. Verify current terms in each upstream repository.