Architecture
Technical reference for the local-first AI chat platform.
Overview
Local Chat is a self-hosted AI assistant that runs entirely on your machine. The browser loads a single-page app from Flask, which proxies requests to Ollama for LLM inference and optional neural speech services for voice interaction.
- Flask: Python web server with modular blueprints.
- Ollama: Local LLM inference at localhost:11434.
- Whisper: Neural speech-to-text (optional).
- Coqui TTS: Neural text-to-speech (optional).
- SQLite: Analytics database.
- Caddy: TLS reverse proxy for HTTPS.
High-Level Architecture
The system follows a three-tier architecture: presentation (browser SPA), application (Flask API), and data/inference (Ollama + SQLite + neural speech models).
Request Flow
How a user message travels through the system:
- User input: The composer captures text (or transcribed speech) and builds a message array with mode and model selection.
- API call: The browser streams `POST /api/stream` with messages, mode (fast/normal/deep), and optional model override.
- Language detection: The backend analyzes the latest user message and tags it with an ISO language code (en/es/fr).
- Prompt assembly: Flask injects a localized system prompt based on mode + language, plus a language guard to ensure reply consistency.
- LLM inference: The request proxies to Ollama's streaming endpoint; tokens arrive chunk by chunk.
- Response cleanup: Hidden `<think>` reasoning blocks are stripped in real time before reaching the client.
- Final steps: The UI renders streaming tokens, captures metrics (TTFT, tok/s), and persists the session to disk.
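The client side of the flow above can be sketched as a small payload builder. The field names (`messages`, `mode`, `model`) follow this document's description of the `/api/stream` request, but the exact schema is an assumption, not a verified contract:

```python
import json

def build_stream_payload(messages, mode="normal", model=None):
    """Assemble the JSON body a client would send to POST /api/stream.

    Field names are taken from this document's request-flow description;
    treat them as illustrative, not as the project's actual schema.
    """
    if mode not in ("fast", "normal", "deep"):
        raise ValueError(f"unknown mode: {mode}")
    payload = {"messages": messages, "mode": mode}
    if model is not None:
        payload["model"] = model  # optional per-request model override
    return payload

body = build_stream_payload(
    [{"role": "user", "content": "Hola, ¿qué es Flask?"}],
    mode="fast",
)
print(json.dumps(body, ensure_ascii=False))
```

Language detection then runs server-side on the latest user message, so the client never has to send a language field itself.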
Core Components
Browser SPA — Single-page application served from static/. Handles streaming chat, session sidebar, settings panel, and speech I/O. State persisted in localStorage.
Flask API — Lightweight Python server with modular blueprints. Handles auth, streaming proxy, session CRUD, analytics, and speech endpoints.
Ollama LLM — Local model inference server running on localhost. Streams tokens via /api/chat with curated model allowlist.
Speech-to-Text — Whisper-based neural transcription. Supports browser Web Speech API or server-side Whisper with configurable model/device.
Text-to-Speech — Coqui TTS neural synthesis with quality presets (normal/better/best). Streams chunked audio as NDJSON WAV.
Storage Layer — JSON files for chat sessions, SQLite for analytics, GeoIP database for country lookup, and structured log files.
Edge Proxy — Caddy reverse proxy handles TLS termination, HTTPS certificates, and forwards traffic to Flask on localhost.
Benchmark Subsystem
The benchmark subsystem is a separate execution path from normal chat requests. It runs controlled tasks across models/datasets, persists run state, and exposes read APIs used by docs pages.
- Runner: Executes benchmark scope (models, datasets, tasks), records per-task outcomes, and updates run lifecycle status.
- Run state storage: Persists benchmark telemetry and task status for safe resume/reset behavior.
- Read APIs: Exposes benchmark status/dataset/last-task endpoints consumed by the monitor UI.
- Report layer: `benchmark_guided.html` and `benchmark_autonomous_claude.html` provide post-run analysis views over collected results.
- Live monitor: `benchmark_monitor.html` polls run state and metrics to visualize progress in near real time.
Operational split: reports are static analysis snapshots; monitor is runtime observability for in-progress runs.
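The "safe resume/reset" behavior of the run-state storage hinges on atomic writes. The sketch below illustrates the technique; the file layout, field names, and `RunState` class are hypothetical, not the project's actual implementation:

```python
import json
import os

class RunState:
    """Illustrative run-state store for resumable benchmark runs."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return {"status": "idle", "completed": []}
        with open(self.path) as f:
            return json.load(f)

    def save(self, state):
        # Write to a temp file and rename: os.replace is atomic, so a
        # crash mid-write never leaves a half-written state file. This
        # is what makes resume after interruption safe.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.path)

    def mark_done(self, task_id):
        state = self.load()
        if task_id not in state["completed"]:
            state["completed"].append(task_id)
        state["status"] = "running"
        self.save(state)
```

On restart, the runner can skip any task already listed in `completed`, which turns a crash into a pause rather than a lost run.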
Live Benchmark Monitor
Page: /docs/benchmark_monitor.html. This is a read-only runtime observability console. It is intentionally more complex than the dashboard because it must track in-flight execution, not only aggregated history.
Data channels and cadence
- Run/state channel: `/api/benchmark/status` is polled at high frequency (a 200ms UI loop with request throttling) and is the source of truth for run identity, lifecycle, scope, and rolling metrics.
- Dataset/task channel: `/api/benchmark/datasets` is polled every ~3s for per-dataset progress and task-level pills.
- Last-task channel: `/api/benchmark/last_task` is polled every ~3s for detailed prompt/response/evaluation context, with "keep last known task" behavior to avoid UI flicker.
- Fallback channels: `/api/gpu` (~5s) and `/api/temperature` (~10s) are used when those samples are missing from benchmark state payloads.
State model and reconciliation
- Run identity gate: the monitor keys state by `run_id` plus start timestamp and resets charts/pills when a new run or restart is detected.
- Progress guards: protects against regressions from transient payloads (e.g., non-monotonic percentages), while still handling explicit reset transitions.
- Staleness detection: running state is downgraded to `NOT RUNNING` if the update age exceeds a threshold, preventing a false "active" UI when backend polling stalls.
- Workflow state machine: derives `cooling -> thinking -> streaming -> evaluating -> done` from the server task lifecycle and cooldown metadata, including live timers.
- Cross-source merge: combines the status payload, dataset payload, and last-task payload into one coherent presentation layer (cards, pills, streaming panel, charts).
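Two of these guards are easy to make concrete. The sketch below shows staleness downgrading and the monotonic-progress guard as pure functions; the 10-second threshold and function names are assumptions chosen for illustration:

```python
import time

STALE_AFTER_S = 10.0  # illustrative threshold; the real value is an assumption

def effective_status(reported_status, last_update_ts, now=None):
    """Downgrade RUNNING to NOT RUNNING when the payload has gone stale,
    so a stalled backend poll cannot leave the UI looking active."""
    now = time.time() if now is None else now
    if reported_status == "RUNNING" and now - last_update_ts > STALE_AFTER_S:
        return "NOT RUNNING"
    return reported_status

def guarded_progress(previous_pct, incoming_pct, run_restarted=False):
    """Ignore non-monotonic percentages from transient payloads, but
    accept a drop when an explicit restart/reset was detected."""
    if run_restarted:
        return incoming_pct
    return max(previous_pct, incoming_pct)
```

Keeping these as pure functions of (previous state, incoming payload) makes the reconciliation logic testable without a live run.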
UI composition
- Realtime telemetry cards: tokens/sec, file descriptors, CPU, GPU, GPU temperature, disk I/O with bounded local history and adaptive chart scaling.
- Task-list layer: model overview pills, datasets overview pills, and current dataset task pills with status coloring and live-task highlighting.
- Streaming details layer: shows active request text, partial response, evaluation outcome, and transitional states (`WAITING`, `EVALUATING`, `DONE`).
- Operator controls: persisted local preferences for hiding/showing top cards, pill groups, and docs chrome, optimized for long monitoring sessions.
Operational intent: use this page during execution and incident analysis. Use guided/Claude reports for post-run interpretation and decision summaries.
Analytics Dashboard
Page: /dashboard (served by dashboard_routes.py). This is an admin-only analytics surface over recorded HTTP events.
- Access control: route and API are protected by admin allowlist checks; non-admin requests return `403`.
- Data API: `/api/dashboard/analytics/summary` with an optional `limit` parameter (default behavior in the UI: last 500 events).
- Fetch model: snapshot fetch on load and on user-triggered refresh/filter changes (no fixed auto-refresh interval).
- Derived analytics: unique users/IPs, active users (10m), country aggregation, path-group decomposition, per-user activity, and paginated event table.
- Filter composition: country, API group, and username filters can stack; optional "hide local/private IPs" filter rewrites all panel totals from the same filtered subset.
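Stacking filters can be sketched as a single pass over the recorded events. The event field names below mirror the analytics schema described later in this document; the filter logic itself is an illustration, not the dashboard's actual code:

```python
import ipaddress

def is_local_or_private(ip):
    """True for loopback and RFC 1918 addresses (the 'hide local' filter)."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False
    return addr.is_private or addr.is_loopback

def filter_events(events, country=None, group=None, username=None,
                  hide_local_ips=False):
    """All active filters stack; every panel recomputes from the result."""
    out = []
    for e in events:
        if country and e.get("country") != country:
            continue
        if group and e.get("group_label") != group:
            continue
        if username and e.get("username") != username:
            continue
        if hide_local_ips and is_local_or_private(e.get("ip", "")):
            continue
        out.append(e)
    return out
```

Because every panel derives from the same filtered subset, toggling one filter consistently rewrites all totals at once.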
Project Structure
local-chat/
├── app.py # Flask entry point, auth middleware, blueprint registration
├── src/
│ ├── api/ # Route blueprints
│ │ ├── chat_routes.py # Streaming chat, mode prompts, metrics
│ │ ├── session_routes.py # Session CRUD operations
│ │ ├── stt.py # Whisper speech-to-text endpoint
│ │ ├── tts.py # Coqui text-to-speech endpoint
│ │ ├── auth_routes.py # Login/logout, token management
│ │ ├── dashboard_routes.py # Dashboard analytics API
│ │ └── analytics_routes.py # Client action tracking
│ ├── core/ # Configuration and utilities
│ │ ├── config.py # Environment variables, model allowlist
│ │ ├── auth.py # Token storage and validation
│ │ ├── analytics.py # SQLite analytics helpers
│ │ └── logging.py # Structured logging setup
│ ├── services/ # Business logic layer
│ │ ├── session.py # Session storage and persistence
│ │ ├── metadata.py # AI title/summary generation, idle scheduler
│ │ ├── ollama.py # LLM client helpers
│ │ ├── geoip.py # Country lookup service
│ │ └── gpu.py # GPU utilization reading
│ └── audio/ # Speech processing
│ ├── common.py # Shared audio utilities
│ └── tts/ # TTS runtime, chunking, normalization
├── static/ # Frontend assets
│ ├── index.html # Main SPA shell
│ ├── dashboard.html # Analytics dashboard
│ ├── js/ # JavaScript modules
│ ├── css/ # Theme and UI styles
│ └── docs/ # Documentation pages
├── benchmark/ # Benchmark framework and assets
│ ├── run_benchmark.py # Benchmark runner entry point
│ ├── config*.yaml # Benchmark scope/config presets
│ ├── datasets/ # Task datasets (chat + shortform)
│ ├── magistral/ # Benchmark scripts and model-specific runs
│ ├── results/ # Generated benchmark outputs
│ └── archive/ # Historical benchmark docs/experiments
├── chats/ # Session storage (JSON files per user)
├── db/ # Analytics SQLite + GeoIP database
├── log/ # Server logs
└── deploy/ # Deployment scripts and Caddyfile
API Reference
All /api/* endpoints require Bearer token authentication (except /api/login). Streaming endpoints return plain text or NDJSON.
| Endpoint | Method | Purpose |
|---|---|---|
| /api/login | POST | Authenticate and receive token |
| /api/stream | POST | Stream chat completion |
| /api/chat | POST | Non-streaming chat |
| /api/stop | POST | Cancel active stream |
| /api/metrics | GET | Retrieve stream metrics (TTFT, tok/s, tokens) |
| /api/sessions | GET | List user sessions |
| /api/session | GET/DELETE | Load or delete session |
| /api/save | POST | Persist session to disk |
| /api/stt | POST | Transcribe audio |
| /api/tts/speak | POST | Synthesize speech (NDJSON audio stream) |
| /api/gpu | GET | GPU utilization |
| /api/benchmark/status | GET | Current benchmark run status and scope |
| /api/benchmark/datasets | GET | Dataset-level progress and task completion summary |
| /api/benchmark/last_task | GET | Last executed task details for monitor display |
| /api/dashboard/analytics/summary | GET | Admin analytics snapshot for dashboard panels (supports ?limit=) |
| /config | GET | Runtime configuration |
| /health | GET | Ollama availability check |
Inference Pipeline
The inference pipeline is deterministic and fully local. It assembles a prompt, streams tokens, sanitizes hidden reasoning, and renders rich UI output.
Step objectives:
- Language detection: Keep system prompt in the user's language for stable style.
- System prompt: Enforce mode constraints and block chain-of-thought output.
- Language guard: Prevent drift in multilingual conversations.
- Streaming proxy: Preserve real-time UX while isolating Ollama.
- Sanitization: Remove hidden reasoning before the UI sees it.
- UI render: Convert Markdown + LaTeX into formatted output with MathJax.
- Metrics: Capture TTFT, tokens/sec, and totals for diagnostics.
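The sanitization step above can be sketched as a stateful filter over the token stream. Because `<think>` tags can be split across chunk boundaries, the stripper holds back a small tail of un-emitted text; this class is an illustration of the technique, not the project's actual implementation:

```python
class ThinkStripper:
    """Strip hidden <think>...</think> blocks from a streamed response
    before it reaches the client, even when tags span chunk boundaries."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self):
        self.buf = ""
        self.inside = False

    def feed(self, chunk):
        self.buf += chunk
        out = []
        while True:
            if self.inside:
                i = self.buf.find(self.CLOSE)
                if i == -1:
                    # Discard hidden reasoning, but keep a tail that
                    # could be the start of a split closing tag.
                    self.buf = self.buf[-len(self.CLOSE):]
                    break
                self.buf = self.buf[i + len(self.CLOSE):]
                self.inside = False
            else:
                i = self.buf.find(self.OPEN)
                if i == -1:
                    # Emit everything except a tail that could begin a tag.
                    safe = len(self.buf) - (len(self.OPEN) - 1)
                    if safe > 0:
                        out.append(self.buf[:safe])
                        self.buf = self.buf[safe:]
                    break
                out.append(self.buf[:i])
                self.buf = self.buf[i + len(self.OPEN):]
                self.inside = True
        return "".join(out)

    def flush(self):
        out = "" if self.inside else self.buf
        self.buf = ""
        return out
```

Feeding each Ollama chunk through `feed()` and calling `flush()` at end-of-stream guarantees no reasoning text ever leaves the server, at the cost of delaying at most a few characters of visible output.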
Mode prompts: Each mode injects a localized system prompt that controls response length and format:
- Fast: "Answer in ≤2 plain sentences" — for quick factual queries.
- Normal: "Reply plainly in ≤8 sentences; no tables" — balanced responses.
- Deep: "Explain thoroughly with brief lists when useful" — detailed explanations.
Model routing: The backend enforces a model allowlist configured via environment variables. Users can select from curated models in the UI, but unauthorized models are rejected.
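Allowlist enforcement can be sketched in a few lines. The `MODEL_ALLOWLIST` variable name and comma-separated format are assumptions (the document says only "configured via environment variables"):

```python
import os

def allowed_models():
    """Parse the allowlist from the environment.

    MODEL_ALLOWLIST as a comma-separated variable is a hypothetical
    name for illustration; the project's actual variable may differ.
    """
    raw = os.environ.get("MODEL_ALLOWLIST", "gpt-oss:20b,qwen3:8b")
    return {m.strip() for m in raw.split(",") if m.strip()}

def resolve_model(requested, default="gpt-oss:20b"):
    """Return the model to use, rejecting anything off the allowlist."""
    model = requested or default
    if model not in allowed_models():
        raise PermissionError(f"model not allowed: {model}")
    return model
```

Checking on the server rather than trusting the UI's model picker is what makes the curation enforceable.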
Neural Speech
Whisper STT: The model converts audio into log-mel spectrograms, processes them through a Transformer encoder, then decodes tokens autoregressively. Configurable model size (tiny → large), CPU or GPU inference, beam search with temperature, and automatic language detection.
Coqui TTS: Text is normalized and chunked, then an acoustic model predicts mel spectrograms from character/phoneme embeddings, and a neural vocoder converts the spectrograms to raw audio waveforms. Quality presets (normal/better/best), multi-language voice selection, and NDJSON streaming for low latency.
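The "normalized and chunked" step is what enables low-latency streaming: synthesis can start on the first chunk while the rest of the reply is still arriving. A minimal chunker might pack sentences under a size cap; the regex and 200-character default here are assumptions, not the project's settings:

```python
import re

def chunk_text(text, max_chars=200):
    """Split text at sentence boundaries, then pack sentences into
    chunks no longer than max_chars for incremental TTS synthesis."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized and emitted as one NDJSON WAV record, so audible playback begins well before the full text is processed.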
| Mode | STT Path | TTS Path | Latency |
|---|---|---|---|
| Browser | Web Speech API | SpeechSynthesis API | ~50ms |
| Server | Whisper neural model | Coqui neural vocoder | ~500-2000ms |
Data Model
Session Object: Each chat session is stored as a JSON file:
{
"id": "abc123def456",
"owner": "username",
"title": "User-defined title",
"title_ai": "AI-generated title",
"summary_ai": "Brief conversation summary",
"pinned": false,
"updated_at": "2025-12-30T10:30:00+00:00",
"messages": [
{ "role": "user", "content": "Hello!", "language_ai": "en" },
{ "role": "assistant", "content": "Hi there! How can I help?" }
]
}
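Persistence over this session object can be sketched as a pair of helpers. The `chats/<owner>/<id>.json` layout is an assumption (the document says only "JSON files per user"):

```python
import json
import os
from datetime import datetime, timezone

def save_session(root, session):
    """Write one session object (shaped like the example above) to disk."""
    session["updated_at"] = datetime.now(timezone.utc).isoformat()
    user_dir = os.path.join(root, session["owner"])
    os.makedirs(user_dir, exist_ok=True)
    path = os.path.join(user_dir, f"{session['id']}.json")
    with open(path, "w") as f:
        json.dump(session, f, ensure_ascii=False, indent=2)
    return path

def load_session(root, owner, session_id):
    # Per-user directories give the chat isolation noted in the
    # security model: a user's reads are scoped to their own folder.
    path = os.path.join(root, owner, f"{session_id}.json")
    with open(path) as f:
        return json.load(f)
```

One file per session keeps saves cheap (rewrite a single small file) and makes backup or migration a plain directory copy.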
Analytics Schema: Request telemetry logged to SQLite:
CREATE TABLE analytics (
id INTEGER PRIMARY KEY,
ts TEXT, -- ISO timestamp
username TEXT, -- Authenticated user
method TEXT, -- HTTP method
path TEXT, -- Request path
ip TEXT, -- Client IP
country TEXT, -- GeoIP country code
user_agent TEXT, -- Raw UA string
ua_browser TEXT, -- Parsed browser
ua_os TEXT, -- Parsed OS
ua_device TEXT, -- Device type
group_label TEXT, -- Action category
subgroup_label TEXT -- Action detail
);
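The schema above can be exercised directly with an in-memory database. The `record_event` helper is illustrative, not one of the project's actual analytics helpers:

```python
import sqlite3

# Same columns as the schema above, condensed onto fewer lines.
SCHEMA = """CREATE TABLE analytics (
    id INTEGER PRIMARY KEY, ts TEXT, username TEXT, method TEXT, path TEXT,
    ip TEXT, country TEXT, user_agent TEXT, ua_browser TEXT, ua_os TEXT,
    ua_device TEXT, group_label TEXT, subgroup_label TEXT
)"""

def record_event(conn, **fields):
    """Insert one event row; absent columns stay NULL."""
    cols = ", ".join(fields)
    marks = ", ".join("?" for _ in fields)
    conn.execute(f"INSERT INTO analytics ({cols}) VALUES ({marks})",
                 tuple(fields.values()))

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
record_event(conn, ts="2025-12-30T10:30:00+00:00", username="alice",
             method="POST", path="/api/stream", country="US",
             group_label="chat")

# The dashboard's country aggregation is a plain GROUP BY over this table.
rows = conn.execute(
    "SELECT country, COUNT(*) FROM analytics GROUP BY country").fetchall()
```

Every dashboard panel (country breakdown, path groups, per-user activity) reduces to a similar aggregate query over this one table.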
Model Catalog
The UI exposes a curated selection of LLMs optimized for different use cases:
| Model | Parameters | Architecture | Owner | Best For |
|---|---|---|---|---|
| gemma3:4b | 4.3B | Dense | Google | Speed-critical interactive chat |
| deepseek-r1:8b | 8.2B | Dense | DeepSeek | Reasoning and analysis |
| qwen3:8b | 8.2B | Dense | Alibaba | Balanced general-purpose |
| gpt-oss:20b | 20.9B | MoE | OpenAI | Quality content generation |
| magistral:24b | 23.6B | Dense | Mistral AI | Multilingual specialist |
Security Model
- Authentication: Basic auth or Bearer tokens with configurable TTL. Tokens stored in-memory with automatic expiration. Guest mode available with rate limits.
- Authorization: Admin-only routes for analytics dashboard. Non-admin users restricted to allowed modes and models. Per-user chat isolation.
- Local Inference: Ollama binds to localhost only. No data leaves the machine unless explicitly configured. All models run on-device.
- TLS Termination: Caddy handles HTTPS with automatic certificate management. All external traffic encrypted in transit.
Configuration
Key environment variables for customizing deployment:
| Variable | Default | Description |
|---|---|---|
| OLLAMA_URL | http://127.0.0.1:11434 | Ollama server endpoint |
| MODEL | gpt-oss:20b | Default LLM model |
| SUMMARY_MODEL | gpt-oss:20b | Model for metadata generation |
| STT_MODE | browser | Speech-to-text mode (browser/whisper) |
| TTS_MODE | browser | Text-to-speech mode (browser/coqui) |
| WHISPER_MODEL | base | Whisper model size |
| WHISPER_DEVICE | cpu | Whisper inference device |
| TTS_QUALITY_DEFAULT | normal | TTS quality preset |
| AUTH_TOKEN_TTL | 86400 | Token lifetime in seconds |
| ECO_MODE | 0 | Enable resource-saving mode |
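A startup loader over these variables might look like the sketch below. The variable names and defaults come from the table; the function itself and its key names are illustrative:

```python
import os

def load_config():
    """Read deployment settings from the environment, falling back to
    the documented defaults. Sketch only; the project's config.py may
    structure this differently."""
    return {
        "ollama_url": os.environ.get("OLLAMA_URL", "http://127.0.0.1:11434"),
        "model": os.environ.get("MODEL", "gpt-oss:20b"),
        "summary_model": os.environ.get("SUMMARY_MODEL", "gpt-oss:20b"),
        "stt_mode": os.environ.get("STT_MODE", "browser"),
        "tts_mode": os.environ.get("TTS_MODE", "browser"),
        "whisper_model": os.environ.get("WHISPER_MODEL", "base"),
        "whisper_device": os.environ.get("WHISPER_DEVICE", "cpu"),
        "tts_quality": os.environ.get("TTS_QUALITY_DEFAULT", "normal"),
        "auth_token_ttl": int(os.environ.get("AUTH_TOKEN_TTL", "86400")),
        "eco_mode": os.environ.get("ECO_MODE", "0") == "1",
    }
```

Coercing types at the boundary (int for the TTL, bool for the flag) keeps the rest of the code free of string-parsing concerns.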
Dependencies
The stack is built on open-source components:
| Component | Owner | License | Purpose |
|---|---|---|---|
| Ollama | Ollama, Inc. | MIT | Local LLM inference server |
| Flask | Pallets Projects | BSD-3 | Python web framework |
| Whisper | OpenAI | MIT | Speech recognition model |
| Coqui TTS | Coqui AI | MPL-2.0 | Speech synthesis model |
| SQLite | SQLite Project | Public Domain | Embedded database |
| Caddy | Caddy Server | Apache-2.0 | TLS reverse proxy |
| ECharts | Apache Foundation | Apache-2.0 | Analytics visualization |
| MathJax | MathJax Consortium | Apache-2.0 | Math equation rendering |
Licenses may change. Verify current terms in each upstream repository.