Architecture

Technical reference for the local-first AI chat platform.

Overview

Local Chat is a self-hosted AI assistant that runs entirely on your machine. The browser loads a single-page app from Flask, which proxies requests to Ollama for LLM inference and optional neural speech services for voice interaction.

High-Level Architecture

The system follows a three-tier architecture: presentation (browser SPA), application (Flask API), and data/inference (Ollama + SQLite + neural speech models).

  - Presentation (Browser SPA): streaming chat UI, session management, settings and themes, Web Speech, MathJax, Markdown rendering.
  - Edge (Caddy proxy): TLS termination, HTTPS certificates, reverse proxy over HTTPS to the Flask backend.
  - Application (Flask API server): authentication middleware, streaming chat proxy, session CRUD, mode prompts, language detection, analytics capture, STT/TTS orchestration, metadata scheduler, analytics dashboard, GPU utilization, GeoIP lookup, rate limiting.
  - Data & Inference: Ollama LLM (6 curated models, streaming /api/chat, model allowlist, localhost:11434); Neural Speech (Whisper STT, Coqui TTS, quality presets, multi-language); Storage (chats/*.json sessions, SQLite analytics, GeoIP database, structured logs).

Request Flow

How a user message travels through the system:

  1. User input: The composer captures text (or transcribed speech) and builds a message array with mode and model selection.
  2. API call: The browser streams POST /api/stream with messages, mode (fast/normal/deep), and optional model override.
  3. Language detection: The backend analyzes the latest user message and tags it with an ISO language code (en/es/fr).
  4. Prompt assembly: Flask injects a localized system prompt based on mode + language, plus a language guard to ensure reply consistency.
  5. LLM inference: The request proxies to Ollama's streaming endpoint; tokens arrive chunk by chunk.
  6. Response cleanup: Hidden <think> reasoning blocks are stripped in real time before they reach the client.
  7. Final steps: The UI renders streaming tokens, captures metrics (TTFT, tok/s), and persists the session to disk.
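
Step 6 above can be sketched as a small helper. The actual tag handling in chat_routes.py may differ; treat this as an illustrative approximation:

```python
import re

# Matches a complete <think>...</think> reasoning block, including newlines.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_think(text: str) -> str:
    """Remove hidden reasoning blocks before the text reaches the client."""
    return THINK_RE.sub("", text)
```

In a streaming context the server must also buffer partial tags across chunk boundaries; this sketch only handles complete blocks.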

Core Components

Browser SPA — Single-page application served from static/. Handles streaming chat, session sidebar, settings panel, and speech I/O. State persisted in localStorage.

Flask API — Lightweight Python server with modular blueprints. Handles auth, streaming proxy, session CRUD, analytics, and speech endpoints.

Ollama LLM — Local model inference server running on localhost. Streams tokens via /api/chat with curated model allowlist.
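
Ollama's streaming /api/chat endpoint emits one JSON object per line. A minimal parser for those chunks might look like this, following the field names in Ollama's documented response shape:

```python
import json

def parse_ollama_chunk(line: str) -> tuple[str, bool]:
    """Extract the token text and the done flag from one NDJSON chunk."""
    obj = json.loads(line)
    content = obj.get("message", {}).get("content", "")
    return content, obj.get("done", False)
```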

Speech-to-Text — Whisper-based neural transcription. Supports browser Web Speech API or server-side Whisper with configurable model/device.

Text-to-Speech — Coqui TTS neural synthesis with quality presets (normal/better/best). Streams chunked audio as NDJSON WAV.

Storage Layer — JSON files for chat sessions, SQLite for analytics, GeoIP database for country lookup, and structured log files.

Edge Proxy — Caddy reverse proxy handles TLS termination, HTTPS certificates, and forwards traffic to Flask on localhost.
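
The edge proxy configuration can be as small as a few lines. A minimal Caddyfile sketch; the hostname and Flask port below are placeholders, not the project's actual values:

```
chat.example.com {
    reverse_proxy 127.0.0.1:5000
}
```

Caddy obtains and renews the HTTPS certificate for the named host automatically, which is what provides the TLS termination described above.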

Benchmark Subsystem

The benchmark subsystem is a separate execution path from normal chat requests. It runs controlled tasks across models/datasets, persists run state, and exposes read APIs used by docs pages.

Operational split: reports are static analysis snapshots; monitor is runtime observability for in-progress runs.

Live Benchmark Monitor

Page: /docs/benchmark_monitor.html. This is a read-only runtime observability console. It is intentionally more complex than the dashboard because it must track in-flight execution, not only aggregated history.

The page is organized around three concerns:

  1. Data channels and cadence
  2. State model and reconciliation
  3. UI composition

Operational intent: use this page during execution and incident analysis. Use guided/Claude reports for post-run interpretation and decision summaries.

User guide: Live Benchmark Monitor

Analytics Dashboard

Page: /dashboard (served by dashboard_routes.py). This is an admin-only analytics surface over recorded HTTP events.

User guide: Analytics Dashboard

Project Structure

local-chat/
├── app.py                     # Flask entry point, auth middleware, blueprint registration
├── src/
│   ├── api/                   # Route blueprints
│   │   ├── chat_routes.py     # Streaming chat, mode prompts, metrics
│   │   ├── session_routes.py  # Session CRUD operations
│   │   ├── stt.py             # Whisper speech-to-text endpoint
│   │   ├── tts.py             # Coqui text-to-speech endpoint
│   │   ├── auth_routes.py     # Login/logout, token management
│   │   ├── dashboard_routes.py # Dashboard analytics API
│   │   └── analytics_routes.py # Client action tracking
│   ├── core/                  # Configuration and utilities
│   │   ├── config.py          # Environment variables, model allowlist
│   │   ├── auth.py            # Token storage and validation
│   │   ├── analytics.py       # SQLite analytics helpers
│   │   └── logging.py         # Structured logging setup
│   ├── services/              # Business logic layer
│   │   ├── session.py         # Session storage and persistence
│   │   ├── metadata.py        # AI title/summary generation, idle scheduler
│   │   ├── ollama.py          # LLM client helpers
│   │   ├── geoip.py           # Country lookup service
│   │   └── gpu.py             # GPU utilization reading
│   └── audio/                 # Speech processing
│       ├── common.py          # Shared audio utilities
│       └── tts/               # TTS runtime, chunking, normalization
├── static/                    # Frontend assets
│   ├── index.html             # Main SPA shell
│   ├── dashboard.html         # Analytics dashboard
│   ├── js/                    # JavaScript modules
│   ├── css/                   # Theme and UI styles
│   └── docs/                  # Documentation pages
├── benchmark/                 # Benchmark framework and assets
│   ├── run_benchmark.py       # Benchmark runner entry point
│   ├── config*.yaml           # Benchmark scope/config presets
│   ├── datasets/              # Task datasets (chat + shortform)
│   ├── magistral/             # Benchmark scripts and model-specific runs
│   ├── results/               # Generated benchmark outputs
│   └── archive/               # Historical benchmark docs/experiments
├── chats/                     # Session storage (JSON files per user)
├── db/                        # Analytics SQLite + GeoIP database
├── log/                       # Server logs
└── deploy/                    # Deployment scripts and Caddyfile

API Reference

All /api/* endpoints require Bearer token authentication (except /api/login). Streaming endpoints return plain text or NDJSON.
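
A client-side sketch of the auth flow, assuming /api/login accepts JSON credentials and returns a JSON body with a token field (the real field names may differ):

```python
import json
import urllib.request

def auth_header(token: str) -> dict:
    """Build the Bearer header required by all /api/* endpoints."""
    return {"Authorization": f"Bearer {token}"}

def login(base_url: str, username: str, password: str) -> str:
    """POST credentials to /api/login and return the session token."""
    req = urllib.request.Request(
        f"{base_url}/api/login",
        data=json.dumps({"username": username, "password": password}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["token"]
```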

Endpoint | Method | Purpose
-------- | ------ | -------
/api/login | POST | Authenticate and receive token
/api/stream | POST | Stream chat completion
/api/chat | POST | Non-streaming chat
/api/stop | POST | Cancel active stream
/api/metrics | GET | Retrieve stream metrics (TTFT, tok/s, tokens)
/api/sessions | GET | List user sessions
/api/session | GET/DELETE | Load or delete session
/api/save | POST | Persist session to disk
/api/stt | POST | Transcribe audio
/api/tts/speak | POST | Synthesize speech (NDJSON audio stream)
/api/gpu | GET | GPU utilization
/api/benchmark/status | GET | Current benchmark run status and scope
/api/benchmark/datasets | GET | Dataset-level progress and task completion summary
/api/benchmark/last_task | GET | Last executed task details for monitor display
/api/dashboard/analytics/summary | GET | Admin analytics snapshot for dashboard panels (supports ?limit=)
/config | GET | Runtime configuration
/health | GET | Ollama availability check

Inference Pipeline

The inference pipeline is fully local and follows a fixed sequence: it assembles a prompt, streams tokens, sanitizes hidden reasoning, and renders rich UI output.

User Messages (conversation history, mode + model choices, language hints)
  → Prompt Assembly (detect request language, system prompt per mode, language guard)
  → Streaming Inference (/api/stream → Ollama, chunked tokens, /api/metrics stats)
  → Output Sanitization (strip <think> blocks, normalize whitespace, safe display text)
  → UI Render (Markdown → HTML, LaTeX → MathJax, streaming diff + layout)

Step objectives:

Mode prompts: Each mode injects a localized system prompt that controls response length and format.

Model routing: The backend enforces a model allowlist configured via environment variables. Users can select from curated models in the UI, but unauthorized models are rejected.

Neural Speech

Whisper STT: The model slices audio into mel spectrograms, processes them through a Transformer encoder, then decodes tokens autoregressively. Configurable model size (tiny → large), CPU or GPU inference, beam search with temperature, and automatic language detection.

Coqui TTS: Text is normalized and chunked, then a neural vocoder predicts mel spectrograms from character/phoneme embeddings. A decoder converts spectrograms to raw audio waveforms. Quality presets (normal/better/best), multi-language voice selection, and NDJSON streaming for low latency.
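
The NDJSON audio stream can be framed as one JSON object per chunk. The field names here are illustrative, not the server's actual schema:

```python
import base64
import json
from collections.abc import Iterator

def ndjson_audio(wav: bytes, chunk_size: int = 32_768) -> Iterator[str]:
    """Yield NDJSON lines carrying base64-encoded WAV chunks."""
    total = (len(wav) + chunk_size - 1) // chunk_size
    for i in range(total):
        chunk = wav[i * chunk_size:(i + 1) * chunk_size]
        yield json.dumps({
            "seq": i,                 # chunk ordering for the client
            "audio": base64.b64encode(chunk).decode("ascii"),
            "last": i == total - 1,   # lets the client stop reading
        })
```

Streaming chunks this way lets playback start before synthesis of the full utterance finishes, which is the low-latency behavior described above.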

Mode | STT Path | TTS Path | Latency
---- | -------- | -------- | -------
Browser | Web Speech API | SpeechSynthesis API | ~50 ms
Server | Whisper neural model | Coqui neural vocoder | ~500-2000 ms

Data Model

Session Object: Each chat session is stored as a JSON file:

{
  "id": "abc123def456",
  "owner": "username",
  "title": "User-defined title",
  "title_ai": "AI-generated title",
  "summary_ai": "Brief conversation summary",
  "pinned": false,
  "updated_at": "2025-12-30T10:30:00+00:00",
  "messages": [
    { "role": "user", "content": "Hello!", "language_ai": "en" },
    { "role": "assistant", "content": "Hi there! How can I help?" }
  ]
}
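
Persisting a session to chats/ can be sketched as follows. The per-owner directory layout is an assumption based on the "JSON files per user" description, not the project's confirmed scheme:

```python
import json
from pathlib import Path

def save_session(root: Path, session: dict) -> Path:
    """Write one session object to chats/<owner>/<id>.json."""
    path = root / session["owner"] / f"{session['id']}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(session, ensure_ascii=False, indent=2))
    return path

def load_session(root: Path, owner: str, session_id: str) -> dict:
    """Read one session object back from disk."""
    return json.loads((root / owner / f"{session_id}.json").read_text())
```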

Analytics Schema: Request telemetry logged to SQLite:

CREATE TABLE analytics (
  id INTEGER PRIMARY KEY,
  ts TEXT,              -- ISO timestamp
  username TEXT,        -- Authenticated user
  method TEXT,          -- HTTP method
  path TEXT,            -- Request path
  ip TEXT,              -- Client IP
  country TEXT,         -- GeoIP country code
  user_agent TEXT,      -- Raw UA string
  ua_browser TEXT,      -- Parsed browser
  ua_os TEXT,           -- Parsed OS
  ua_device TEXT,       -- Device type
  group_label TEXT,     -- Action category
  subgroup_label TEXT   -- Action detail
);
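
The schema above can be exercised with Python's built-in sqlite3 module. An in-memory sketch of logging and querying one event (the helper is illustrative, not the project's analytics.py API):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS analytics (
  id INTEGER PRIMARY KEY,
  ts TEXT, username TEXT, method TEXT, path TEXT,
  ip TEXT, country TEXT, user_agent TEXT,
  ua_browser TEXT, ua_os TEXT, ua_device TEXT,
  group_label TEXT, subgroup_label TEXT
);
"""

def log_event(conn: sqlite3.Connection, **fields) -> None:
    """Insert one request event; omitted columns default to NULL.

    Keyword names are interpolated as column names, so callers must pass
    only trusted identifiers (values themselves are parameterized).
    """
    cols = ", ".join(fields)
    marks = ", ".join("?" for _ in fields)
    conn.execute(f"INSERT INTO analytics ({cols}) VALUES ({marks})",
                 tuple(fields.values()))

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
log_event(conn, ts="2025-12-30T10:30:00+00:00", username="alice",
          method="POST", path="/api/stream", country="US")
```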

Model Catalog

The UI exposes a curated selection of LLMs optimized for different use cases:

Model | Parameters | Architecture | Owner | Best For
----- | ---------- | ------------ | ----- | --------
gemma3:4b | 4.3B | Dense | Google | Speed-critical interactive chat
deepseek-r1:8b | 8.2B | Dense | DeepSeek | Reasoning and analysis
qwen3:8b | 8.2B | Dense | Alibaba | Balanced general-purpose
gpt-oss:20b | 20.9B | MoE | OpenAI | Quality content generation
magistral:24b | 23.6B | Dense | Mistral AI | Multilingual specialist

Security Model

Configuration

Key environment variables for customizing deployment:

Variable | Default | Description
-------- | ------- | -----------
OLLAMA_URL | http://127.0.0.1:11434 | Ollama server endpoint
MODEL | gpt-oss:20b | Default LLM model
SUMMARY_MODEL | gpt-oss:20b | Model for metadata generation
STT_MODE | browser | Speech-to-text mode (browser/whisper)
TTS_MODE | browser | Text-to-speech mode (browser/coqui)
WHISPER_MODEL | base | Whisper model size
WHISPER_DEVICE | cpu | Whisper inference device
TTS_QUALITY_DEFAULT | normal | TTS quality preset
AUTH_TOKEN_TTL | 86400 | Token lifetime in seconds
ECO_MODE | 0 | Enable resource-saving mode
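
Reading these variables with the documented defaults might look like this sketch (a hypothetical helper, not the project's config.py):

```python
import os

def load_config(env=os.environ) -> dict:
    """Collect runtime settings, falling back to the documented defaults."""
    return {
        "ollama_url": env.get("OLLAMA_URL", "http://127.0.0.1:11434"),
        "model": env.get("MODEL", "gpt-oss:20b"),
        "summary_model": env.get("SUMMARY_MODEL", "gpt-oss:20b"),
        "stt_mode": env.get("STT_MODE", "browser"),
        "tts_mode": env.get("TTS_MODE", "browser"),
        "whisper_model": env.get("WHISPER_MODEL", "base"),
        "whisper_device": env.get("WHISPER_DEVICE", "cpu"),
        "tts_quality": env.get("TTS_QUALITY_DEFAULT", "normal"),
        "auth_token_ttl": int(env.get("AUTH_TOKEN_TTL", "86400")),
        "eco_mode": env.get("ECO_MODE", "0") == "1",
    }
```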

Dependencies

The stack is built on open-source components:

Component | Owner | License | Purpose
--------- | ----- | ------- | -------
Ollama | Ollama, Inc. | MIT | Local LLM inference server
Flask | Pallets Projects | BSD-3 | Python web framework
Whisper | OpenAI | MIT | Speech recognition model
Coqui TTS | Coqui AI | MPL-2.0 | Speech synthesis model
SQLite | SQLite Project | Public Domain | Embedded database
Caddy | Caddy Server | Apache-2.0 | TLS reverse proxy
ECharts | Apache Foundation | Apache-2.0 | Analytics visualization
MathJax | MathJax Consortium | Apache-2.0 | Math equation rendering

Licenses may change. Verify current terms in each upstream repository.