Benchmark Report (Autonomous Claude)
Alternative benchmark analysis with different scoring emphasis and visualizations.
Chat App Readiness Report
A full-stack evaluation of nine local LLMs across quality, latency, generation speed, reliability, and server stability — scored for real-world deployment in a human-facing chat application.
Composite Chat App Readiness Score
Four dimensions weighted for a chat app context: Reliability (35% — can the server respond at all?), Latency / TTFT (30% — how fast does the first token appear?), Answer Quality (25% — pass rate from semantic evaluation), and Generation Speed (10% — tokens/sec sustaining a readable stream).
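The weighting above can be sketched as a simple weighted sum. This is an illustrative reconstruction, not the report's actual scoring code: the dimension names and weights come from the text, but the assumption that each dimension is pre-normalized to the 0–1 range is mine.

```python
# Hypothetical sketch of the composite Chat App Readiness Score.
# Weights are from the report; the 0..1 normalization of each
# dimension is an assumption made for illustration.
WEIGHTS = {
    "reliability": 0.35,  # can the server respond at all?
    "latency":     0.30,  # TTFT, inverted so lower latency scores higher
    "quality":     0.25,  # pass rate from the semantic evaluation
    "speed":       0.10,  # tokens/sec sustaining a readable stream
}

def composite_score(dims: dict[str, float]) -> float:
    """Weighted sum of normalized (0..1) dimension scores, scaled to 0..100."""
    return 100 * sum(WEIGHTS[k] * dims[k] for k in WEIGHTS)

# Example: a reliable, fairly fast, mostly correct model.
score = composite_score(
    {"reliability": 1.0, "latency": 0.8, "quality": 0.9, "speed": 0.6}
)
print(round(score, 1))  # 87.5
```

Because reliability and latency together carry 65% of the weight, a model that answers correctly but times out frequently can still land near the bottom of the ranking.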
Time-to-First-Token: The #1 UX Driver
TTFT is what users perceive as "lag" — the time between hitting Send and seeing the first character appear. Under ~12s is good; under ~15s is still acceptable. A p95 above 45s means nearly 1 in 20 conversations feels completely broken. Failures (no response) are plotted at 60s.
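The thresholds and the p95 statistic above can be sketched as follows. The bucket labels and the nearest-rank percentile method are illustrative choices; the cutoffs (12s, 15s, 45s) and the 60s failure ceiling are from the text.

```python
import math

# Runs that never produced a token are plotted at a 60 s ceiling (per the report).
FAILURE_TTFT = 60.0

def ttft_bucket(seconds: float) -> str:
    """Classify a time-to-first-token sample using the report's cutoffs."""
    if seconds < 12:
        return "good"
    if seconds < 15:
        return "acceptable"
    if seconds <= 45:
        return "slow"
    return "broken"  # beyond the 45 s kill threshold

def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile (one common convention among several)."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]
```

With 20 samples, p95 is the 19th-smallest value, which is why a single timeout in 20 runs is enough to push p95 to the failure ceiling.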
Tokens per Second: Streaming Feel
After the first token, users experience the stream as "reading speed." >60 tok/s feels instantaneous; 40–60 tok/s is smooth; 20–40 tok/s is readable but noticeable; <20 tok/s feels sluggish for long answers. Speed is constrained by model size and quantization.
Can the Server Respond? Stability KPIs
Two failure modes observed: empty responses (Ollama returns HTTP 200 with 0 bytes — likely memory pressure or context overflow) and TTFT timeouts (>45s first-token, process killed). Both are hard failures for a chat app — the user sees nothing.
Verbosity, Token Usage & Response Shape
How much a model "talks" directly affects perceived quality and generation time. Concise answers (<50 tokens) feel snappy; verbose answers (>300 tokens) feel thorough but slow. Both extremes can be wrong for a chat app: too terse loses context, too verbose is exhausting to read and slow to stream.
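The verbosity buckets above can be expressed as a small classifier. This is a sketch: the 50- and 300-token cutoffs are from the text, while the whitespace split is a crude stand-in for the model's real tokenizer.

```python
# Hypothetical sketch: classify response shape by token count.
# Whitespace splitting approximates tokenization for illustration only.
def verbosity(text: str) -> str:
    n = len(text.split())
    if n < 50:
        return "concise"   # feels snappy, but may lose context
    if n <= 300:
        return "balanced"
    return "verbose"       # thorough, but slow to stream and tiring to read
```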
What Users Get: Semantic Correctness
Pass rates from the semantic evaluation across 5 shortform tasks and 3 chat categories. For a chat app, all dimensions matter — a fast but wrong model is worse than a slower correct one.
Speed vs. Quality Quadrant Analysis
For a chat app, the ideal model sits in the top-right quadrant: low latency AND high quality. Models in the bottom-right are fast but wrong; top-left are accurate but slow; bottom-left should be avoided entirely.
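The quadrant assignment can be sketched as two threshold comparisons. The cut points below (15s TTFT, 70% pass rate) are illustrative medians, not values taken from the report.

```python
# Hypothetical sketch: place a model in the speed/quality quadrant.
# Threshold defaults are illustrative, not the report's actual cut points.
def quadrant(ttft_s: float, pass_rate: float,
             ttft_cut: float = 15.0, quality_cut: float = 0.70) -> str:
    fast = ttft_s < ttft_cut
    good = pass_rate >= quality_cut
    if fast and good:
        return "top-right: deploy candidate"
    if fast:
        return "bottom-right: fast but wrong"
    if good:
        return "top-left: accurate but slow"
    return "bottom-left: avoid"
```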
Model-by-Model Verdicts
Based on the composite evaluation, each model receives a deployment recommendation for a consumer-facing chat application running on Apple arm64 hardware.