Benchmark Report (Autonomous Claude)

Alternative benchmark analysis with different scoring emphasis and visualizations.


Run 9cc182d7 · Darwin arm64 · Ollama 0.13.5

Chat App Readiness Report

A full-stack evaluation of nine local LLMs across quality, latency, generation speed, reliability, and server stability — scored for real-world deployment in a human-facing chat application.

Run Date: 2026-02-09 → 18
Hardware: Apple arm64
Items Evaluated: 636 (current report dataset)
Judge: claude-sonnet-4-6
Quantization: Q4_K_M (8×) + MXFP4
Models Tested: 9 · 4B → 20B params
Composite Chat App Score (0 – 100)

Composite Chat App Readiness Score

Four dimensions weighted for a chat app context: Reliability (35% — can the server respond at all?), Latency / TTFT (30% — how fast does the first token appear?), Answer Quality (25% — pass rate from semantic evaluation), and Generation Speed (10% — tokens/sec sustaining a readable stream).
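The weighting above can be sketched as a simple weighted sum. This is an illustrative reimplementation under the assumption that each dimension is first normalized to a 0–100 sub-score; the report's exact normalization may differ.

```python
# Composite chat-app readiness score: weighted sum of four 0-100
# sub-scores, using the weights stated in the report.
WEIGHTS = {"reliability": 0.35, "latency": 0.30, "quality": 0.25, "speed": 0.10}

def composite_score(sub_scores):
    """Weighted sum of four 0-100 sub-scores -> 0-100 composite."""
    return round(sum(WEIGHTS[k] * sub_scores[k] for k in WEIGHTS), 1)

# Example: perfectly reliable but middling elsewhere.
print(composite_score({"reliability": 100, "latency": 70, "quality": 60, "speed": 50}))
# -> 76.0
```

Because reliability and latency together carry 65% of the weight, a model that answers well but stalls or fails can still rank below a faster, slightly less accurate one.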

100% · Best Reliability · gemma3:4b & gemma3:12b
11.2s · Fastest Median TTFT · gemma3 family (both sizes)
145 tok/s · Peak Generation Speed · gemma3:4b (peak observed)
96.9% · Worst Failure Rate · all qwen3 models need stability fixes
Full Scorecard — All Dimensions
RELIABILITY (35%) · LATENCY (30%) · QUALITY (25%) · SPEED (10%) · models sorted by composite score
Composite Score Ranked
Weighted composite score out of 100

Time-to-First-Token: The #1 UX Driver

TTFT is what users perceive as "lag" — the time between hitting Send and seeing the first character appear. Anything under ~15s is acceptable; under 12s is good. A p95 >45s means nearly 1 in 20 conversations feels completely broken. Failures (no response) are plotted at 60s.
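The percentile bookkeeping above can be sketched as follows. This is an illustrative nearest-rank implementation; mapping failures to a 60 s ceiling before computing p50/p90/p95 is an assumption matching the plotting convention, not necessarily the scoring pipeline.

```python
def percentile(values, p):
    """Nearest-rank percentile over `values` (TTFT in ms)."""
    s = sorted(values)
    idx = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[idx]

FAILURE_MS = 60_000  # failures (no response) plotted at 60 s

ttfts = [9_800, 11_200, 12_500, 14_000, None, 16_500, 10_100, 48_000]  # None = no response
clamped = [FAILURE_MS if t is None else t for t in ttfts]
print(percentile(clamped, 50), percentile(clamped, 95))
# -> 12500 60000
```

Note how a single failure in eight requests drags p95 to the 60 s ceiling while barely moving the median, which is exactly why the report plots p50, p90, and p95 side by side.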

Median TTFT by Model
Lower is better · failures mapped to 60 000 ms · sorted fastest first
TTFT Percentiles (p50 / p90 / p95)
Grouped bars — variability matters as much as median
TTFT Distribution — All 636 Responses
Each row = one model · each dot = one response · x-axis = TTFT ms · failures at 60 000 ms (far right, red)
% Responses Under Latency Thresholds
For a chat app: <15s = acceptable · <12s = great · <10s = instant feel
TTFT: Shortform vs. Chat Tasks
Context length difference changes latency — longer chat prompts → higher TTFT
Warmup / Cold-Start Time
Time for first token on model cold-start · relevant for auto-scaling deployments

Tokens per Second: Streaming Feel

After the first token, users experience the stream as "reading speed." >60 tok/s feels instantaneous; 40–60 tok/s is smooth; 20–40 tok/s is readable but noticeable; <20 tok/s feels sluggish for long answers. Speed is constrained by model size and quantization.
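The perceived-speed bands above map directly to a small classifier. The band labels are the report's; the function itself is illustrative.

```python
def streaming_feel(tok_per_sec):
    """Map a tokens/sec rate to the report's perceived-speed bands."""
    if tok_per_sec > 60:
        return "instantaneous"
    if tok_per_sec >= 40:
        return "smooth"
    if tok_per_sec >= 20:
        return "readable but noticeable"
    return "sluggish"

print(streaming_feel(145), "/", streaming_feel(33))
# -> instantaneous / readable but noticeable
```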

Median Tokens/Second
Successful responses only · sorted fastest first
Speed vs. Quality Tradeoff
Scatter: x=median tok/s · y=answer quality % · bubble size=reliability %
Token Generation Speed Distribution (Box Plots)
Min · p25 · median · p75 · max · outliers shown · narrow band = very consistent hardware-bound speed
⚑ Note: qwen3:8b, deepseek-r1:8b, and gpt-oss:20b show extremely narrow TPS bands (±1 tok/s variance) — this suggests hard hardware ceilings, not software throttling. gemma3:4b shows high variance (63–145 tok/s) reflecting output-length dependency.

Can the Server Respond? Stability KPIs

Two failure modes observed: empty responses (Ollama returns HTTP 200 with 0 bytes — likely memory pressure or context overflow) and TTFT timeouts (>45s first-token, process killed). Both are hard failures for a chat app — the user sees nothing.
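A harness would classify these two modes roughly as below. The field names (`status`, `body`, `ttft_ms`) are illustrative, not the harness's actual schema; the 45 s kill threshold comes from the text.

```python
TTFT_KILL_MS = 45_000  # first-token deadline before the process is killed

def classify_failure(status, body, ttft_ms):
    """Classify a response as ok / empty_response / ttft_timeout."""
    if ttft_ms is None or ttft_ms > TTFT_KILL_MS:
        return "ttft_timeout"    # first token never arrived in time
    if status == 200 and len(body) == 0:
        return "empty_response"  # Ollama returned HTTP 200 with 0 bytes
    return "ok"

print(classify_failure(200, b"", 3_000))        # -> empty_response
print(classify_failure(200, b"Hello", 52_000))  # -> ttft_timeout
```

The empty-response case is the nastier one operationally: the HTTP layer reports success, so only a body-length check (or downstream validation) catches it.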

Response Success Rate
% of requests that produced a valid response · 100% required for production
Failure Mode Breakdown
What type of failure occurred when a response wasn't delivered
Stacked: Pass · Semantic Fail · Format Fail · Server Failure
Full item decomposition per model — server failures (empty response + timeout) are the dark grey · all 636 items shown
Per-Model Reliability Detail
Total requests · successful · empty response failures · TTFT-killed · pass rate

Verbosity, Token Usage & Response Shape

How much a model "talks" directly affects perceived quality and generation time. Concise answers (<50 tokens) feel snappy; verbose answers (>300 tokens) feel thorough but slow. Both extremes can be wrong for a chat app: too terse loses context, too verbose is exhausting to read and slow to stream.
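The verbosity buckets used in the profile chart follow directly from the thresholds above; this small helper is illustrative.

```python
def verbosity_bucket(output_tokens):
    """Bucket a response by output-token count: concise / medium / verbose."""
    if output_tokens < 50:
        return "concise"
    if output_tokens < 300:
        return "medium"
    return "verbose"

print(verbosity_bucket(42), verbosity_bucket(180), verbosity_bucket(300))
# -> concise medium verbose
```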

Output Verbosity Profile
Concise <50 tok · Medium 50–299 tok · Verbose ≥300 tok · stacked 100%
Median Output Tokens
Median output token count per model · successful responses only
TTFT vs Output Length Correlation
Scatter per model — does output length affect time-to-first-token? (It shouldn't — TTFT should be independent of output length)

What Users Get: Semantic Correctness

Pass rates from the semantic evaluation across 5 shortform tasks and 3 chat categories. For a chat app, all dimensions matter — a fast but wrong model is worse than a slower correct one.

Overall Answer Quality (Pass Rate)
Includes timeout failures as wrong answers · external judge verdict · sorted best first
Quality Radar — Task Breakdown
Top 4 performers · 5 shortform task axes · 10 items each
Chat Task Performance: Memory · Safety · Instruction Following
Chat-specific evaluation · 10 items per dataset per model · % pass rate

Speed vs. Quality Quadrant Analysis

For a chat app, the ideal model is in the top-right quadrant: fast latency AND high quality. Models in the bottom-right are fast but wrong; top-left are accurate but slow; bottom-left should be avoided entirely.
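The quadrant assignment described above can be sketched as a split at the per-dimension medians (an assumption matching the chart note; the model names and scores below are made up).

```python
from statistics import median

def quadrant(latency_score, quality, lat_median, qual_median):
    """Place a model in one of the four deployment quadrants."""
    fast = latency_score >= lat_median
    good = quality >= qual_median
    if fast and good:
        return "deploy (fast + accurate)"
    if fast:
        return "fast but wrong"
    if good:
        return "accurate but slow"
    return "avoid"

# Hypothetical (latency_score, quality) pairs for four models.
models = {"a": (80, 90), "b": (85, 40), "c": (30, 88), "d": (20, 35)}
lat_m = median(v[0] for v in models.values())
qual_m = median(v[1] for v in models.values())
print({m: quadrant(l, q, lat_m, qual_m) for m, (l, q) in models.items()})
```

With quadrant lines at the medians, roughly half the field lands on each side of each axis, so the chart separates relative winners rather than applying absolute thresholds.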

Deployment Quadrant: Latency Score vs. Answer Quality
x = latency score (inverted TTFT, higher=faster) · y = answer quality (pass %) · bubble = reliability % · quadrant lines at medians
Full UX Radar: Five Dimensions Per Model
Quality · Reliability · Latency · Speed · Instruction Following — normalized 0-100

Model-by-Model Verdicts

Based on the composite evaluation, each model receives a deployment recommendation for a consumer-facing chat application running on Apple arm64 hardware.

Key Findings
Cross-cutting observations from latency, quality, and stability data