Benchmark Report (Autonomous Claude)

Alternative benchmark analysis with different scoring emphasis and visualizations.


Run 9cc182d7 · Darwin arm64 · Ollama 0.13.5

Chat App Readiness Report

A full-stack evaluation of nine local LLMs across quality, latency, generation speed, reliability, and server stability — scored for real-world deployment in a human-facing chat application.

Run Date: 2026-02-09 → 18
Hardware: Apple arm64
Items Evaluated: 636 (current report dataset)
Judge: claude-sonnet-4-6
Quantization: Q4_K_M (8×) + MXFP4
Models Tested: 9 · 4B → 20B params
Composite Chat App Score (0 – 100)

Composite Chat App Readiness Score

Four dimensions weighted for a chat app context: Reliability (35% — can the server respond at all?), Latency / TTFT (30% — how fast does the first token appear?), Answer Quality (25% — pass rate from semantic evaluation), and Generation Speed (10% — tokens/sec sustaining a readable stream).
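The weighting above can be sketched as a simple weighted sum. This is an illustrative reimplementation under the assumption that each dimension is first normalized to a 0–100 sub-score; the report's exact normalization may differ.

```python
# Composite chat-app readiness score: weighted sum of four 0-100
# sub-scores, using the weights stated in the report.
WEIGHTS = {"reliability": 0.35, "latency": 0.30, "quality": 0.25, "speed": 0.10}

def composite_score(sub_scores):
    """Weighted sum of four 0-100 sub-scores -> 0-100 composite."""
    return round(sum(WEIGHTS[k] * sub_scores[k] for k in WEIGHTS), 1)

# Example: perfectly reliable but middling elsewhere.
print(composite_score({"reliability": 100, "latency": 70, "quality": 60, "speed": 50}))
# -> 76.0
```

Because reliability and latency together carry 65% of the weight, a model that answers well but stalls or fails can still rank below a faster, slightly less accurate one.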

100% · Best Reliability · gemma3:4b & gemma3:12b
11.2s · Fastest Median TTFT · gemma3 family (both sizes)
145 tok/s · Peak Generation Speed · gemma3:4b (peak observed)
96.9% · Worst Failure Rate · all qwen3 models need stability fixes
Full Scorecard — All Dimensions
RELIABILITY (35%) · LATENCY (30%) · QUALITY (25%) · SPEED (10%) · models sorted by composite score
Composite Score Ranked
Weighted composite score out of 100

Time-to-First-Token: The #1 UX Driver

TTFT is what users perceive as "lag" — the time between hitting Send and seeing the first character appear. Anything under ~15s is acceptable; under 12s is good. A p95 >45s means nearly 1 in 20 conversations feels completely broken. Failures (no response) are plotted at 60s.
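The percentile bookkeeping above can be sketched as follows. This is an illustrative nearest-rank implementation; mapping failures to a 60 s ceiling before computing p50/p90/p95 is an assumption matching the plotting convention, not necessarily the scoring pipeline.

```python
def percentile(values, p):
    """Nearest-rank percentile over `values` (TTFT in ms)."""
    s = sorted(values)
    idx = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[idx]

FAILURE_MS = 60_000  # failures (no response) plotted at 60 s

ttfts = [9_800, 11_200, 12_500, 14_000, None, 16_500, 10_100, 48_000]  # None = no response
clamped = [FAILURE_MS if t is None else t for t in ttfts]
print(percentile(clamped, 50), percentile(clamped, 95))
# -> 12500 60000
```

Note how a single failure in eight requests drags p95 to the 60 s ceiling while barely moving the median, which is exactly why the report plots p50, p90, and p95 side by side.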

Median TTFT by Model
Lower is better · failures mapped to 60 000 ms · sorted fastest first
TTFT Percentiles (p50 / p90 / p95)
Grouped bars — variability matters as much as median
TTFT Distribution — All 636 Responses
Each row = one model · each dot = one response · x-axis = TTFT ms · failures at 60 000 ms (far right, red)
% Responses Under Latency Thresholds
For a chat app: <15s = acceptable · <12s = great · <10s = instant feel
TTFT: Shortform vs. Chat Tasks
Context length difference changes latency — longer chat prompts → higher TTFT
Warmup / Cold-Start Time
Time for first token on model cold-start · relevant for auto-scaling deployments

Tokens per Second: Streaming Feel

After the first token, users experience the stream as "reading speed." >60 tok/s feels instantaneous; 40–60 tok/s is smooth; 20–40 tok/s is readable but noticeable; <20 tok/s feels sluggish for long answers. Speed is constrained by model size and quantization.
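The perceived-speed bands above map directly to a small classifier. The band labels are the report's; the function itself is illustrative.

```python
def streaming_feel(tok_per_sec):
    """Map a tokens/sec rate to the report's perceived-speed bands."""
    if tok_per_sec > 60:
        return "instantaneous"
    if tok_per_sec >= 40:
        return "smooth"
    if tok_per_sec >= 20:
        return "readable but noticeable"
    return "sluggish"

print(streaming_feel(145), "/", streaming_feel(33))
# -> instantaneous / readable but noticeable
```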

Median Tokens/Second
Successful responses only · sorted fastest first
Speed vs. Quality Tradeoff
Scatter: x=median tok/s · y=answer quality % · bubble size=reliability %
Token Generation Speed Distribution (Box Plots)
Min · p25 · median · p75 · max · outliers shown · narrow band = very consistent hardware-bound speed
⚑ Note: qwen3:8b, deepseek-r1:8b, and gpt-oss:20b show extremely narrow TPS bands (±1 tok/s variance) — this suggests hard hardware ceilings, not software throttling. gemma3:4b shows high variance (63–145 tok/s) reflecting output-length dependency.

Can the Server Respond? Stability KPIs

Two failure modes observed: empty responses (Ollama returns HTTP 200 with 0 bytes — likely memory pressure or context overflow) and TTFT timeouts (>45s first-token, process killed). Both are hard failures for a chat app — the user sees nothing.
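A harness would classify these two modes roughly as below. The field names (`status`, `body`, `ttft_ms`) are illustrative, not the harness's actual schema; the 45 s kill threshold comes from the text.

```python
TTFT_KILL_MS = 45_000  # first-token deadline before the process is killed

def classify_failure(status, body, ttft_ms):
    """Classify a response as ok / empty_response / ttft_timeout."""
    if ttft_ms is None or ttft_ms > TTFT_KILL_MS:
        return "ttft_timeout"    # first token never arrived in time
    if status == 200 and len(body) == 0:
        return "empty_response"  # Ollama returned HTTP 200 with 0 bytes
    return "ok"

print(classify_failure(200, b"", 3_000))        # -> empty_response
print(classify_failure(200, b"Hello", 52_000))  # -> ttft_timeout
```

The empty-response case is the nastier one operationally: the HTTP layer reports success, so only a body-length check (or downstream validation) catches it.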

Response Success Rate
% of requests that produced a valid response · 100% required for production
Failure Mode Breakdown
What type of failure occurred when a response wasn't delivered
Stacked: Pass · Semantic Fail · Format Fail · Server Failure
Full item decomposition per model — server failures (empty response + timeout) are the dark grey · all 636 items shown
Per-Model Reliability Detail
Total requests · successful · empty response failures · TTFT-killed · pass rate

Verbosity, Token Usage & Response Shape

How much a model "talks" directly affects perceived quality and generation time. Concise answers (<50 tokens) feel snappy; verbose answers (>300 tokens) feel thorough but slow. Both extremes can be wrong for a chat app: too terse loses context, too verbose is exhausting to read and slow to stream.
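The verbosity buckets used in the profile chart follow directly from the thresholds above; this small helper is illustrative.

```python
def verbosity_bucket(output_tokens):
    """Bucket a response by output-token count: concise / medium / verbose."""
    if output_tokens < 50:
        return "concise"
    if output_tokens < 300:
        return "medium"
    return "verbose"

print(verbosity_bucket(42), verbosity_bucket(180), verbosity_bucket(300))
# -> concise medium verbose
```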

Output Verbosity Profile
Concise <50 tok · Medium 50–299 tok · Verbose ≥300 tok · stacked 100%
Median Output Tokens
Median output token count per model · successful responses only
TTFT vs Output Length Correlation
Scatter per model — does output length affect time-to-first-token? (It shouldn't — TTFT should be independent of output length)

What Users Get: Semantic Correctness

Pass rates from the semantic evaluation across 5 shortform tasks and 3 chat categories. For a chat app, all dimensions matter — a fast but wrong model is worse than a slower correct one.

Overall Answer Quality (Pass Rate)
Includes timeout failures as wrong answers · external judge verdict · sorted best first
Quality Radar — Task Breakdown
Top 4 performers · 5 shortform task axes · 10 items each
Chat Task Performance: Memory · Safety · Instruction Following
Chat-specific evaluation · 10 items per dataset per model · % pass rate

Speed vs. Quality Quadrant Analysis

For a chat app, the ideal model is in the top-right quadrant: fast latency AND high quality. Models in the bottom-right are fast but wrong; top-left are accurate but slow; bottom-left should be avoided entirely.
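The quadrant assignment described above can be sketched as a split at the per-dimension medians (an assumption matching the chart note; the model names and scores below are made up).

```python
from statistics import median

def quadrant(latency_score, quality, lat_median, qual_median):
    """Place a model in one of the four deployment quadrants."""
    fast = latency_score >= lat_median
    good = quality >= qual_median
    if fast and good:
        return "deploy (fast + accurate)"
    if fast:
        return "fast but wrong"
    if good:
        return "accurate but slow"
    return "avoid"

# Hypothetical (latency_score, quality) pairs for four models.
models = {"a": (80, 90), "b": (85, 40), "c": (30, 88), "d": (20, 35)}
lat_m = median(v[0] for v in models.values())
qual_m = median(v[1] for v in models.values())
print({m: quadrant(l, q, lat_m, qual_m) for m, (l, q) in models.items()})
```

With quadrant lines at the medians, roughly half the field lands on each side of each axis, so the chart separates relative winners rather than applying absolute thresholds.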

Deployment Quadrant: Latency Score vs. Answer Quality
x = latency score (inverted TTFT, higher=faster) · y = answer quality (pass %) · bubble = reliability % · quadrant lines at medians
Full UX Radar: Five Dimensions Per Model
Quality · Reliability · Latency · Speed · Instruction Following — normalized 0-100

Model-by-Model Verdicts

Based on the composite evaluation, each model receives a deployment recommendation for a consumer-facing chat application running on Apple arm64 hardware.

Key Findings
Cross-cutting observations from latency, quality, and stability data