Benchmark Report (Guided)
A quantitative assessment of 8 local models on Apple Silicon using the KISS benchmark framework.
Abstract
This report presents a scientific evaluation of run 9cc182d7-74c0-4ac2-a0eb-3ed86afd142b using the following analysis axes: responsiveness, streaming quality, chat accuracy, task versatility, and thermal stability.
Task versatility is decomposed into Math, Reasoning, and Instruction, yielding a 7-axis comparative radar profile per model.
The final distribution is 401/640 passed (62.7%) and 239/640 failed (37.3%) after GPT Codex 5.3 semantic re-judging of shortform samples.
Key Findings
Principal Conclusions
- Best composite model: gemma3:4b with fitness 88.7/100.
- Fastest response: gemma3:4b at 11.2s mean TTFT.
- Highest throughput: gemma3:4b at 89.3 tok/s.
- Highest task reliability: gemma3:12b with 77.5% pass rate.
1. Introduction
1.1 Motivation
The objective is to derive operational model rankings from observed benchmark behavior under a unified local-inference environment.
1.2 Scope
Scope includes eight models over eight datasets (five shortform + three chat), totaling 640 model-task evaluations.
1.3 Experimental Endpoint
All tasks in scope are terminally labeled and included in comparative statistics.
1.4 Live Benchmark Monitor Companion
The companion page benchmark_monitor.html is a read-only, database-backed monitor for ongoing runs and the fastest way to follow benchmark execution in real time. It shows run scope and state (models/datasets/tasks), rolling performance telemetry (TTFT, tok/s, CPU and GPU utilization, GPU temperature, disk I/O), dataset-level completion with per-task pills that switch from progress to pass rate as each dataset closes, and a detailed “Last Task Executed” panel with workflow stage, prompt, streaming response, evaluation result, and task KPIs. It also handles resets and resumes correctly by tracking run identity and live task events, so readers can trust what they see even when a run restarts, resumes, or is only partially completed. Open it directly from this report at Live Benchmark Monitor.
2. Methodology
2.1 Five Principal Axes
Axis 1: Responsiveness (TTFT)
Lower TTFT maps to higher normalized score.
Axis 2: Streaming Quality (tok/s)
Higher sustained token throughput maps to higher score.
Axis 3: Chat Accuracy
Measured as pass rate over chat_instruction, chat_memory, and chat_safety tracks.
Axis 4: Task Versatility
Decomposed into Math (cs_engineering + physics), Reasoning (critical_thinking + logic_deduction), and Instruction (chat_instruction).
Axis 5: Thermal Stability
Lower mean GPU temperature maps to higher normalized thermal score.
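For the lower-is-better axes (TTFT and GPU temperature), normalization inverts the scale so that faster and cooler models score higher. The sketch below shows one plausible min-max normalization over the observed range; the framework's exact scaling bounds are not restated in this report, so treat it as illustrative.

```python
def normalize(value: float, lo: float, hi: float, lower_is_better: bool = False) -> float:
    """Map a raw metric into [0, 100] by min-max scaling.

    Sketch of one plausible normalization; the benchmark framework's
    exact scaling bounds are not specified in this report.
    """
    if hi == lo:
        return 100.0
    score = (value - lo) / (hi - lo) * 100.0
    # Invert lower-is-better axes (TTFT, GPU temperature) so that
    # faster / cooler maps to a higher normalized score.
    return 100.0 - score if lower_is_better else score

# Examples using observed ranges from this run:
ttft_score = normalize(11.2, lo=11.2, hi=26.3, lower_is_better=True)  # -> 100.0
temp_score = normalize(82.6, lo=45.1, hi=82.6, lower_is_better=True)  # -> 0.0
```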
2.2 Scoring Formula
Composite fitness follows the benchmark family weighting over the normalized axis scores, with the speed component defined as speed = 0.6 × TTFT_score + 0.4 × throughput_score.
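For illustration, the speed component can be computed directly from the normalized TTFT and throughput scores. The composite weights over the remaining axes are defined by the benchmark family and are not restated here, so the `axis_weights` argument below is a placeholder, not the published weighting.

```python
def speed_subscore(ttft_score: float, throughput_score: float) -> float:
    """Speed component from Section 2.2:
    speed = 0.6 * TTFT_score + 0.4 * throughput_score."""
    return 0.6 * ttft_score + 0.4 * throughput_score

def composite_fitness(axis_scores: dict[str, float],
                      axis_weights: dict[str, float]) -> float:
    """Weighted sum over normalized axis scores (0-100).

    `axis_weights` stands in for the benchmark family weighting,
    which is not restated in this report.
    """
    return sum(axis_weights[name] * axis_scores[name] for name in axis_weights)
```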
2.3 Hardware Configuration
| Component | Specification |
|---|---|
| Device | Mac mini M4 Pro |
| CPU | Apple M4 Pro, 12-core |
| GPU | 16-core integrated |
| Unified RAM | 24 GB |
| Inference Engine | Ollama 0.13.5 |
2.4 Model Registry
| Model | Origin | Params | Quant | Architecture | Warmup TTFT |
|---|---|---|---|---|---|
| deepseek-r1:14b | DeepSeek | 14.1B | Q4_K_M | Reasoning-heavy | 11.1s |
| deepseek-r1:8b | DeepSeek | 8.2B | Q4_K_M | Reasoning-heavy | 7.7s |
| gemma3:12b | Google | 12.2B | Q4_K_M | Direct-response | 9.5s |
| gemma3:4b | Google | 4.3B | Q4_K_M | Direct-response | 5.6s |
| gpt-oss:20b | OpenAI | 20.9B | MXFP4 | Reasoning-heavy | 11.2s |
| qwen3:14b | Alibaba | 14.5B | Q4_K_M | Reasoning-heavy | 9.5s |
| qwen3:4b | Alibaba | 4.0B | Q4_K_M | Reasoning-heavy | 6.5s |
| qwen3:8b | Alibaba | 8.2B | Q4_K_M | Reasoning-heavy | 7.8s |
2.5 Dataset Inventory
| Dataset | Kind | Total | Passed | Failed | Pass Rate |
|---|---|---|---|---|---|
| chat_instruction | chat | 80 | 58 | 22 | 72.5% |
| chat_memory | chat | 80 | 67 | 13 | 83.8% |
| chat_safety | chat | 80 | 58 | 22 | 72.5% |
| critical_thinking | shortform | 80 | 22 | 58 | 27.5% |
| cs_engineering | shortform | 80 | 41 | 39 | 51.2% |
| history | shortform | 80 | 52 | 28 | 65.0% |
| logic_deduction | shortform | 80 | 57 | 23 | 71.2% |
| physics | shortform | 80 | 46 | 34 | 57.5% |
3. Results: Short-Form Evaluation
3.1 Per-Model Accuracy
| Rank | Model | Pass | Fail | Shortform | Chat UX | TTFT | TPS | Temp | Fitness |
|---|---|---|---|---|---|---|---|---|---|
| 1 | gemma3:4b | 72.5% | 27.5% | 60.0% | 93.3% | 11.2s | 89.3 | 45.3°C | 88.7 |
| 2 | gemma3:12b | 77.5% | 22.5% | 68.0% | 93.3% | 11.3s | 38.7 | 45.1°C | 81.6 |
| 3 | deepseek-r1:8b | 65.0% | 35.0% | 54.0% | 83.3% | 18.4s | 36.6 | 78.5°C | 65.0 |
| 4 | qwen3:8b | 61.2% | 38.8% | 58.0% | 66.7% | 16.2s | 34.5 | 72.3°C | 59.8 |
| 5 | qwen3:4b | 56.2% | 43.8% | 54.0% | 60.0% | 15.4s | 49.9 | 68.6°C | 59.2 |
| 6 | gpt-oss:20b | 70.0% | 30.0% | 64.0% | 80.0% | 25.4s | 41.4 | 58.1°C | 57.8 |
| 7 | deepseek-r1:14b | 41.2% | 58.8% | 28.0% | 63.3% | 19.8s | 20.0 | 81.0°C | 45.4 |
| 8 | qwen3:14b | 57.5% | 42.5% | 50.0% | 70.0% | 26.3s | 17.7 | 82.6°C | 45.0 |
3.2 Speed Metrics
Average TTFT (lower is better)
4. Seven-Axis Radar Visualization
The radar chart uses this report's seven-axis family: TTFT, Throughput, Math, Reasoning, Instruction, Chat UX, and Thermal. All values are normalized to 0-100.
5. Failure Analysis
5.1 KO Concentration
Dominant KO datasets
Failure mass is concentrated in critical_thinking, cs_engineering, and physics.
5.2 Failure Summary by Dataset
| Dataset | KO Count | KO Rate | Total |
|---|---|---|---|
| critical_thinking | 58 | 72.5% | 80 |
| cs_engineering | 39 | 48.8% | 80 |
| physics | 34 | 42.5% | 80 |
| history | 28 | 35.0% | 80 |
| logic_deduction | 23 | 28.8% | 80 |
| chat_instruction | 22 | 27.5% | 80 |
| chat_safety | 22 | 27.5% | 80 |
| chat_memory | 13 | 16.2% | 80 |
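The per-dataset KO rates above can be recomputed directly from db/benchmark.db. The sketch below assumes a task-level table named `tasks` with `dataset` and `status` columns; the actual schema is not documented in this report, so the table and column names are illustrative only.

```python
import sqlite3

# Hypothetical schema: a `tasks` table with `dataset` and `status` columns.
# The real db/benchmark.db schema may use different names.
QUERY = """
SELECT dataset,
       SUM(CASE WHEN status = 'fail' THEN 1 ELSE 0 END) AS ko_count,
       COUNT(*) AS total
FROM tasks
GROUP BY dataset
ORDER BY ko_count DESC;
"""

with sqlite3.connect("db/benchmark.db") as conn:
    for dataset, ko_count, total in conn.execute(QUERY):
        print(f"{dataset:<18} {ko_count:>3} / {total}  ({100.0 * ko_count / total:.1f}%)")
```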
6. Chat UX Track Results
6.1 Turn Compliance Rate (Task-Level)
| Model | Tasks | Chat % | TTFT | tok/s |
|---|---|---|---|---|
| gemma3:12b | 30 | 93.3% | 11.3s | 38.7 |
| gemma3:4b | 30 | 93.3% | 11.2s | 89.3 |
| deepseek-r1:8b | 30 | 83.3% | 18.4s | 36.6 |
| gpt-oss:20b | 30 | 80.0% | 25.4s | 41.4 |
| qwen3:14b | 30 | 70.0% | 26.3s | 17.7 |
| qwen3:8b | 30 | 66.7% | 16.2s | 34.5 |
| deepseek-r1:14b | 30 | 63.3% | 19.8s | 20.0 |
| qwen3:4b | 30 | 60.0% | 15.4s | 49.9 |
6.2 Chat TTFT Distribution
Observed per-model mean TTFT spans 11.2s to 26.3s.
7. Thermal Analysis
Average GPU Temperature (lower is better)
Thermal spread is 45.1°C to 82.6°C across evaluated models.
8. Composite Rankings
8.1 Efficiency Frontier
The efficiency frontier is defined by models that preserve high chat quality with lower latency and manageable thermal load under the fitness objective.
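One way to make the frontier explicit is a Pareto filter over (chat quality up, TTFT down, GPU temperature down). The sketch below uses values from the Section 3.1 table and is illustrative only; it is not the exact fitness objective used for the ranking.

```python
# Illustrative Pareto filter: a model is on the frontier if no other model is
# better-or-equal on chat quality, TTFT, and temperature with at least one
# strict improvement. Values taken from the Section 3.1 table.
models = {
    "gemma3:4b":      {"chat": 93.3, "ttft": 11.2, "temp": 45.3},
    "gemma3:12b":     {"chat": 93.3, "ttft": 11.3, "temp": 45.1},
    "deepseek-r1:8b": {"chat": 83.3, "ttft": 18.4, "temp": 78.5},
    "gpt-oss:20b":    {"chat": 80.0, "ttft": 25.4, "temp": 58.1},
}

def dominates(a: dict, b: dict) -> bool:
    better_or_equal = a["chat"] >= b["chat"] and a["ttft"] <= b["ttft"] and a["temp"] <= b["temp"]
    strictly_better = a["chat"] > b["chat"] or a["ttft"] < b["ttft"] or a["temp"] < b["temp"]
    return better_or_equal and strictly_better

frontier = [name for name, m in models.items()
            if not any(dominates(other, m) for other in models.values() if other is not m)]
print(frontier)  # ['gemma3:4b', 'gemma3:12b']
```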
8.2 Recommendations by Use Case
| Use Case | Recommended Model | Rationale |
|---|---|---|
| Balanced Production | gemma3:4b | Highest composite fitness (88.7). |
| Lowest Latency | gemma3:4b | Lowest mean TTFT (11.2s). |
| Highest Throughput | gemma3:4b | Highest mean output rate (89.3 tok/s). |
| Highest Reliability | gemma3:12b | Best global pass rate (77.5%). |
9. Chain-of-Thought (CoT) Analysis
This benchmark is explicitly oriented to the application use case: fast, chat-friendly interaction. In that context, CoT behavior is not a side detail; it directly affects first-token latency and perceived responsiveness.
9.1 Behavioral Mechanism
As characterized in benchmark/test_thinking.py, some models emit hidden reasoning before visible output.
This produces an architecture-level latency overhead: the user sees the first token only after internal reasoning finishes.
Concretely, benchmark/test_thinking.py sends a deterministic two-message chat request to each model
(system: final-answer-only, user: "What is 2+2?") and inspects the Ollama response fields.
It records:
eval_count (generated tokens), end-to-end elapsed time, derived tokens/sec, visible message.content length,
hidden message.thinking length, and the thinking/content character split.
If message.thinking is non-empty, the model is classified as using a reasoning-first response mode in this probe.
This is a behavioral probe, not a quality benchmark: it is intentionally small and controlled to isolate response mode effects. It does not grade task correctness breadth, nor does it estimate full benchmark variance across prompt families.
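A minimal sketch of the probe's shape is shown below, using the Ollama /api/chat endpoint and the fields described above (message.content, message.thinking, eval_count). It is an illustrative reconstruction of the probe's logic, not the exact benchmark/test_thinking.py source, and the system-prompt wording is assumed.

```python
import time
import requests

def probe_response_mode(model: str, host: str = "http://localhost:11434") -> dict:
    """Send the deterministic two-message probe and split visible vs hidden output.

    Illustrative reconstruction of benchmark/test_thinking.py, not the exact source.
    """
    payload = {
        "model": model,
        "stream": False,
        "messages": [
            {"role": "system", "content": "Answer with the final result only."},  # assumed wording
            {"role": "user", "content": "What is 2+2?"},
        ],
    }
    start = time.monotonic()
    body = requests.post(f"{host}/api/chat", json=payload, timeout=120).json()
    elapsed = time.monotonic() - start

    content = body.get("message", {}).get("content", "")
    thinking = body.get("message", {}).get("thinking", "") or ""
    eval_count = body.get("eval_count", 0)
    return {
        "model": model,
        "elapsed_s": round(elapsed, 2),
        "tokens_per_s": round(eval_count / elapsed, 1) if elapsed else None,
        "content_chars": len(content),
        "thinking_chars": len(thinking),
        # Non-empty hidden reasoning => reasoning-first response mode in this probe.
        "reasoning_first": bool(thinking),
    }
```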
9.2 Response-Mode Interpretation
Model Families in This Run
Direct-response profile
gemma3:4b and gemma3:12b show the lowest TTFT in this run and the highest chat UX outcomes, consistent with chat-first serving behavior.
Reasoning-heavy profile
qwen3:*, deepseek-r1:*, and gpt-oss:20b can preserve useful reasoning quality but incur larger TTFT overhead in interactive chat settings.
9.3 Implications for Metrics in a Fast-Chat Benchmark
| Metric | Interpretation Under CoT Behavior |
|---|---|
| TTFT | Most sensitive metric for chat UX; hidden reasoning inflates first-token delay. |
| Tokens/sec | Still valid for generation throughput, but does not cancel first-token waiting cost. |
| Wall-clock response feel | Dominated by TTFT in chat flows; users perceive delay before any visible answer. |
| Accuracy | Reasoning overhead may help on harder tasks, but can underperform in chat-first utility when latency is prioritized. |
9.4 Benchmark Positioning
This report should be read as a use-case benchmark, not a universal intelligence ranking. The scoring favors models that are both correct and immediately responsive for conversational product UX. A model can be strong in deeper reasoning workloads and still rank lower here if its response mode is slower.
9.5 Practical Reading of Results
For this app, lower TTFT and high chat compliance are primary deployment criteria. The CoT framing explains why some reasoning-heavy models underperform in this benchmark despite competitive quality on non-chat workloads.
10. Magistral Case Study (Excluded from Final Ranking)
magistral:24b was included in our local benchmark and ad-hoc test programs because it is present in our app model catalog,
performs well in many interactive sessions, and is a strategically relevant European model family to evaluate in real operating conditions.
In this report, Magistral is documented as a dedicated case study because its behavior diverged across evaluation regimes. The final ranking exclusion is therefore presented after evidence review, and is treated as a hardware-profile reliability decision, not a general rejection of model potential.
Why this is a surprise
Interactive app usage and benchmark stress workload are different operating regimes. A model can feel strong in user-driven chat sessions but still collapse under sustained, back-to-back hard benchmark prompts with strict timeout and fairness controls.
10.1 Quantitative Evidence (DB-backed)
Aggregated benchmark observations show magistral:24b has non-stationary behavior on this hardware profile:
it can produce acceptable outcomes in some benchmark executions, and collapse in others.
| Observed regime | Interpretation |
|---|---|
| Stable execution episodes | Model completes substantial workload with usable quality. |
| Collapse episodes | Systematic timeout/hang patterns dominate and break comparability. |
| Intermediate episodes | Partial completion with elevated error density and degraded throughput. |
10.2 Failure Modes Detected
- On a subset of hard logic/math prompts, generation can enter long non-productive loops.
- These events correlate with timeout/connection failures and unstable endpoint behavior.
- Loop episodes can degrade Ollama stability and require manual unload/restart for recovery.
- Performance is substantially better on lower-complexity prompts, indicating a difficulty-dependent failure profile.
Minimal isolation tests reproduced the same pattern, so this is not attributed to benchmark-framework orchestration. Operationally, these cases exceed user-friendly latency bounds.
10.3 Error Signatures
- HTTPConnectionPool(...): Read timed out. (read timeout=60)
- RemoteDisconnected('Remote end closed connection without response')
- Max retries exceeded ... /api/stream after instability propagation
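These signatures can be bucketed when scanning benchmark logs for collapse episodes. The sketch below is illustrative and not part of the benchmark framework; the patterns follow the observed error text verbatim.

```python
import re

# Signatures observed during magistral:24b collapse episodes (Section 10.3).
SIGNATURES = {
    "read_timeout": re.compile(r"Read timed out\. \(read timeout=\d+\)"),
    "remote_disconnect": re.compile(r"RemoteDisconnected\('Remote end closed connection without response'\)"),
    "retries_exhausted": re.compile(r"Max retries exceeded .* /api/stream"),
}

def classify_error(line: str) -> str:
    """Map a raw error line to one of the observed signature buckets."""
    for label, pattern in SIGNATURES.items():
        if pattern.search(line):
            return label
    return "other"
```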
10.4 Thermal Risk Observation
During the latest stress reproduction, direct sensor checks reported:
GPU temperature: 97.6°C, GPU utilization: 100%,
and panel telemetry showing hottest GPU around 97.7°C with average GPU around 92.2°C.
This is outside a comfortable sustained operating envelope for user-facing reliability.
10.5 Interpretation
The evidence indicates a difficulty-dependent reliability instability on this hardware profile. The model can provide acceptable outputs on lower-complexity prompts, but on a subset of hard logic/math prompts it shows reproducible failure modes: prolonged non-productive generation, timeout/connection failures, and thermal escalation. Isolation with minimal direct scripts supports that this is not a benchmark-framework orchestration artifact. The resulting decision criterion is operational: user-facing latency reliability and thermal safety.
10.6 Phased Ad-Hoc Treatment
We executed a dedicated phased ad-hoc treatment against the app /api/stream endpoint
with final-answer-only prompting and no retries, then rejudged with gpt-oss:20b.
| Phase | Result |
|---|---|
| Preflight | Endpoints/models/datasets validated (with hardened checks). |
| Smoke gate (8 tasks) | Passed: ok_runs=8, non_empty_runs=8, timeout_runs=0. |
| Full canonical run (80 tasks) | Transport stable: ok_rate=1.0, non_empty_rate=1.0, timeout_rate=0.0. |
| Rejudge (gpt-oss:20b) | Official semantic pass: 44/80 = 55%; manual-adjusted signal around ~60% after parse-failure review. |
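The gate rates in the table can be recomputed from the phased ad-hoc artifacts. The sketch below assumes each artifact holds a list of per-task records with `ok`, `response`, and `timed_out` style fields; this layout is an assumption for illustration, not a documented schema.

```python
import json

def gate_rates(path: str) -> dict:
    """Recompute ok/non-empty/timeout rates from a phased ad-hoc artifact.

    Field names (`ok`, `response`, `timed_out`) are assumed, not documented.
    """
    with open(path, encoding="utf-8") as fh:
        tasks = json.load(fh)
    total = len(tasks)
    ok = sum(1 for t in tasks if t.get("ok"))
    non_empty = sum(1 for t in tasks if (t.get("response") or "").strip())
    timeouts = sum(1 for t in tasks if t.get("timed_out"))
    return {
        "ok_rate": ok / total,
        "non_empty_rate": non_empty / total,
        "timeout_rate": timeouts / total,
    }

# Example against the canonical 80-task artifact:
# gate_rates("benchmark/magistral/results/2026-02-20_magistral_adhoc_canonical_80_raw.json")
```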
10.7 Minimal CLI Reproduction (Logic + Math)
Final direct CLI evidence is consolidated in
benchmark/magistral/results/2026-02-21_CLI_logic_math_results.log.
Only these tests are used for this subsection.
| Task | Prompt regime | Observed outcome |
|---|---|---|
| logic_037 | Raw question | Very long reasoning loop (real 801.16s), drifting conclusions and unstable termination. |
| logic_037 | Short-answer prefixed (2 runs) | Faster (real 9.17s, 4.98s) but still non-compliant formatting (explanations despite final-only request). |
| Absolute-value equation | Raw question | Correct solution set after excessive chain-of-thought expansion (real 224.67s). |
| Absolute-value equation | Short-answer prefixed (2 runs) | Fast but wrong outputs (real 8.53s, 2.13s), with solution drift across repeats. |
CLI conclusion from these tests: prompt constraints can reduce latency, but do not stabilize correctness or format compliance. Raw mode can recover correctness on some math items, yet with impractical latency and uncontrolled reasoning sprawl.
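A minimal way to reproduce the two prompt regimes programmatically is sketched below. The `ollama run` invocation is standard, but the short-answer prefix wording is an assumption consistent with the final-answer-only prompting described above, not the exact commands from the consolidated log.

```python
import subprocess
import time

SHORT_ANSWER_PREFIX = "Answer with the final result only, no explanation. "  # assumed wording

def timed_cli_run(model: str, prompt: str, short_answer: bool = False) -> tuple[float, str]:
    """Run `ollama run <model> <prompt>` and report wall-clock seconds plus output."""
    full_prompt = (SHORT_ANSWER_PREFIX + prompt) if short_answer else prompt
    start = time.monotonic()
    result = subprocess.run(
        ["ollama", "run", model, full_prompt],
        capture_output=True, text=True, timeout=1200,
    )
    return time.monotonic() - start, result.stdout.strip()

# Raw vs short-answer regime on the same prompt, as in Section 10.7:
# elapsed, answer = timed_cli_run("magistral:24b", "<logic_037 prompt>", short_answer=False)
```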
10.8 Evidence Package
Reproducible references for this case study are stored in:
benchmark/magistral/scripts/ and
benchmark/magistral/results/
(including 2026-02-21_CLI_logic_math_results.log).
10.9 Updated Learning
- Prompt-regime switching exposes a hard tradeoff: short answers improve speed, raw answers sometimes preserve math correctness, but neither regime is stationary.
- Overthinking is a termination-control failure mode: repeated self-restarts and answer rewrites inflate latency without quality gain.
- For this environment, evaluation must track both correctness and operational behavior (latency variance, format compliance, and server stress), not score alone.
- Phased ad-hoc treatment remains useful for containment, but does not eliminate intrinsic variance under unrestricted CLI prompting.
10.10 Decision
- Exclude magistral:24b from the final ranking on this hardware profile to preserve fairness, comparability, and operational safety.
- Keep magistral:24b available in app model lists as optional/experimental, with an explicit caveat on prompt sensitivity and variance.
- Keep it as a stress-case model for robustness and termination-control hardening.
10.11 Final Conclusion (Technical + Humanist Approach)
Technical: in this environment, magistral:24b is non-stationary across prompt regimes.
Short constraints can accelerate outputs but may degrade correctness; raw mode can preserve some reasoning quality but at prohibitive latency and unstable stopping behavior.
Humanist approach: the model often reads like a deep, open-minded thinker with strong intellectual tone, but still feels under-refined for predictable production behavior even on easy-to-medium logic/math tasks.
Final position: keep Magistral as a valuable experimental model and stress benchmark subject, but outside the final published ranking for this hardware profile.
11. Limitations and Future Work
11.1 Limitations
- Results remain hardware-profile specific; conclusions are valid for this local stack, not universal model ranking claims.
- Prompt-regime sensitivity is material: the same model/task can shift between fast-but-wrong and slow-but-correct trajectories.
- Semantic judging is still an estimation layer: we keep multiple judges, but disagreement and parser-format failures can still affect headline interpretation.
- Aggregate telemetry can hide tail risk; extreme loop events (very long reasoning runs) are better captured by explicit CLI stress probes than by averages alone.
- Current quality summaries do not yet include a formal consensus score across judges (agreement rate / disagreement taxonomy published as a single KPI).
11.2 Future Work
- Run repeated full benchmarks per model/profile and publish confidence intervals for pass rate, TTFT, and throughput.
- Add a dedicated termination-control track (repetition ratio, self-restart count, final-answer mutation after first conclusion).
- Publish multi-judge consensus KPIs (agreement matrix for gpt-oss:20b, Codex, Claude) alongside model scores.
- Keep a dual-profile evaluation policy: constrained short-answer profile and natural/raw profile, both required for promotion decisions.
- Promote the CLI logic+math micro-suite (2026-02-21_CLI_logic_math_results.log) as a permanent gate for variance detection.
- Add latency-percentile and long-tail reporting (p95/p99 + max) to complement means and improve operational risk visibility.
12. Reproducibility
12.1 Data Source
Metrics and conclusions combine:
DB-backed benchmark evidence from db/benchmark.db (run 9cc182d7-74c0-4ac2-a0eb-3ed86afd142b)
and phased ad-hoc artifacts in benchmark/magistral/results/
(notably 2026-02-20_magistral_adhoc_smoke_one_each.json,
2026-02-20_magistral_adhoc_canonical_80_raw.json,
2026-02-20_magistral_adhoc_canonical_80_rejudge_gptoss.json).
12.2 Outcome Definitions
Chat tracks: pass/fail from recorded task status.
Shortform tracks: primary run outcome is kept from recorded task status, and semantic opinions from
gpt-oss:20b, Codex, and Claude are stored in DB as parallel judge layers for this full run.
Several-judges philosophy: no single judge is treated as absolute truth. We keep all judge evaluations side by side to measure consensus, detect disagreement pockets, and separate model behavior from judge-specific bias. Ranking conclusions are based on convergent signals across judges plus operational metrics (latency, stability, and transport reliability), not on one semantic scorer alone.
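A small sketch of the agreement measurement this philosophy implies is given below. The per-judge verdict layout is assumed (the report only states that judge layers are stored side by side in the DB), so treat the data structure as illustrative.

```python
from itertools import combinations

def pairwise_agreement(verdicts: dict[str, dict[str, bool]]) -> dict[tuple[str, str], float]:
    """Fraction of shared tasks on which each pair of judges gives the same pass/fail verdict.

    `verdicts` maps judge name -> {task_id: passed}; the layout is illustrative.
    """
    rates = {}
    for a, b in combinations(sorted(verdicts), 2):
        shared = verdicts[a].keys() & verdicts[b].keys()
        if shared:
            agree = sum(verdicts[a][t] == verdicts[b][t] for t in shared)
            rates[(a, b)] = agree / len(shared)
    return rates

# Example layout: {"gpt-oss:20b": {...}, "codex": {...}, "claude": {...}}
```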
12.3 Radar Axes
TTFT, Throughput, Math, Reasoning, Instruction, Chat UX, Thermal (normalized 0-100).