Benchmark Report (Guided)
A quantitative assessment of 8 local models on Apple Silicon using the KISS benchmark framework.
Abstract
This report presents a scientific evaluation of run 9cc182d7-74c0-4ac2-a0eb-3ed86afd142b using the following analysis axes: responsiveness, streaming quality, chat accuracy, task versatility, and thermal stability.
Task versatility is decomposed into Math, Reasoning, and Instruction, yielding a 7-axis comparative radar profile per model.
The final distribution is 401/640 passed (62.7%) and 239/640 failed (37.3%) after GPT Codex 5.3 semantic re-judging of shortform samples.
Key Findings
Principal Conclusions
- Best composite model: gemma3:4b with fitness 88.7/100.
- Fastest response: gemma3:4b at 11.2s mean TTFT.
- Highest throughput: gemma3:4b at 89.3 tok/s.
- Highest task reliability: gemma3:12b with 77.5% pass rate.
1. Introduction
1.1 Motivation
The objective is to derive operational model rankings from observed benchmark behavior under a unified local-inference environment.
1.2 Scope
Scope includes eight models over eight datasets (five shortform + three chat), totaling 640 model-task evaluations.
1.3 Experimental Endpoint
All tasks in scope are terminally labeled and included in comparative statistics.
1.4 Live Benchmark Monitor Companion
The companion page benchmark_monitor.html is a read-only, database-backed monitor for ongoing runs and the fastest way to follow benchmark execution in real time. It shows run scope and state (models/datasets/tasks), rolling performance telemetry (TTFT, tok/s, CPU and GPU utilization, GPU temperature, disk I/O), dataset-level completion with per-task pills that switch from progress to pass rate as each dataset closes, and a detailed “Last Task Executed” panel with workflow stage, prompt, streaming response, evaluation result, and task KPIs. It also handles resets and resumes correctly by tracking run identity and live task events, so readers can trust what they see even when a run restarts, resumes, or is only partially completed. Open it directly from this report at Live Benchmark Monitor.
2. Methodology
2.1 Five Principal Axes
Axis 1: Responsiveness (TTFT)
Lower TTFT maps to higher normalized score.
Axis 2: Streaming Quality (tok/s)
Higher sustained token throughput maps to higher score.
Axis 3: Chat Accuracy
Measured as pass rate over chat_instruction, chat_memory, and chat_safety tracks.
Axis 4: Task Versatility
Decomposed into Math (cs_engineering + physics), Reasoning (critical_thinking + logic_deduction), and Instruction (chat_instruction).
Axis 5: Thermal Stability
Lower mean GPU temperature maps to higher normalized thermal score.
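For the lower-is-better axes (TTFT and GPU temperature), normalization inverts the scale so that faster and cooler models score higher. The sketch below shows one plausible min-max normalization over the observed range; the framework's exact scaling bounds are not restated in this report, so treat it as illustrative.

```python
def normalize(value: float, lo: float, hi: float, lower_is_better: bool = False) -> float:
    """Map a raw metric into [0, 100] by min-max scaling.

    Sketch of one plausible normalization; the benchmark framework's
    exact scaling bounds are not specified in this report.
    """
    if hi == lo:
        return 100.0
    score = (value - lo) / (hi - lo) * 100.0
    # Invert lower-is-better axes (TTFT, GPU temperature) so that
    # faster / cooler maps to a higher normalized score.
    return 100.0 - score if lower_is_better else score

# Examples using observed ranges from this run:
ttft_score = normalize(11.2, lo=11.2, hi=26.3, lower_is_better=True)  # -> 100.0
temp_score = normalize(82.6, lo=45.1, hi=82.6, lower_is_better=True)  # -> 0.0
```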
2.2 Scoring Formula
Composite fitness follows the benchmark family weighting over the normalized axis scores, with the speed component defined as speed = 0.6 × TTFT_score + 0.4 × throughput_score.
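For illustration, the speed component can be computed directly from the normalized TTFT and throughput scores. The composite weights over the remaining axes are defined by the benchmark family and are not restated here, so the `axis_weights` argument below is a placeholder, not the published weighting.

```python
def speed_subscore(ttft_score: float, throughput_score: float) -> float:
    """Speed component from Section 2.2:
    speed = 0.6 * TTFT_score + 0.4 * throughput_score."""
    return 0.6 * ttft_score + 0.4 * throughput_score

def composite_fitness(axis_scores: dict[str, float],
                      axis_weights: dict[str, float]) -> float:
    """Weighted sum over normalized axis scores (0-100).

    `axis_weights` stands in for the benchmark family weighting,
    which is not restated in this report.
    """
    return sum(axis_weights[name] * axis_scores[name] for name in axis_weights)
```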
2.3 Hardware Configuration
| Component | Specification |
|---|---|
| Device | Mac mini M4 Pro |
| CPU | Apple M4 Pro, 12-core |
| GPU | 16-core integrated |
| Unified RAM | 24 GB |
| Inference Engine | Ollama 0.13.5 |
2.4 Model Registry
| Model | Origin | Params | Quant | Architecture | Warmup TTFT |
|---|---|---|---|---|---|
| deepseek-r1:14b | DeepSeek | 14.1B | Q4_K_M | Reasoning-heavy | 11.1s |
| deepseek-r1:8b | DeepSeek | 8.2B | Q4_K_M | Reasoning-heavy | 7.7s |
| gemma3:12b | Google | 12.2B | Q4_K_M | Direct-response | 9.5s |
| gemma3:4b | Google | 4.3B | Q4_K_M | Direct-response | 5.6s |
| gpt-oss:20b | OpenAI | 20.9B | MXFP4 | Reasoning-heavy | 11.2s |
| qwen3:14b | Alibaba | 14.5B | Q4_K_M | Reasoning-heavy | 9.5s |
| qwen3:4b | Alibaba | 4.0B | Q4_K_M | Reasoning-heavy | 6.5s |
| qwen3:8b | Alibaba | 8.2B | Q4_K_M | Reasoning-heavy | 7.8s |
2.5 Dataset Inventory
| Dataset | Kind | Total | Passed | Failed | Pass Rate |
|---|---|---|---|---|---|
| chat_instruction | chat | 80 | 58 | 22 | 72.5% |
| chat_memory | chat | 80 | 67 | 13 | 83.8% |
| chat_safety | chat | 80 | 58 | 22 | 72.5% |
| critical_thinking | shortform | 80 | 22 | 58 | 27.5% |
| cs_engineering | shortform | 80 | 41 | 39 | 51.2% |
| history | shortform | 80 | 52 | 28 | 65.0% |
| logic_deduction | shortform | 80 | 57 | 23 | 71.2% |
| physics | shortform | 80 | 46 | 34 | 57.5% |
3. Results: Short-Form Evaluation
3.1 Per-Model Accuracy
| Rank | Model | Pass | Fail | Shortform | Chat UX | TTFT | TPS | Temp | Fitness |
|---|---|---|---|---|---|---|---|---|---|
| 1 | gemma3:4b | 72.5% | 27.5% | 60.0% | 93.3% | 11.2s | 89.3 | 45.3°C | 88.7 |
| 2 | gemma3:12b | 77.5% | 22.5% | 68.0% | 93.3% | 11.3s | 38.7 | 45.1°C | 81.6 |
| 3 | deepseek-r1:8b | 65.0% | 35.0% | 54.0% | 83.3% | 18.4s | 36.6 | 78.5°C | 65.0 |
| 4 | qwen3:8b | 61.2% | 38.8% | 58.0% | 66.7% | 16.2s | 34.5 | 72.3°C | 59.8 |
| 5 | qwen3:4b | 56.2% | 43.8% | 54.0% | 60.0% | 15.4s | 49.9 | 68.6°C | 59.2 |
| 6 | gpt-oss:20b | 70.0% | 30.0% | 64.0% | 80.0% | 25.4s | 41.4 | 58.1°C | 57.8 |
| 7 | deepseek-r1:14b | 41.2% | 58.8% | 28.0% | 63.3% | 19.8s | 20.0 | 81.0°C | 45.4 |
| 8 | qwen3:14b | 57.5% | 42.5% | 50.0% | 70.0% | 26.3s | 17.7 | 82.6°C | 45.0 |
3.2 Speed Metrics
Average TTFT (lower is better)
4. Seven-Axis Radar Visualization
The radar chart uses this report's seven-axis family: TTFT, Throughput, Math, Reasoning, Instruction, Chat UX, and Thermal. All values are normalized to 0-100.
5. Failure Analysis
5.1 KO Concentration
Dominant KO datasets
Failure mass is concentrated in critical_thinking, cs_engineering, and physics.
5.2 Failure Summary by Dataset
| Dataset | KO Count | KO Rate | Total |
|---|---|---|---|
| critical_thinking | 58 | 72.5% | 80 |
| cs_engineering | 39 | 48.8% | 80 |
| physics | 34 | 42.5% | 80 |
| history | 28 | 35.0% | 80 |
| logic_deduction | 23 | 28.8% | 80 |
| chat_instruction | 22 | 27.5% | 80 |
| chat_safety | 22 | 27.5% | 80 |
| chat_memory | 13 | 16.2% | 80 |
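The per-dataset KO rates above can be recomputed directly from db/benchmark.db. The sketch below assumes a task-level table named `tasks` with `dataset` and `status` columns; the actual schema is not documented in this report, so the table and column names are illustrative only.

```python
import sqlite3

# Hypothetical schema: a `tasks` table with `dataset` and `status` columns.
# The real db/benchmark.db schema may use different names.
QUERY = """
SELECT dataset,
       SUM(CASE WHEN status = 'fail' THEN 1 ELSE 0 END) AS ko_count,
       COUNT(*) AS total
FROM tasks
GROUP BY dataset
ORDER BY ko_count DESC;
"""

with sqlite3.connect("db/benchmark.db") as conn:
    for dataset, ko_count, total in conn.execute(QUERY):
        print(f"{dataset:<18} {ko_count:>3} / {total}  ({100.0 * ko_count / total:.1f}%)")
```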
6. Chat UX Track Results
6.1 Turn Compliance Rate (Task-Level)
| Model | Tasks | Chat % | TTFT | tok/s |
|---|---|---|---|---|
| gemma3:12b | 30 | 93.3% | 11.3s | 38.7 |
| gemma3:4b | 30 | 93.3% | 11.2s | 89.3 |
| deepseek-r1:8b | 30 | 83.3% | 18.4s | 36.6 |
| gpt-oss:20b | 30 | 80.0% | 25.4s | 41.4 |
| qwen3:14b | 30 | 70.0% | 26.3s | 17.7 |
| qwen3:8b | 30 | 66.7% | 16.2s | 34.5 |
| deepseek-r1:14b | 30 | 63.3% | 19.8s | 20.0 |
| qwen3:4b | 30 | 60.0% | 15.4s | 49.9 |
6.2 Chat TTFT Distribution
Observed per-model mean TTFT spans 11.2s to 26.3s.
7. Thermal Analysis
Average GPU Temperature (lower is better)
Thermal spread is 45.1°C to 82.6°C across evaluated models.
8. Composite Rankings
8.1 Efficiency Frontier
The efficiency frontier is defined by models that preserve high chat quality with lower latency and manageable thermal load under the fitness objective.
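One way to make the frontier explicit is a Pareto filter over (chat quality up, TTFT down, GPU temperature down). The sketch below uses values from the Section 3.1 table and is illustrative only; it is not the exact fitness objective used for the ranking.

```python
# Illustrative Pareto filter: a model is on the frontier if no other model is
# better-or-equal on chat quality, TTFT, and temperature with at least one
# strict improvement. Values taken from the Section 3.1 table.
models = {
    "gemma3:4b":      {"chat": 93.3, "ttft": 11.2, "temp": 45.3},
    "gemma3:12b":     {"chat": 93.3, "ttft": 11.3, "temp": 45.1},
    "deepseek-r1:8b": {"chat": 83.3, "ttft": 18.4, "temp": 78.5},
    "gpt-oss:20b":    {"chat": 80.0, "ttft": 25.4, "temp": 58.1},
}

def dominates(a: dict, b: dict) -> bool:
    better_or_equal = a["chat"] >= b["chat"] and a["ttft"] <= b["ttft"] and a["temp"] <= b["temp"]
    strictly_better = a["chat"] > b["chat"] or a["ttft"] < b["ttft"] or a["temp"] < b["temp"]
    return better_or_equal and strictly_better

frontier = [name for name, m in models.items()
            if not any(dominates(other, m) for other in models.values() if other is not m)]
print(frontier)  # ['gemma3:4b', 'gemma3:12b']
```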
8.2 Recommendations by Use Case
| Use Case | Recommended Model | Rationale |
|---|---|---|
| Balanced Production | gemma3:4b | Highest composite fitness (88.7). |
| Lowest Latency | gemma3:4b | Lowest mean TTFT (11.2s). |
| Highest Throughput | gemma3:4b | Highest mean output rate (89.3 tok/s). |
| Highest Reliability | gemma3:12b | Best global pass rate (77.5%). |
9. Chain-of-Thought (CoT) Analysis
This benchmark is explicitly oriented to the application use case: fast, chat-friendly interaction. In that context, CoT behavior is not a side detail; it directly affects first-token latency and perceived responsiveness.
9.1 Behavioral Mechanism
As characterized in benchmark/test_thinking.py, some models emit hidden reasoning before visible output.
This produces an architecture-level latency overhead: the user sees the first token only after internal reasoning finishes.
Concretely, benchmark/test_thinking.py sends a deterministic two-message chat request to each model
(system: final-answer-only, user: "What is 2+2?") and inspects the Ollama response fields.
It records:
eval_count (generated tokens), end-to-end elapsed time, derived tokens/sec, visible message.content length,
hidden message.thinking length, and the thinking/content character split.
If message.thinking is non-empty, the model is classified as using a reasoning-first response mode in this probe.
This is a behavioral probe, not a quality benchmark: it is intentionally small and controlled to isolate response mode effects. It does not grade task correctness breadth, nor does it estimate full benchmark variance across prompt families.
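A minimal sketch of the probe's shape is shown below, using the Ollama /api/chat endpoint and the fields described above (message.content, message.thinking, eval_count). It is an illustrative reconstruction of the probe's logic, not the exact benchmark/test_thinking.py source, and the system-prompt wording is assumed.

```python
import time
import requests

def probe_response_mode(model: str, host: str = "http://localhost:11434") -> dict:
    """Send the deterministic two-message probe and split visible vs hidden output.

    Illustrative reconstruction of benchmark/test_thinking.py, not the exact source.
    """
    payload = {
        "model": model,
        "stream": False,
        "messages": [
            {"role": "system", "content": "Answer with the final result only."},  # assumed wording
            {"role": "user", "content": "What is 2+2?"},
        ],
    }
    start = time.monotonic()
    body = requests.post(f"{host}/api/chat", json=payload, timeout=120).json()
    elapsed = time.monotonic() - start

    content = body.get("message", {}).get("content", "")
    thinking = body.get("message", {}).get("thinking", "") or ""
    eval_count = body.get("eval_count", 0)
    return {
        "model": model,
        "elapsed_s": round(elapsed, 2),
        "tokens_per_s": round(eval_count / elapsed, 1) if elapsed else None,
        "content_chars": len(content),
        "thinking_chars": len(thinking),
        # Non-empty hidden reasoning => reasoning-first response mode in this probe.
        "reasoning_first": bool(thinking),
    }
```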
9.2 Response-Mode Interpretation
Model Families in This Run
Direct-response profile
gemma3:4b and gemma3:12b show the lowest TTFT in this run and the highest chat UX outcomes, consistent with chat-first serving behavior.
Reasoning-heavy profile
qwen3:*, deepseek-r1:*, and gpt-oss:20b can preserve useful reasoning quality but incur larger TTFT overhead in interactive chat settings.
9.3 Implications for Metrics in a Fast-Chat Benchmark
| Metric | Interpretation Under CoT Behavior |
|---|---|
| TTFT | Most sensitive metric for chat UX; hidden reasoning inflates first-token delay. |
| Tokens/sec | Still valid for generation throughput, but does not cancel first-token waiting cost. |
| Wall-clock response feel | Dominated by TTFT in chat flows; users perceive delay before any visible answer. |
| Accuracy | Reasoning overhead may help on harder tasks, but can underperform in chat-first utility when latency is prioritized. |
9.4 Benchmark Positioning
This report should be read as a use-case benchmark, not a universal intelligence ranking. The scoring favors models that are both correct and immediately responsive for conversational product UX. A model can be strong in deeper reasoning workloads and still rank lower here if its response mode is slower.
9.5 Practical Reading of Results
For this app, lower TTFT and high chat compliance are primary deployment criteria. The CoT framing explains why some reasoning-heavy models underperform in this benchmark despite competitive quality on non-chat workloads.
10. Magistral Case Study (Excluded from Final Ranking)
magistral:24b was included in our local benchmark and ad-hoc test programs because it is present in our app model catalog,
performs well in many interactive sessions, and is a strategically relevant European model family to evaluate in real operating conditions.
In this report, Magistral is documented as a dedicated case study because its behavior diverged across evaluation regimes. The final ranking exclusion is therefore presented after evidence review, and is treated as a hardware-profile reliability decision, not a general rejection of model potential.
Why this is a surprise
Interactive app usage and benchmark stress workload are different operating regimes. A model can feel strong in user-driven chat sessions but still collapse under sustained, back-to-back hard benchmark prompts with strict timeout and fairness controls.
10.1 Quantitative Evidence (DB-backed)
Aggregated benchmark observations show magistral:24b has non-stationary behavior on this hardware profile:
it can produce acceptable outcomes in some benchmark executions, and collapse in others.
| Observed regime | Interpretation |
|---|---|
| Stable execution episodes | Model completes substantial workload with usable quality. |
| Collapse episodes | Systematic timeout/hang patterns dominate and break comparability. |
| Intermediate episodes | Partial completion with elevated error density and degraded throughput. |
10.2 Failure Modes Detected
- On a subset of hard logic/math prompts, generation can enter long non-productive loops.
- These events correlate with timeout/connection failures and unstable endpoint behavior.
- Loop episodes can degrade Ollama stability and require manual unload/restart for recovery.
- Performance is substantially better on lower-complexity prompts, indicating a difficulty-dependent failure profile.
Minimal isolation tests reproduced the same pattern, so this is not attributed to benchmark-framework orchestration. Operationally, these cases exceed user-friendly latency bounds.
10.3 Error Signatures
- HTTPConnectionPool(...): Read timed out. (read timeout=60)
- RemoteDisconnected('Remote end closed connection without response')
- Max retries exceeded ... /api/stream after instability propagation
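These signatures can be bucketed when scanning benchmark logs for collapse episodes. The sketch below is illustrative and not part of the benchmark framework; the patterns follow the observed error text verbatim.

```python
import re

# Signatures observed during magistral:24b collapse episodes (Section 10.3).
SIGNATURES = {
    "read_timeout": re.compile(r"Read timed out\. \(read timeout=\d+\)"),
    "remote_disconnect": re.compile(r"RemoteDisconnected\('Remote end closed connection without response'\)"),
    "retries_exhausted": re.compile(r"Max retries exceeded .* /api/stream"),
}

def classify_error(line: str) -> str:
    """Map a raw error line to one of the observed signature buckets."""
    for label, pattern in SIGNATURES.items():
        if pattern.search(line):
            return label
    return "other"
```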
10.4 Thermal Risk Observation
During the latest stress reproduction, direct sensor checks reported:
GPU temperature: 97.6°C, GPU utilization: 100%,
and panel telemetry showing hottest GPU around 97.7°C with average GPU around 92.2°C.
This is outside a comfortable sustained operating envelope for user-facing reliability.
10.5 Interpretation
The evidence indicates a difficulty-dependent reliability instability on this hardware profile. The model can provide acceptable outputs on lower-complexity prompts, but on a subset of hard logic/math prompts it shows reproducible failure modes: prolonged non-productive generation, timeout/connection failures, and thermal escalation. Isolation with minimal direct scripts supports that this is not a benchmark-framework orchestration artifact. The resulting decision criterion is operational: user-facing latency reliability and thermal safety.
10.6 Phased Ad-Hoc Treatment
We executed a dedicated phased ad-hoc treatment against the app /api/stream endpoint
with final-answer-only prompting and no retries, then rejudged with gpt-oss:20b.
| Phase | Result |
|---|---|
| Preflight | Endpoints/models/datasets validated (with hardened checks). |
| Smoke gate (8 tasks) | Passed: ok_runs=8, non_empty_runs=8, timeout_runs=0. |
| Full canonical run (80 tasks) | Transport stable: ok_rate=1.0, non_empty_rate=1.0, timeout_rate=0.0. |
| Rejudge (gpt-oss:20b) | Official semantic pass: 44/80 = 55%; manual-adjusted signal around ~60% after parse-failure review. |
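The gate rates in the table can be recomputed from the phased ad-hoc artifacts. The sketch below assumes each artifact holds a list of per-task records with `ok`, `response`, and `timed_out` style fields; this layout is an assumption for illustration, not a documented schema.

```python
import json

def gate_rates(path: str) -> dict:
    """Recompute ok/non-empty/timeout rates from a phased ad-hoc artifact.

    Field names (`ok`, `response`, `timed_out`) are assumed, not documented.
    """
    with open(path, encoding="utf-8") as fh:
        tasks = json.load(fh)
    total = len(tasks)
    ok = sum(1 for t in tasks if t.get("ok"))
    non_empty = sum(1 for t in tasks if (t.get("response") or "").strip())
    timeouts = sum(1 for t in tasks if t.get("timed_out"))
    return {
        "ok_rate": ok / total,
        "non_empty_rate": non_empty / total,
        "timeout_rate": timeouts / total,
    }

# Example against the canonical 80-task artifact:
# gate_rates("benchmark/magistral/results/2026-02-20_magistral_adhoc_canonical_80_raw.json")
```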
10.7 Minimal CLI Reproduction (Logic + Math)
Final direct CLI evidence is consolidated in
benchmark/magistral/results/2026-02-21_CLI_logic_math_results.log.
Only these tests are used for this subsection.
| Task | Prompt regime | Observed outcome |
|---|---|---|
| logic_037 | Raw question | Very long reasoning loop (real 801.16s), drifting conclusions and unstable termination. |
| logic_037 | Short-answer prefixed (2 runs) | Faster (real 9.17s, 4.98s) but still non-compliant formatting (explanations despite final-only request). |
| Absolute-value equation | Raw question | Correct solution set after excessive chain-of-thought expansion (real 224.67s). |
| Absolute-value equation | Short-answer prefixed (2 runs) | Fast but wrong outputs (real 8.53s, 2.13s), with solution drift across repeats. |
CLI conclusion from these tests: prompt constraints can reduce latency, but do not stabilize correctness or format compliance. Raw mode can recover correctness on some math items, yet with impractical latency and uncontrolled reasoning sprawl.
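A minimal way to reproduce the two prompt regimes programmatically is sketched below. The `ollama run` invocation is standard, but the short-answer prefix wording is an assumption consistent with the final-answer-only prompting described above, not the exact commands from the consolidated log.

```python
import subprocess
import time

SHORT_ANSWER_PREFIX = "Answer with the final result only, no explanation. "  # assumed wording

def timed_cli_run(model: str, prompt: str, short_answer: bool = False) -> tuple[float, str]:
    """Run `ollama run <model> <prompt>` and report wall-clock seconds plus output."""
    full_prompt = (SHORT_ANSWER_PREFIX + prompt) if short_answer else prompt
    start = time.monotonic()
    result = subprocess.run(
        ["ollama", "run", model, full_prompt],
        capture_output=True, text=True, timeout=1200,
    )
    return time.monotonic() - start, result.stdout.strip()

# Raw vs short-answer regime on the same prompt, as in Section 10.7:
# elapsed, answer = timed_cli_run("magistral:24b", "<logic_037 prompt>", short_answer=False)
```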
10.8 Evidence Package
Reproducible references for this case study are stored in:
benchmark/magistral/scripts/ and
benchmark/magistral/results/
(including 2026-02-21_CLI_logic_math_results.log).
10.9 Updated Learning
- Prompt-regime switching exposes a hard tradeoff: short answers improve speed, raw answers sometimes preserve math correctness, but neither regime is stationary.
- Overthinking is a termination-control failure mode: repeated self-restarts and answer rewrites inflate latency without quality gain.
- For this environment, evaluation must track both correctness and operational behavior (latency variance, format compliance, and server stress), not score alone.
- Phased ad-hoc treatment remains useful for containment, but does not eliminate intrinsic variance under unrestricted CLI prompting.
10.10 Decision
- Exclude magistral:24b from the final ranking on this hardware profile to preserve fairness, comparability, and operational safety.
- Keep magistral:24b available in app model lists as optional/experimental, with an explicit caveat on prompt sensitivity and variance.
- Keep it as a stress-case model for robustness and termination-control hardening.
10.11 Final Conclusion (Technical + Humanist Approach)
Technical: in this environment, magistral:24b is non-stationary across prompt regimes.
Short constraints can accelerate outputs but may degrade correctness; raw mode can preserve some reasoning quality but at prohibitive latency and unstable stopping behavior.
Humanist approach: the model often reads like a deep, open-minded thinker with strong intellectual tone, but still feels under-refined for predictable production behavior even on easy-to-medium logic/math tasks.
Final position: keep Magistral as a valuable experimental model and stress benchmark subject, but outside the final published ranking for this hardware profile.
11. Limitations and Future Work
11.1 Limitations
- Results remain hardware-profile specific; conclusions are valid for this local stack, not universal model ranking claims.
- Prompt-regime sensitivity is material: the same model/task can shift between fast-but-wrong and slow-but-correct trajectories.
- Semantic judging is still an estimation layer: we keep multiple judges, but disagreement and parser-format failures can still affect headline interpretation.
- Aggregate telemetry can hide tail risk; extreme loop events (very long reasoning runs) are better captured by explicit CLI stress probes than by averages alone.
- Current quality summaries do not yet include a formal consensus score across judges (agreement rate / disagreement taxonomy published as a single KPI).
11.2 Future Work
- Run repeated full benchmarks per model/profile and publish confidence intervals for pass rate, TTFT, and throughput.
- Add a dedicated termination-control track (repetition ratio, self-restart count, final-answer mutation after first conclusion).
- Publish multi-judge consensus KPIs (agreement matrix for gpt-oss:20b, Codex, Claude) alongside model scores.
- Keep a dual-profile evaluation policy: constrained short-answer profile and natural/raw profile, both required for promotion decisions.
- Promote the CLI logic+math micro-suite (2026-02-21_CLI_logic_math_results.log) as a permanent gate for variance detection.
- Add latency-percentile and long-tail reporting (p95/p99 + max) to complement means and improve operational risk visibility.
12. Reproducibility
12.1 Data Source
Metrics and conclusions combine:
DB-backed benchmark evidence from db/benchmark.db (run 9cc182d7-74c0-4ac2-a0eb-3ed86afd142b)
and phased ad-hoc artifacts in benchmark/magistral/results/
(notably 2026-02-20_magistral_adhoc_smoke_one_each.json,
2026-02-20_magistral_adhoc_canonical_80_raw.json,
2026-02-20_magistral_adhoc_canonical_80_rejudge_gptoss.json).
12.2 Outcome Definitions
Chat tracks: pass/fail from recorded task status.
Shortform tracks: primary run outcome is kept from recorded task status, and semantic opinions from
gpt-oss:20b, Codex, and Claude are stored in DB as parallel judge layers for this full run.
Several-judges philosophy: no single judge is treated as absolute truth. We keep all judge evaluations side by side to measure consensus, detect disagreement pockets, and separate model behavior from judge-specific bias. Ranking conclusions are based on convergent signals across judges plus operational metrics (latency, stability, and transport reliability), not on one semantic scorer alone.
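A small sketch of the agreement measurement this philosophy implies is given below. The per-judge verdict layout is assumed (the report only states that judge layers are stored side by side in the DB), so treat the data structure as illustrative.

```python
from itertools import combinations

def pairwise_agreement(verdicts: dict[str, dict[str, bool]]) -> dict[tuple[str, str], float]:
    """Fraction of shared tasks on which each pair of judges gives the same pass/fail verdict.

    `verdicts` maps judge name -> {task_id: passed}; the layout is illustrative.
    """
    rates = {}
    for a, b in combinations(sorted(verdicts), 2):
        shared = verdicts[a].keys() & verdicts[b].keys()
        if shared:
            agree = sum(verdicts[a][t] == verdicts[b][t] for t in shared)
            rates[(a, b)] = agree / len(shared)
    return rates

# Example layout: {"gpt-oss:20b": {...}, "codex": {...}, "claude": {...}}
```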
12.3 Radar Axes
TTFT, Throughput, Math, Reasoning, Instruction, Chat UX, Thermal (normalized 0-100).