Benchmark Report (Guided)

A quantitative assessment of 8 local models on Apple Silicon using the KISS benchmark framework.


Run ID: 9cc182d7...
Date: 2026-02-09
Run Status: completed
Models: 8
Tasks: 640
Ollama: 0.13.5

Abstract

This report presents a scientific evaluation of run 9cc182d7-74c0-4ac2-a0eb-3ed86afd142b using the following analysis axes: responsiveness, streaming quality, chat accuracy, task versatility, and thermal stability.

Task versatility is decomposed into Math, Reasoning, and Instruction, yielding a 7-axis comparative radar profile per model.

The final distribution is 401/640 passed (62.7%) and 239/640 failed (37.3%) after GPT Codex 5.3 semantic re-judging of shortform samples.

Key Findings

Models Evaluated: 8
Total Tasks: 640
Avg Shortform Accuracy: 54.5%
Global Pass Rate: 62.7%

Principal Conclusions

  1. Best composite model: gemma3:4b with fitness 88.7/100.
  2. Fastest response: gemma3:4b at 11.2s mean TTFT.
  3. Highest throughput: gemma3:4b at 89.3 tok/s.
  4. Highest task reliability: gemma3:12b with 77.5% pass rate.

1. Introduction

1.1 Motivation

The objective is to derive operational model rankings from observed benchmark behavior under a unified local-inference environment.

1.2 Scope

Scope includes eight models over eight datasets (five shortform + three chat), totaling 640 model-task evaluations.

1.3 Experimental Endpoint

All tasks in scope are terminally labeled and included in comparative statistics.

1.4 Live Benchmark Monitor Companion

The companion page benchmark_monitor.html is a read-only, database-backed monitor for ongoing runs and the fastest way to follow benchmark execution in real time. It shows run scope and state (models, datasets, tasks), rolling performance telemetry (TTFT, tok/s, CPU and GPU utilization, GPU temperature, disk I/O), and dataset-level completion, with per-task pills that switch from progress to pass rate as each dataset closes. A detailed "Last Task Executed" panel reports the workflow stage, prompt, streaming response, evaluation result, and task KPIs. The monitor also handles resets and resumes correctly by tracking run identity and live task events, so readers can trust what they see even when a run restarts, resumes, or completes only partially. Open it directly from this report at Live Benchmark Monitor.

2. Methodology

2.1 Five Principal Axes

Axis 1: Responsiveness (TTFT)

Lower TTFT maps to a higher normalized score.

🌊 Axis 2: Streaming Quality (tok/s)

Higher sustained token throughput maps to a higher score.

💬 Axis 3: Chat Accuracy

Measured as the pass rate over the chat_instruction, chat_memory, and chat_safety tracks.

🎯 Axis 4: Task Versatility

Decomposed into Math (cs_engineering + physics), Reasoning (critical_thinking + logic_deduction), and Instruction (chat_instruction).

🌡️ Axis 5: Thermal Stability

Lower mean GPU temperature maps to a higher normalized thermal score.

2.2 Scoring Formula

Composite fitness follows the benchmark family weighting:

fitness = 0.50 × chat_ux + 0.30 × speed + 0.20 × shortform_quality

where speed = 0.6 × TTFT_score + 0.4 × throughput_score.
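As a concrete illustration, the composite can be computed from already-normalized sub-scores (0-100 each). The sketch below only restates the weighting above; the function and variable names are illustrative, not the benchmark's actual implementation.

```python
def composite_fitness(chat_ux: float, ttft_score: float,
                      throughput_score: float, shortform_quality: float) -> float:
    """Composite fitness from normalized 0-100 sub-scores (illustrative sketch).

    speed blends responsiveness and throughput; fitness applies the
    0.50 / 0.30 / 0.20 family weighting described above.
    """
    speed = 0.6 * ttft_score + 0.4 * throughput_score
    return 0.50 * chat_ux + 0.30 * speed + 0.20 * shortform_quality

# Example: a fast, chat-strong model with mid-tier shortform quality.
print(round(composite_fitness(chat_ux=93.3, ttft_score=95.0,
                              throughput_score=100.0, shortform_quality=60.0), 1))
```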

2.3 Hardware Configuration

| Component | Specification |
| --- | --- |
| Device | Mac mini M4 Pro |
| CPU | Apple M4 Pro, 12-core |
| GPU | 16-core integrated |
| Unified RAM | 24 GB |
| Inference Engine | Ollama 0.13.5 |

2.4 Model Registry

| Model | Origin | Params | Quant | Architecture | Warmup TTFT |
| --- | --- | --- | --- | --- | --- |
| deepseek-r1:14b | DeepSeek | 14.1B | Q4_K_M | Reasoning-heavy | 11.1s |
| deepseek-r1:8b | DeepSeek | 8.2B | Q4_K_M | Reasoning-heavy | 7.7s |
| gemma3:12b | Google | 12.2B | Q4_K_M | Direct-response | 9.5s |
| gemma3:4b | Google | 4.3B | Q4_K_M | Direct-response | 5.6s |
| gpt-oss:20b | OpenAI | 20.9B | MXFP4 | Reasoning-heavy | 11.2s |
| qwen3:14b | Alibaba | 14.5B | Q4_K_M | Reasoning-heavy | 9.5s |
| qwen3:4b | Alibaba | 4.0B | Q4_K_M | Reasoning-heavy | 6.5s |
| qwen3:8b | Alibaba | 8.2B | Q4_K_M | Reasoning-heavy | 7.8s |

2.5 Dataset Inventory

| Dataset | Kind | Total | Passed | Failed | Pass Rate |
| --- | --- | --- | --- | --- | --- |
| chat_instruction | chat | 80 | 58 | 22 | 72.5% |
| chat_memory | chat | 80 | 67 | 13 | 83.8% |
| chat_safety | chat | 80 | 58 | 22 | 72.5% |
| critical_thinking | shortform | 80 | 22 | 58 | 27.5% |
| cs_engineering | shortform | 80 | 41 | 39 | 51.2% |
| history | shortform | 80 | 52 | 28 | 65.0% |
| logic_deduction | shortform | 80 | 57 | 23 | 71.2% |
| physics | shortform | 80 | 46 | 34 | 57.5% |

3. Results: Short-Form Evaluation

3.1 Per-Model Accuracy

| Rank | Model | Pass Rate | Fail Rate | Shortform Acc. | Chat UX | Mean TTFT | Tok/s | GPU Temp | Fitness |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | gemma3:4b | 72.5% | 27.5% | 60.0% | 93.3% | 11.2s | 89.3 | 45.3°C | 88.7 |
| 2 | gemma3:12b | 77.5% | 22.5% | 68.0% | 93.3% | 11.3s | 38.7 | 45.1°C | 81.6 |
| 3 | deepseek-r1:8b | 65.0% | 35.0% | 54.0% | 83.3% | 18.4s | 36.6 | 78.5°C | 65.0 |
| 4 | qwen3:8b | 61.2% | 38.8% | 58.0% | 66.7% | 16.2s | 34.5 | 72.3°C | 59.8 |
| 5 | qwen3:4b | 56.2% | 43.8% | 54.0% | 60.0% | 15.4s | 49.9 | 68.6°C | 59.2 |
| 6 | gpt-oss:20b | 70.0% | 30.0% | 64.0% | 80.0% | 25.4s | 41.4 | 58.1°C | 57.8 |
| 7 | deepseek-r1:14b | 41.2% | 58.8% | 28.0% | 63.3% | 19.8s | 20.0 | 81.0°C | 45.4 |
| 8 | qwen3:14b | 57.5% | 42.5% | 50.0% | 70.0% | 26.3s | 17.7 | 82.6°C | 45.0 |

3.2 Speed Metrics

Average TTFT (lower is better)

| Model | Mean TTFT |
| --- | --- |
| gemma3:4b | 11.2s |
| gemma3:12b | 11.3s |
| qwen3:4b | 15.4s |
| qwen3:8b | 16.2s |
| deepseek-r1:8b | 18.4s |
| deepseek-r1:14b | 19.8s |
| gpt-oss:20b | 25.4s |
| qwen3:14b | 26.3s |

4. Seven-Axis Radar Visualization

The radar chart uses this report's seven-axis family: TTFT, Throughput, Math, Reasoning, Instruction, Chat UX, and Thermal. All values are normalized to 0-100.
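A minimal sketch of one way to produce the 0-100 axis values is shown below, assuming a simple min-max mapping across the evaluated models, with lower-is-better axes (TTFT, Thermal) inverted so that 100 is always best. The report's actual normalization pipeline may differ.

```python
def normalize(values, lower_is_better=False):
    """Min-max normalize a list of raw axis values to 0-100 (illustrative sketch)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [100.0 for _ in values]  # degenerate case: all models tie
    scores = [(v - lo) / (hi - lo) * 100.0 for v in values]
    # Invert axes where a smaller raw value is better (e.g. TTFT, GPU temperature).
    return [100.0 - s for s in scores] if lower_is_better else scores

# Example: mean TTFT values from Section 3.2 (seconds), best model scores 100.
ttft = [11.2, 11.3, 15.4, 16.2, 18.4, 19.8, 25.4, 26.3]
print([round(s, 1) for s in normalize(ttft, lower_is_better=True)])
```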

5. Failure Analysis

5.1 KO Concentration

Dominant KO datasets

Failure mass is concentrated in critical_thinking, cs_engineering, and physics.

5.2 Failure Summary by Dataset

| Dataset | KO Count | KO Rate | Total |
| --- | --- | --- | --- |
| critical_thinking | 58 | 72.5% | 80 |
| cs_engineering | 39 | 48.8% | 80 |
| physics | 34 | 42.5% | 80 |
| history | 28 | 35.0% | 80 |
| logic_deduction | 23 | 28.8% | 80 |
| chat_instruction | 22 | 27.5% | 80 |
| chat_safety | 22 | 27.5% | 80 |
| chat_memory | 13 | 16.2% | 80 |

6. Chat UX Track Results

6.1 Turn Compliance Rate (Task-Level)

| Model | Tasks | Chat % | TTFT | tok/s |
| --- | --- | --- | --- | --- |
| gemma3:12b | 30 | 93.3% | 11.3s | 38.7 |
| gemma3:4b | 30 | 93.3% | 11.2s | 89.3 |
| deepseek-r1:8b | 30 | 83.3% | 18.4s | 36.6 |
| gpt-oss:20b | 30 | 80.0% | 25.4s | 41.4 |
| qwen3:14b | 30 | 70.0% | 26.3s | 17.7 |
| qwen3:8b | 30 | 66.7% | 16.2s | 34.5 |
| deepseek-r1:14b | 30 | 63.3% | 19.8s | 20.0 |
| qwen3:4b | 30 | 60.0% | 15.4s | 49.9 |

6.2 Chat TTFT Distribution

Observed model means span 11.2s to 26.3s in TTFT.

7. Thermal Analysis

Average GPU Temperature (lower is better)

| Model | Mean GPU Temperature |
| --- | --- |
| gemma3:12b | 45.1°C |
| gemma3:4b | 45.3°C |
| gpt-oss:20b | 58.1°C |
| qwen3:4b | 68.6°C |
| qwen3:8b | 72.3°C |
| deepseek-r1:8b | 78.5°C |
| deepseek-r1:14b | 81.0°C |
| qwen3:14b | 82.6°C |

Thermal spread is 45.1°C to 82.6°C across evaluated models.

8. Composite Rankings

8.1 Efficiency Frontier

The efficiency frontier is defined by models that preserve high chat quality with lower latency and manageable thermal load under the fitness objective.

8.2 Recommendations by Use Case

| Use Case | Recommended Model | Rationale |
| --- | --- | --- |
| Balanced Production | gemma3:4b | Highest composite fitness (88.7). |
| Lowest Latency | gemma3:4b | Lowest mean TTFT (11.2s). |
| Highest Throughput | gemma3:4b | Highest mean output rate (89.3 tok/s). |
| Highest Reliability | gemma3:12b | Best global pass rate (77.5%). |

9. Chain-of-Thought (CoT) Analysis

This benchmark is explicitly oriented to the application use case: fast, chat-friendly interaction. In that context, CoT behavior is not a side detail; it directly affects first-token latency and perceived responsiveness.

9.1 Behavioral Mechanism

As characterized in benchmark/test_thinking.py, some models emit hidden reasoning before visible output. This produces an architecture-level latency overhead: the user sees the first token only after internal reasoning finishes.

Concretely, benchmark/test_thinking.py sends a deterministic two-message chat request to each model (system: final-answer-only, user: "What is 2+2?") and inspects the Ollama response fields. It records: eval_count (generated tokens), end-to-end elapsed time, derived tokens/sec, visible message.content length, hidden message.thinking length, and the thinking/content character split. If message.thinking is non-empty, the model is classified as using a reasoning-first response mode in this probe.
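A minimal sketch of such a probe is shown below. It assumes an Ollama server on localhost:11434 and the /api/chat response format; whether message.thinking is populated depends on the model and the Ollama version, so this is an approximation of the probe described above, not the exact benchmark/test_thinking.py code.

```python
import time
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed local Ollama endpoint

def probe_response_mode(model: str) -> dict:
    """Send the deterministic 2+2 probe and report the thinking/content split (sketch)."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "Answer with the final result only."},
            {"role": "user", "content": "What is 2+2?"},
        ],
        "stream": False,
    }
    start = time.time()
    data = requests.post(OLLAMA_URL, json=payload, timeout=300).json()
    elapsed = time.time() - start

    message = data.get("message", {})
    content_len = len(message.get("content", "") or "")
    thinking_len = len(message.get("thinking", "") or "")  # empty for direct-response models
    eval_count = data.get("eval_count", 0)

    return {
        "model": model,
        "elapsed_s": round(elapsed, 2),
        "tokens_per_s": round(eval_count / elapsed, 1) if elapsed else 0.0,
        "content_chars": content_len,
        "thinking_chars": thinking_len,
        "reasoning_first": thinking_len > 0,  # classification rule described above
    }

if __name__ == "__main__":
    print(probe_response_mode("gemma3:4b"))
```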

This is a behavioral probe, not a quality benchmark: it is intentionally small and controlled to isolate response mode effects. It does not grade task correctness breadth, nor does it estimate full benchmark variance across prompt families.

9.2 Response-Mode Interpretation

Model Families in This Run

Direct-response profile

gemma3:4b and gemma3:12b show the lowest TTFT in this run and the highest chat UX outcomes, consistent with chat-first serving behavior.

Reasoning-heavy profile

qwen3:*, deepseek-r1:*, and gpt-oss:20b can preserve useful reasoning quality but incur larger TTFT overhead in interactive chat settings.

9.3 Implications for Metrics in a Fast-Chat Benchmark

| Metric | Interpretation Under CoT Behavior |
| --- | --- |
| TTFT | Most sensitive metric for chat UX; hidden reasoning inflates first-token delay. |
| Tokens/sec | Still valid for generation throughput, but does not cancel the first-token waiting cost. |
| Wall-clock response feel | Dominated by TTFT in chat flows; users perceive delay before any visible answer. |
| Accuracy | Reasoning overhead may help on harder tasks, but can underperform in chat-first utility when latency is prioritized. |

9.4 Benchmark Positioning

This report should be read as a use-case benchmark, not a universal intelligence ranking. The scoring favors models that are both correct and immediately responsive for conversational product UX. A model can be strong in deeper reasoning workloads and still rank lower here if its response mode is slower.

9.5 Practical Reading of Results

For this app, lower TTFT and high chat compliance are primary deployment criteria. The CoT framing explains why some reasoning-heavy models underperform in this benchmark despite competitive quality on non-chat workloads.

10. Magistral Case Study (Excluded from Final Ranking)

magistral:24b was included in our local benchmark and ad-hoc test programs because it is present in our app model catalog, performs well in many interactive sessions, and is a strategically relevant European model family to evaluate in real operating conditions.

In this report, Magistral is documented as a dedicated case study because its behavior diverged across evaluation regimes. The final ranking exclusion is therefore presented after evidence review, and is treated as a hardware-profile reliability decision, not a general rejection of model potential.

Why this is a surprise

Interactive app usage and benchmark stress workload are different operating regimes. A model can feel strong in user-driven chat sessions but still collapse under sustained, back-to-back hard benchmark prompts with strict timeout and fairness controls.

10.1 Quantitative Evidence (DB-backed)

Aggregated benchmark observations show magistral:24b has non-stationary behavior on this hardware profile: it can produce acceptable outcomes in some benchmark executions, and collapse in others.

| Observed regime | Interpretation |
| --- | --- |
| Stable execution episodes | Model completes substantial workload with usable quality. |
| Collapse episodes | Systematic timeout/hang patterns dominate and break comparability. |
| Intermediate episodes | Partial completion with elevated error density and degraded throughput. |

10.2 Failure Modes Detected

Minimal isolation tests reproduced the same pattern, so this is not attributed to benchmark-framework orchestration. Operationally, these cases exceed user-friendly latency bounds.

10.3 Error Signatures

10.4 Thermal Risk Observation

During the latest stress reproduction, direct sensor checks reported: GPU temperature: 97.6°C, GPU utilization: 100%, and panel telemetry showing hottest GPU around 97.7°C with average GPU around 92.2°C. This is outside a comfortable sustained operating envelope for user-facing reliability.

10.5 Interpretation

The evidence indicates a difficulty-dependent reliability instability on this hardware profile. The model can provide acceptable outputs on lower-complexity prompts, but on a subset of hard logic/math prompts it shows reproducible failure modes: prolonged non-productive generation, timeout/connection failures, and thermal escalation. Isolation with minimal direct scripts supports that this is not a benchmark-framework orchestration artifact. The resulting decision criterion is operational: user-facing latency reliability and thermal safety.

10.6 Phased Ad-Hoc Treatment

We executed a dedicated phased ad-hoc treatment against the app's /api/stream endpoint, using final-answer-only prompting and no retries, then re-judged the outputs with gpt-oss:20b.

| Phase | Result |
| --- | --- |
| Preflight | Endpoints/models/datasets validated (with hardened checks). |
| Smoke gate (8 tasks) | Passed: ok_runs=8, non_empty_runs=8, timeout_runs=0. |
| Full canonical run (80 tasks) | Transport stable: ok_rate=1.0, non_empty_rate=1.0, timeout_rate=0.0. |
| Rejudge (gpt-oss:20b) | Official semantic pass: 44/80 = 55%; manual-adjusted signal of roughly 60% after parse-failure review. |

10.7 Minimal CLI Reproduction (Logic + Math)

Final direct CLI evidence is consolidated in benchmark/magistral/results/2026-02-21_CLI_logic_math_results.log. Only these tests are used for this subsection.

| Task | Prompt regime | Observed outcome |
| --- | --- | --- |
| logic_037 | Raw question | Very long reasoning loop (real 801.16s), drifting conclusions and unstable termination. |
| logic_037 | Short-answer prefixed (2 runs) | Faster (real 9.17s, 4.98s) but still non-compliant formatting (explanations despite final-only request). |
| Absolute-value equation | Raw question | Correct solution set after excessive chain-of-thought expansion (real 224.67s). |
| Absolute-value equation | Short-answer prefixed (2 runs) | Fast but wrong outputs (real 8.53s, 2.13s), with solution drift across repeats. |

CLI conclusion from these tests: prompt constraints can reduce latency, but do not stabilize correctness or format compliance. Raw mode can recover correctness on some math items, yet with impractical latency and uncontrolled reasoning sprawl.

10.8 Evidence Package

Reproducible references for this case study are stored in: benchmark/magistral/scripts/ and benchmark/magistral/results/ (including 2026-02-21_CLI_logic_math_results.log).

10.9 Updated Learning

10.10 Decision

10.11 Final Conclusion (Technical + Humanist Approach)

Technical: in this environment, magistral:24b is non-stationary across prompt regimes. Short constraints can accelerate outputs but may degrade correctness; raw mode can preserve some reasoning quality but at prohibitive latency and unstable stopping behavior.

Humanist approach: the model often reads like a deep, open-minded thinker with strong intellectual tone, but still feels under-refined for predictable production behavior even on easy-to-medium logic/math tasks.

Final position: keep Magistral as a valuable experimental model and stress benchmark subject, but outside the final published ranking for this hardware profile.

11. Limitations and Future Work

11.1 Limitations

11.2 Future Work

12. Reproducibility

12.1 Data Source

Metrics and conclusions combine: DB-backed benchmark evidence from db/benchmark.db (run 9cc182d7-74c0-4ac2-a0eb-3ed86afd142b) and phased ad-hoc artifacts in benchmark/magistral/results/ (notably 2026-02-20_magistral_adhoc_smoke_one_each.json, 2026-02-20_magistral_adhoc_canonical_80_raw.json, 2026-02-20_magistral_adhoc_canonical_80_rejudge_gptoss.json).
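For readers reproducing the DB-backed figures, a minimal sketch for inspecting the run database is shown below. It assumes only that db/benchmark.db is a SQLite file; no table schema is assumed beyond listing what is present.

```python
import sqlite3

DB_PATH = "db/benchmark.db"
RUN_ID = "9cc182d7-74c0-4ac2-a0eb-3ed86afd142b"

with sqlite3.connect(DB_PATH) as conn:
    # List the tables available in the benchmark database (schema not assumed here).
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    print("tables:", tables)
    print("run of interest:", RUN_ID)
```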

12.2 Outcome Definitions

Chat tracks: pass/fail is taken from the recorded task status. Shortform tracks: the primary run outcome is also taken from the recorded task status, while semantic opinions from gpt-oss:20b, Codex, and Claude are stored in the DB as parallel judge layers for this full run.

Several-judges philosophy: no single judge is treated as absolute truth. We keep all judge evaluations side by side to measure consensus, detect disagreement pockets, and separate model behavior from judge-specific bias. Ranking conclusions are based on convergent signals across judges plus operational metrics (latency, stability, and transport reliability), not on one semantic scorer alone.
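As an illustration of this several-judges philosophy, the sketch below computes simple agreement statistics from per-task verdicts. The data structure and task identifiers are hypothetical and do not reflect the report's actual storage format.

```python
from collections import Counter

# Hypothetical per-task verdicts from three judge layers (True = pass).
judge_verdicts = {
    "task_001": {"gpt-oss:20b": True,  "codex": True,  "claude": True},
    "task_002": {"gpt-oss:20b": False, "codex": True,  "claude": False},
    "task_003": {"gpt-oss:20b": True,  "codex": False, "claude": True},
}

def consensus_report(verdicts: dict) -> dict:
    """Summarize judge agreement: unanimous tasks vs. disagreement pockets (sketch)."""
    unanimous, split = 0, []
    for task, by_judge in verdicts.items():
        counts = Counter(by_judge.values())
        if len(counts) == 1:
            unanimous += 1
        else:
            split.append(task)
    return {"unanimous": unanimous, "disagreements": split}

print(consensus_report(judge_verdicts))
```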

12.3 Radar Axes

TTFT, Throughput, Math, Reasoning, Instruction, Chat UX, Thermal (normalized 0-100).