Live Benchmark Monitor
Real-time observability while a benchmark run executes: system telemetry, task progression, per-task streaming details, and live pass/fail accounting.
What It Is
The live monitor is a standalone dashboard that connects to the benchmark runner's SSE telemetry endpoint (/api/benchmark/telemetry/stream) and updates in real time. It is useful when a run_benchmark.py session is active — outside of that context it waits silently for events.
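The wire format is standard Server-Sent Events: each event is one or more "data:" lines followed by a blank line. A minimal Python sketch of the client-side parsing a consumer of this endpoint would do (the JSON field names in the sample event are illustrative assumptions, not the runner's actual schema):

```python
import json

def parse_sse(raw: str):
    """Yield one decoded JSON payload per SSE event in `raw`.

    Events are separated by a blank line; each event carries one or
    more 'data:' lines, joined with newlines per the SSE spec.
    """
    for block in raw.split("\n\n"):
        data_lines = [line[5:].strip() for line in block.splitlines()
                      if line.startswith("data:")]
        if data_lines:
            yield json.loads("\n".join(data_lines))

# Hypothetical telemetry event, shaped like what the monitor might receive:
stream = 'data: {"tps": 41.2, "gpu_temp": 67}\n\n'
events = list(parse_sse(stream))
```

A real client would read the response incrementally instead of splitting a finished string, but the framing rules are the same.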
It complements the static benchmark reports (Guided, Claude) by showing the execution side: is the server keeping up? Is the GPU thermally stable? Which tasks just passed or failed?
Access
- Route: /docs/benchmark_monitor.html — publicly accessible, no login required.
- Data feed: connects automatically to the benchmark telemetry SSE endpoint. If no run is active the page shows a waiting indicator.
- Chrome toggle: the "Hide header" button collapses the nav bar so charts have maximum vertical space. The state persists in localStorage.
Realtime Telemetry
Six auto-refreshing ECharts time-series, each toggleable via the buttons above the grid. Each chart scrolls a 60-sample window.
- TPS (tokens/s): inference throughput of the model currently streaming. The primary performance signal during a run.
- FD (file descriptors): open FD count on the Flask process. A steady climb can indicate a resource leak.
- CPU (%): host CPU load. Color bands: green < 20 %, blue 20–50 %, yellow 50–70 %, red > 70 %.
- Disk (MB/s): aggregate disk I/O. Spikes often coincide with model loading or log writes.
- GPU (%): GPU utilization. Stays high during active inference, drops to near-zero during the cooling gap between tasks.
- Temp (°C): GPU die temperature. Bands: blue < 50 °C → green → yellow → orange → red > 95 °C.
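The scrolling window and the color bands are both simple mechanics; a Python sketch of the two ideas, using the CPU thresholds listed above and the 60-sample limit (this mirrors the chart behavior, not the dashboard's actual JS):

```python
from collections import deque

WINDOW = 60  # samples kept per chart, matching the 60-sample scroll

def make_series():
    """Fixed-size buffer: appending the 61st sample drops the oldest."""
    return deque(maxlen=WINDOW)

def cpu_band(pct: float) -> str:
    """Map a CPU % sample to its chart color band (green/blue/yellow/red)."""
    if pct < 20:
        return "green"
    if pct < 50:
        return "blue"
    if pct < 70:
        return "yellow"
    return "red"

series = make_series()
for sample in range(75):   # push 75 samples: only the last 60 survive
    series.append(sample)
```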
Task Lists
Three pill grids updated live as tasks complete. Each grid can be hidden via the toggle buttons at the top of the section.
- Models: one pill per model in the run. Shows its ordinal (e.g., 2/5) and current status.
- Datasets Overview: one pill per dataset. Turns green when all tasks in that dataset are done.
- Current Dataset Tasks: one pill per task in the active dataset. Colored by state: grey = pending, amber = running, green = pass, red = fail/error.
A pulsing amber halo on a running task dot indicates the active LLM call. It disappears as soon as the result is evaluated.
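The pill coloring reduces to a small state map; a sketch using the states listed above (the "pulse" flag for the amber halo is an assumption about how the UI might represent the active call):

```python
# Task state → pill color, as described in the docs above.
PILL_COLORS = {"pending": "grey", "running": "amber",
               "pass": "green", "fail": "red", "error": "red"}

def pill_style(state: str) -> dict:
    """Color by task state; only a running task pulses (the amber halo)."""
    return {"color": PILL_COLORS[state], "pulse": state == "running"}
```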
Dashboard Panel
Three KPI cards that summarize the current moment without needing to read the charts.
- Server Status: active model name, Ollama connection state, current GPU temperature, and thermal classification (cool / warm / hot).
- Progress Snapshot: how far into the model/dataset/task matrix the run is, plus live OK %, KO %, and OTHER % and remaining task count.
- Task Details: attempt number, task status label, TTFT (time to first token), streaming speed, and end-to-end latency of the current or last completed task.
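The OK/KO/OTHER split is plain bookkeeping over finished tasks; a hedged sketch of how the snapshot percentages could be derived (function and field names are illustrative, not the server's actual API):

```python
def progress_snapshot(results, total_tasks):
    """results: list of 'pass'/'fail'/'error' strings for finished tasks."""
    done = len(results)
    ok = results.count("pass")
    ko = results.count("fail")
    other = done - ok - ko      # errors, timeouts, anything non-pass/fail

    def pct(n):
        return round(100 * n / done, 1) if done else 0.0

    return {"ok_pct": pct(ok), "ko_pct": pct(ko),
            "other_pct": pct(other), "remaining": total_tasks - done}

snap = progress_snapshot(["pass", "pass", "fail", "error"], total_tasks=10)
```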
Streaming Details
The bottom card shows the full drill-down for the task currently executing (or the last one that ran, labelled "last task").
Workflow Progress pills
Five sequential stages, each showing elapsed time once passed:
- Cooling: mandatory inter-task pause so the GPU can recover before the next call. Duration is configurable in config.yaml.
- Thinking: the model is generating, but hidden reasoning tokens (<think>…</think>) are still being produced. No visible output yet.
- Streaming: visible tokens are flowing. TTFT is captured at the start of this stage.
- Evaluating: the completed response is being scored by the evaluator (exact match, extractive QA, code execution, etc.).
- Done: result written to DB. Final PASS / FAIL / ERROR determined.
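TTFT and the per-stage elapsed times fall out naturally if each stage transition is timestamped; a sketch of that bookkeeping (stage names from the list above; the class and its API are assumptions, not the runner's code):

```python
import time

class StageClock:
    """Record when each workflow stage starts; derive elapsed times."""

    def __init__(self):
        self.stamps = {}

    def enter(self, stage: str):
        self.stamps[stage] = time.monotonic()

    def elapsed(self, start: str, end: str) -> float:
        """Seconds between entering `start` and entering `end`."""
        return self.stamps[end] - self.stamps[start]

clock = StageClock()
clock.enter("thinking")      # request sent, hidden reasoning tokens only
time.sleep(0.01)
clock.enter("streaming")     # first visible token arrives
ttft = clock.elapsed("thinking", "streaming")   # time to first token
```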
Text panels
- Status: current workflow stage in uppercase.
- Prompt: the exact prompt sent to the model, including any injected system context.
- Response: the raw model output as it arrives, updated token-by-token during streaming.
- Result: the evaluator verdict and any scoring detail (expected vs. actual, score threshold, etc.).
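For the exact-match evaluator mentioned above, the verdict logic can be as small as a normalized string compare; a sketch only (the normalization rules here, lowercase plus whitespace collapsing, are assumptions):

```python
def exact_match_verdict(expected: str, actual: str) -> dict:
    """Normalize case and whitespace, compare, return verdict + detail."""
    def norm(s):
        return " ".join(s.lower().split())

    passed = norm(expected) == norm(actual)
    return {"verdict": "PASS" if passed else "FAIL",
            "expected": expected, "actual": actual}
```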
When No Run Is Active
All panels show dashes or empty grids. A yellow waiting-indicator banner appears: "Waiting for benchmark events…". The SSE connection retries automatically. As soon as run_benchmark.py starts publishing events the monitor fills in live without any page reload.
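The automatic retry can be modeled as exponential backoff with a cap, a common pattern for SSE reconnects; a sketch with illustrative constants (not the monitor's actual delay values):

```python
def backoff_delays(base=1.0, cap=30.0, attempts=6):
    """Reconnect delays: double on each attempt, never exceeding `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]
```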
Troubleshooting
- Charts never render: ECharts JS assets missing from /js/ — re-run scripts/init.sh.
- Waiting indicator stuck: confirm the benchmark runner is active and the Flask server is reachable. Check log/server.out.log for SSE errors.
- GPU chart flat at 0: the app_gpu.js poller may not have permission to read GPU metrics on this machine. CPU and temperature charts should still work.
- Task pills not updating: the SSE stream may have dropped. Reload the page — the monitor reconnects and replays the current run state.