Live Benchmark Monitor
Real-time observability while a benchmark run executes: system telemetry, task progression, per-task streaming details, and live pass/fail accounting.
What It Is
The live monitor is a standalone dashboard that connects to the benchmark runner's SSE telemetry endpoint (/api/benchmark/telemetry/stream) and updates in real time. It is useful when a run_benchmark.py session is active — outside of that context it waits silently for events.
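The wire format is standard Server-Sent Events: each event is one or more "data:" lines followed by a blank line. A minimal Python sketch of the client-side parsing a consumer of this endpoint would do (the JSON field names in the sample event are illustrative assumptions, not the runner's actual schema):

```python
import json

def parse_sse(raw: str):
    """Yield one decoded JSON payload per SSE event in `raw`.

    Events are separated by a blank line; each event carries one or
    more 'data:' lines, joined with newlines per the SSE spec.
    """
    for block in raw.split("\n\n"):
        data_lines = [line[5:].strip() for line in block.splitlines()
                      if line.startswith("data:")]
        if data_lines:
            yield json.loads("\n".join(data_lines))

# Hypothetical telemetry event, shaped like what the monitor might receive:
stream = 'data: {"tps": 41.2, "gpu_temp": 67}\n\n'
events = list(parse_sse(stream))
```

A real client would read the response incrementally instead of splitting a finished string, but the framing rules are the same.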
It complements the static benchmark reports (Guided, Claude) by showing the execution side: is the server keeping up? Is the GPU thermally stable? Which tasks just passed or failed?
Access
- Route: /docs/benchmark_monitor.html — publicly accessible, no login required.
- Data feed: connects automatically to the benchmark telemetry SSE endpoint. If no run is active the page shows a waiting indicator.
- Chrome toggle: the "Hide header" button collapses the nav bar so charts have maximum vertical space. The state persists in localStorage.
Realtime Telemetry
Six auto-refreshing ECharts time-series, each toggleable via the buttons above the grid. Each chart scrolls a 60-sample window.
- TPS (tokens/s): inference throughput of the model currently streaming. The primary performance signal during a run.
- FD (file descriptors): open FD count on the Flask process. A steady climb can indicate a resource leak.
- CPU (%): host CPU load. Color bands: green < 20 %, blue 20–50 %, yellow 50–70 %, red > 70 %.
- Disk (MB/s): aggregate disk I/O. Spikes often coincide with model loading or log writes.
- GPU (%): GPU utilization. Stays high during active inference, drops to near-zero during the cooling gap between tasks.
- Temp (°C): GPU die temperature. Bands: blue < 50 °C → green → yellow → orange → red > 95 °C.
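The scrolling window and the color bands are both simple mechanics; a Python sketch of the two ideas, using the CPU thresholds listed above and the 60-sample limit (this mirrors the chart behavior, not the dashboard's actual JS):

```python
from collections import deque

WINDOW = 60  # samples kept per chart, matching the 60-sample scroll

def make_series():
    """Fixed-size buffer: appending the 61st sample drops the oldest."""
    return deque(maxlen=WINDOW)

def cpu_band(pct: float) -> str:
    """Map a CPU % sample to its chart color band (green/blue/yellow/red)."""
    if pct < 20:
        return "green"
    if pct < 50:
        return "blue"
    if pct < 70:
        return "yellow"
    return "red"

series = make_series()
for sample in range(75):   # push 75 samples: only the last 60 survive
    series.append(sample)
```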
Task Lists
Three pill grids updated live as tasks complete. Each grid can be hidden via the toggle buttons at the top of the section.
- Models: one pill per model in the run. Shows its ordinal (e.g., 2/5) and current status.
- Datasets Overview: one pill per dataset. Turns green when all tasks in that dataset are done.
- Current Dataset Tasks: one pill per task in the active dataset. Colored by state: grey = pending, amber = running, green = pass, red = fail/error.
A pulsing amber halo on a running task dot indicates the active LLM call. It disappears as soon as the result is evaluated.
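The pill coloring reduces to a small state map; a sketch using the states listed above (the "pulse" flag for the amber halo is an assumption about how the UI might represent the active call):

```python
# Task state → pill color, as described in the docs above.
PILL_COLORS = {"pending": "grey", "running": "amber",
               "pass": "green", "fail": "red", "error": "red"}

def pill_style(state: str) -> dict:
    """Color by task state; only a running task pulses (the amber halo)."""
    return {"color": PILL_COLORS[state], "pulse": state == "running"}
```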
Dashboard Panel
Three KPI cards that summarize the current moment without needing to read the charts.
- Server Status: active model name, Ollama connection state, current GPU temperature, and thermal classification (cool / warm / hot).
- Progress Snapshot: how far into the model/dataset/task matrix the run is, plus live OK %, KO %, and OTHER % and remaining task count.
- Task Details: attempt number, task status label, TTFT (time to first token), streaming speed, and end-to-end latency of the current or last completed task.
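The OK/KO/OTHER split is plain bookkeeping over finished tasks; a hedged sketch of how the snapshot percentages could be derived (function and field names are illustrative, not the server's actual API):

```python
def progress_snapshot(results, total_tasks):
    """results: list of 'pass'/'fail'/'error' strings for finished tasks."""
    done = len(results)
    ok = results.count("pass")
    ko = results.count("fail")
    other = done - ok - ko      # errors, timeouts, anything non-pass/fail

    def pct(n):
        return round(100 * n / done, 1) if done else 0.0

    return {"ok_pct": pct(ok), "ko_pct": pct(ko),
            "other_pct": pct(other), "remaining": total_tasks - done}

snap = progress_snapshot(["pass", "pass", "fail", "error"], total_tasks=10)
```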
Streaming Details
The bottom card shows the full drill-down for the task currently executing (or the last one that ran, labelled "last task").
Workflow Progress pills
Five sequential stages, each showing elapsed time once passed:
- Cooling: mandatory inter-task pause so the GPU can recover before the next call. Duration is configurable in config.yaml.
- Thinking: the model is generating, but hidden reasoning tokens (<think>…</think>) are still being produced. No visible output yet.
- Streaming: visible tokens are flowing. TTFT is captured at the start of this stage.
- Evaluating: the completed response is being scored by the evaluator (exact match, extractive QA, code execution, etc.).
- Done: result written to DB. Final PASS / FAIL / ERROR determined.
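TTFT and the per-stage elapsed times fall out naturally if each stage transition is timestamped; a sketch of that bookkeeping (stage names from the list above; the class and its API are assumptions, not the runner's code):

```python
import time

class StageClock:
    """Record when each workflow stage starts; derive elapsed times."""

    def __init__(self):
        self.stamps = {}

    def enter(self, stage: str):
        self.stamps[stage] = time.monotonic()

    def elapsed(self, start: str, end: str) -> float:
        """Seconds between entering `start` and entering `end`."""
        return self.stamps[end] - self.stamps[start]

clock = StageClock()
clock.enter("thinking")      # request sent, hidden reasoning tokens only
time.sleep(0.01)
clock.enter("streaming")     # first visible token arrives
ttft = clock.elapsed("thinking", "streaming")   # time to first token
```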
Text panels
- Status: current workflow stage in uppercase.
- Prompt: the exact prompt sent to the model, including any injected system context.
- Response: the raw model output as it arrives, updated token-by-token during streaming.
- Result: the evaluator verdict and any scoring detail (expected vs. actual, score threshold, etc.).
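For the exact-match evaluator mentioned above, the verdict logic can be as small as a normalized string compare; a sketch only (the normalization rules here, lowercase plus whitespace collapsing, are assumptions):

```python
def exact_match_verdict(expected: str, actual: str) -> dict:
    """Normalize case and whitespace, compare, return verdict + detail."""
    def norm(s):
        return " ".join(s.lower().split())

    passed = norm(expected) == norm(actual)
    return {"verdict": "PASS" if passed else "FAIL",
            "expected": expected, "actual": actual}
```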
When No Run Is Active
All panels show dashes or empty grids. A yellow waiting-indicator banner appears: "Waiting for benchmark events…". The SSE connection retries automatically. As soon as run_benchmark.py starts publishing events the monitor fills in live without any page reload.
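The automatic retry can be modeled as exponential backoff with a cap, a common pattern for SSE reconnects; a sketch with illustrative constants (not the monitor's actual delay values):

```python
def backoff_delays(base=1.0, cap=30.0, attempts=6):
    """Reconnect delays: double on each attempt, never exceeding `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]
```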
Troubleshooting
- Charts never render: ECharts JS assets missing from /js/ — re-run scripts/init.sh.
- Waiting indicator stuck: confirm the benchmark runner is active and the Flask server is reachable. Check log/server.out.log for SSE errors.
- GPU chart flat at 0: the app_gpu.js poller may not have permission to read GPU metrics on this machine. CPU and temperature charts should still work.
- Task pills not updating: the SSE stream may have dropped. Reload the page — the monitor reconnects and replays the current run state.