Methodology.
How every leaderboard on this site is scored. The rubric carries a version number; bumping it stamps existing entries with a "scored under prior version" note rather than silently overwriting them.
Scoring methodology: v1
Principles
- Cases come from real commits. Each LLM case in benchmarks/cases/*.json references a source commit so the task shape stays grounded — not synthetic prompts.
- Automated checkers only. Every case has a programmatic verifier; nothing depends on a human re-scoring run-to-run. Pipelines that need human scoring decay.
- Severity weights. Silent-wrong (model returns confidently incorrect output) counts 3–5× a parse-fail in aggregate scoring. Silent errors corrupt downstream work.
- Family diversity. A model promotion that leaves a role single-family is rejected even if it wins on every other metric. Same-family agreement is correlated, not independent.
- Cost tier on every model. Free local models and paid cloud models are not interchangeable; the registry tracks this so promotion logic can prefer free tiers when quality is comparable.
- Stock vs Modded never share a board. Modded hardware entries route exclusively to /leaderboards/mods. Mixing stock and modded ranks is the 3DMark Hall of Fame anti-pattern that destroys trust.
- Pair every thermal claim with a noise figure. Pair every perf claim with a power figure. Single-axis hardware claims invite gaming. Where data wasn't captured, render 'n/a — reason' explicitly.
- Per-row reproduction recipe. For models: prompt-set hash + sampler params + quant + runner version + hardware. For hardware: exact mod, ambient, fan curve, run duration. Where data is missing on existing rows we render 'n/a — not captured for this run' rather than fabricate.
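The severity-weight and family-diversity principles above can be sketched in a few lines. This is a minimal illustration, not the repo's actual promotion logic: the weight values, outcome labels, and function names are assumptions (the real multiplier range and promotion policy live in benchmarks/models.json).

```python
from collections import Counter

# Hypothetical outcome labels and weights; the rubric's real multiplier
# is somewhere in the stated 3-5x range, not necessarily exactly 4.
WEIGHTS = {"pass": 0.0, "parse_fail": 1.0, "silent_wrong": 4.0}

def penalty_score(outcomes):
    """Aggregate penalty: a silent-wrong counts 4x a parse-fail."""
    counts = Counter(outcomes)
    return sum(WEIGHTS[o] * n for o, n in counts.items())

def promotion_allowed(role_models, candidate, incumbent):
    """Reject any promotion that would leave the role single-family,
    even if the candidate wins on every other metric."""
    families = {m["family"] for m in role_models if m is not incumbent}
    families.add(candidate["family"])
    return len(families) >= 2
```

For example, swapping the only non-Qwen model in a role for another Qwen fails the check regardless of score, while a cross-family swap passes.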
Pre-bench protocol
Before any GPU benchmark run, a short eviction checklist runs to prevent VRAM pressure from invalidating the result. The full protocol lives in the repo.
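A minimal sketch of the go/no-go half of such a check, assuming an `nvidia-smi` free-memory query. The threshold and function names are illustrative, not taken from the repo's protocol:

```python
import subprocess

# Assumed floor for a clean run on a 24 GB card; the actual checklist
# and numbers live in the repo's pre-bench protocol, not here.
MIN_FREE_MIB = 20000

def free_vram_mib(smi_output: str) -> int:
    """Parse the output of:
    nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits"""
    return int(smi_output.strip().splitlines()[0])

def preflight_ok(smi_output: str, floor: int = MIN_FREE_MIB) -> bool:
    """Fail the run up front rather than record a VRAM-pressured result."""
    return free_vram_mib(smi_output) >= floor

def query_and_check() -> bool:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return preflight_ok(out)
```

The point is that the check aborts before the benchmark starts, so a contended GPU produces no score at all instead of a silently degraded one.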
Raw data
Every score on this site links to its source file. Repository paths:
- reports/benchmarks/*.json — top-level benchmark runner output
- reports/benchmarks/&lt;box&gt;-&lt;scenario&gt;-&lt;date&gt;/ — per-session run dirs
- reports/baseline-2026-04-17/ — 3DMark + Cinebench + HWiNFO baselines
- benchmarks/models.json — model registry + promotion policy
- benchmarks/cases/*.json — per-case prompt + checker definitions
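To make "prompt + checker definition" concrete, here is a hedged sketch of what one cases entry and its automated verifier might look like. The field names and the `exact` checker type are illustrative assumptions, not the repo's actual schema:

```python
import json

# Hypothetical shape for one benchmarks/cases/*.json entry.
CASE = json.loads("""
{
  "id": "extract-version-001",
  "source_commit": "abc1234",
  "prompt": "Return only the semver in: 'release v2.4.1 shipped'",
  "checker": {"type": "exact", "expected": "2.4.1"}
}
""")

def run_checker(case: dict, model_output: str) -> str:
    """Programmatic verdict; no human in the loop run-to-run.
    Distinguishes parse_fail from silent_wrong so severity
    weighting has something to weight."""
    got = model_output.strip()
    if not got:
        return "parse_fail"
    if case["checker"]["type"] == "exact":
        return "pass" if got == case["checker"]["expected"] else "silent_wrong"
    raise ValueError("unknown checker type")
```

Note that a confidently wrong answer ("v2.4.1" where "2.4.1" was required) lands in the heavily-weighted silent_wrong bucket, not the cheap parse_fail one.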
Each leaderboard also exposes its data as JSON at /leaderboards/&lt;section&gt;.json — the same dict the HTML renders, jsonify'd unmodified.
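The "one dict, two views" pattern above can be sketched as follows. The structure and field names here are invented for illustration; only the pattern (serialize the render dict unmodified, as Flask's jsonify would) comes from the text:

```python
import json

def leaderboard_dict():
    # Hypothetical render dict; the real one is built by the site.
    return {
        "section": "local-models",
        "rubric_version": "v1",
        "rows": [{"model": "example-7b", "score": 0.82, "tier": "free"}],
    }

def as_json(board: dict) -> str:
    # The JSON endpoint serializes the exact dict the HTML template
    # receives -- no reshaping, so the two views cannot drift apart.
    return json.dumps(board)
```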
Conflict of interest
Hardware was self-funded. No vendor relationships. The repair shop that performed the 3090 FE pad swap charged $35 labor; pads and paste were owner-supplied. No sponsorships.
Back to: /leaderboards · /local-models · /hardware · /mods.