Methodology.
How every leaderboard on this site is scored. The rubric carries a version number; bumping it stamps existing entries with a "scored under prior version" note rather than silently overwriting them.
Scoring methodology: v1
Principles
- Cases come from real commits. Each LLM case in benchmarks/cases/*.json references a source commit so the task shape stays grounded — not synthetic prompts.
- Automated checkers only. Every case has a programmatic verifier; nothing depends on a human re-scoring run-to-run. Pipelines that need human scoring decay.
- Severity weights. Silent-wrong (model returns confidently incorrect output) counts 3–5× a parse-fail in aggregate scoring. Silent errors corrupt downstream work.
- Family diversity. A model promotion that leaves a role single-family is rejected even if it wins on every other metric. Same-family agreement is correlated, not independent.
- Cost tier on every model. Free local models and paid cloud models are not interchangeable; the registry tracks this so promotion logic can prefer free tiers when quality is comparable.
- Stock vs Modded never share a board. Modded hardware entries route exclusively to /leaderboards/mods. Mixing stock and modded ranks is the 3DMark Hall of Fame anti-pattern that destroys trust.
- Pair every thermal claim with a noise figure. Pair every perf claim with a power figure. Single-axis hardware claims invite gaming. Where data wasn't captured, render 'n/a — reason' explicitly.
- Per-row reproduction recipe. For models: prompt-set hash + sampler params + quant + runner version + hardware. For hardware: exact mod, ambient, fan curve, run duration. Where data is missing on existing rows we render 'n/a — not captured for this run' rather than fabricate.
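The severity-weight and family-diversity principles above can be sketched in a few lines. This is a minimal illustration, not the repo's actual promotion logic: the weight values, outcome labels, and function names are assumptions (the real multiplier range and promotion policy live in benchmarks/models.json).

```python
from collections import Counter

# Hypothetical outcome labels and weights; the rubric's real multiplier
# is somewhere in the stated 3-5x range, not necessarily exactly 4.
WEIGHTS = {"pass": 0.0, "parse_fail": 1.0, "silent_wrong": 4.0}

def penalty_score(outcomes):
    """Aggregate penalty: a silent-wrong counts 4x a parse-fail."""
    counts = Counter(outcomes)
    return sum(WEIGHTS[o] * n for o, n in counts.items())

def promotion_allowed(role_models, candidate, incumbent):
    """Reject any promotion that would leave the role single-family,
    even if the candidate wins on every other metric."""
    families = {m["family"] for m in role_models if m is not incumbent}
    families.add(candidate["family"])
    return len(families) >= 2
```

For example, swapping the only non-Qwen model in a role for another Qwen fails the check regardless of score, while a cross-family swap passes.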
Pre-bench protocol
Before any GPU benchmark run, a short eviction checklist runs to prevent VRAM pressure from invalidating the result. The full protocol lives in the repo.
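A minimal sketch of the go/no-go half of such a check, assuming an `nvidia-smi` free-memory query. The threshold and function names are illustrative, not taken from the repo's protocol:

```python
import subprocess

# Assumed floor for a clean run on a 24 GB card; the actual checklist
# and numbers live in the repo's pre-bench protocol, not here.
MIN_FREE_MIB = 20000

def free_vram_mib(smi_output: str) -> int:
    """Parse the output of:
    nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits"""
    return int(smi_output.strip().splitlines()[0])

def preflight_ok(smi_output: str, floor: int = MIN_FREE_MIB) -> bool:
    """Fail the run up front rather than record a VRAM-pressured result."""
    return free_vram_mib(smi_output) >= floor

def query_and_check() -> bool:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return preflight_ok(out)
```

The point is that the check aborts before the benchmark starts, so a contended GPU produces no score at all instead of a silently degraded one.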
Raw data
Every score on this site links to its source file. Repository paths:
- reports/benchmarks/*.json — top-level benchmark runner output
- reports/benchmarks/&lt;box&gt;-&lt;scenario&gt;-&lt;date&gt;/ — per-session run dirs
- reports/baseline-2026-04-17/ — 3DMark + Cinebench + HWiNFO baselines
- benchmarks/models.json — model registry + promotion policy
- benchmarks/cases/*.json — per-case prompt + checker definitions
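To make "prompt + checker definition" concrete, here is a hedged sketch of what one cases entry and its automated verifier might look like. The field names and the `exact` checker type are illustrative assumptions, not the repo's actual schema:

```python
import json

# Hypothetical shape for one benchmarks/cases/*.json entry.
CASE = json.loads("""
{
  "id": "extract-version-001",
  "source_commit": "abc1234",
  "prompt": "Return only the semver in: 'release v2.4.1 shipped'",
  "checker": {"type": "exact", "expected": "2.4.1"}
}
""")

def run_checker(case: dict, model_output: str) -> str:
    """Programmatic verdict; no human in the loop run-to-run.
    Distinguishes parse_fail from silent_wrong so severity
    weighting has something to weight."""
    got = model_output.strip()
    if not got:
        return "parse_fail"
    if case["checker"]["type"] == "exact":
        return "pass" if got == case["checker"]["expected"] else "silent_wrong"
    raise ValueError("unknown checker type")
```

Note that a confidently wrong answer ("v2.4.1" where "2.4.1" was required) lands in the heavily-weighted silent_wrong bucket, not the cheap parse_fail one.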
Each leaderboard also exposes its data as JSON at /leaderboards/&lt;section&gt;.json — the same dict the HTML renders, jsonify'd unmodified.
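The "one dict, two views" pattern above can be sketched as follows. The structure and field names here are invented for illustration; only the pattern (serialize the render dict unmodified, as Flask's jsonify would) comes from the text:

```python
import json

def leaderboard_dict():
    # Hypothetical render dict; the real one is built by the site.
    return {
        "section": "local-models",
        "rubric_version": "v1",
        "rows": [{"model": "example-7b", "score": 0.82, "tier": "free"}],
    }

def as_json(board: dict) -> str:
    # The JSON endpoint serializes the exact dict the HTML template
    # receives -- no reshaping, so the two views cannot drift apart.
    return json.dumps(board)
```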
Conflict of interest
Hardware was self-funded. No vendor relationships. The repair shop that performed the 3090 FE pad swap charged $35 labor; pads and paste were owner-supplied. No sponsorships.
Back to: /leaderboards · /local-models · /hardware · /mods.