Leaderboards · Local models

Evidence-driven local routing.

I run six to nine local AI models on real work pulled from this repo's commit history, then promote or retire candidates based on acceptance rate, severity-weighted pass count, and family diversity. This page is the public face of that pipeline.

Scoring methodology: v1

Roles & incumbents

Every routing slot has one incumbent. Challengers must beat the incumbent on acceptance rate, review pain, and hard-failure rate without collapsing the role into a single training family.

draft

3 cases

Incumbent

gemma4:latest · google-gemma · free

Challengers

  • llama3.1:8b · meta-llama · free
  • granite3.3:8b · ibm-granite · free

reason

2 cases

Incumbent

deepseek-r1:14b · deepseek · free

No active challengers.

review

1 case

Incumbent

gemma4:26b · google-gemma · free

No active challengers.

coder

2 cases

Incumbent

qwen2.5-coder:14b · alibaba-qwen · free

No active challengers.

Latest run

Per-model pass count and median elapsed time from the most recent full run. Hard failures (silent-wrong + loud-wrong) are split out from parse failures because a silent-wrong failure weighs 3–5× more than a parse failure in aggregate scoring.

Source: reports/benchmarks/2026-05-03-granite-shakedown.json · Captured 2026-05-04T00:16:53Z

| Model | Family | Tier | Roles | Pass | Rate | Hard fail | Parse fail | Med elapsed |
|---|---|---|---|---|---|---|---|---|
| granite3.3:8b | ibm-granite | free | draft | 2/3 | 67% | 1 | 0 | 555 ms |
| hf.co/unsloth/granite-4.1-8b-GGUF:Q4_K_M | ibm-granite | free | draft | 2/3 | 67% | 1 | 0 | 801 ms |
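
The severity weighting described above can be sketched as a small scoring helper. The exact weights are assumptions: the page only says silent-wrong counts 3–5× more than a parse failure, so the numbers below (4.0 / 2.0 / 1.0) are placeholders, not the real v1 config.

```python
# Illustrative severity weights; the real methodology only pins
# silent-wrong at 3-5x a parse failure, so these are placeholders.
SEVERITY_WEIGHTS = {
    "silent_wrong": 4.0,  # worst: wrong answer that looks plausible
    "loud_wrong": 2.0,    # wrong, but obviously so
    "parse_fail": 1.0,    # output could not be parsed at all
}

def weighted_penalty(failures: dict[str, int]) -> float:
    """Sum failure counts scaled by their severity weight."""
    return sum(SEVERITY_WEIGHTS[kind] * n for kind, n in failures.items())

def score(passes: int, cases: int, failures: dict[str, int]) -> float:
    """Pass rate minus a penalty normalized to the worst severity."""
    penalty = weighted_penalty(failures) / (max(SEVERITY_WEIGHTS.values()) * cases)
    return passes / cases - penalty
```

Under these placeholder weights, a 2/3 run with one silent-wrong failure scores noticeably lower than the same pass rate with one parse failure, which is the whole point of splitting the columns in the table.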

Promotion policy

A challenger may replace an incumbent only if all of the following hold:

  • acceptance_rate ≥ the incumbent's
  • review_pain ≤ the incumbent's
  • hard_failure_rate ≤ the incumbent's
  • cost_tier is free (unless the incumbent is also paid)
  • family_diversity is preserved: promotion must not leave the role single-family

Family diversity guard. If the proposed promotion would leave a role served by only one `family` value across incumbent + challengers, reject regardless of other metrics (memory #895).
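
The gates above, including the diversity guard, can be sketched as a single predicate. This is a minimal sketch under two assumptions the page does not state: the `Candidate` record shape is invented, and a demoted incumbent is assumed to stay in the role's roster as a challenger.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    # Hypothetical record shape; field names mirror the policy's metrics.
    name: str
    family: str
    tier: str            # "free" or "paid"
    acceptance_rate: float
    review_pain: float
    hard_failure_rate: float

def may_promote(challenger: Candidate, incumbent: Candidate,
                other_challengers: list[Candidate]) -> bool:
    """All gates must pass; the family-diversity guard runs last."""
    if challenger.acceptance_rate < incumbent.acceptance_rate:
        return False
    if challenger.review_pain > incumbent.review_pain:
        return False
    if challenger.hard_failure_rate > incumbent.hard_failure_rate:
        return False
    if challenger.tier != "free" and incumbent.tier == "free":
        return False
    # Family diversity guard (memory #895): assume the demoted incumbent
    # remains as a challenger, then reject if the resulting roster is
    # served by a single family value.
    roster_families = {challenger.family, incumbent.family}
    roster_families |= {c.family for c in other_challengers if c is not challenger}
    return len(roster_families) > 1
```

For example, promoting a meta-llama challenger over a google-gemma incumbent passes the guard (two families remain), while promoting a second google-gemma model over the same incumbent is rejected regardless of its metrics.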

The board refreshes after every benchmark run; each run's summary lands in reports/local-model-benchmark-YYYY-MM-DD.md.

Run archive

Every full and partial run is archived under reports/benchmarks/. Newest first.

Methodology + severity weights: /leaderboards/methodology. JSON API: /leaderboards/local-models.json.