Leaderboards · Local models

I run six to nine local AI models on real work pulled from this repo's commit history, then promote or retire candidates based on acceptance rate, severity-weighted pass count, and family diversity. This page is the public face of that pipeline.

Scoring methodology: v1

Every routing slot has one incumbent. Challengers must beat the incumbent on acceptance rate, review pain, and hard-failure rate without collapsing the role into a single training family.

draft · 3 cases

Incumbent
gemma4:latest · google-gemma · free
Challengers
- llama3.1:8b · meta-llama · free
- granite3.3:8b · ibm-granite · free

Incumbent
deepseek-r1:14b · deepseek · free
No active challengers.

Incumbent
gemma4:26b · google-gemma · free
No active challengers.

Incumbent
qwen2.5-coder:14b · alibaba-qwen · free
No active challengers.
Per-model pass count + median elapsed from the most recent full run. Hard failures (silent-wrong + loud-wrong) are split out from parse failures because a silent-wrong failure is weighted 3–5× more heavily than a parse failure in aggregate scoring.
Source: reports/benchmarks/2026-05-03-granite-shakedown.json · Captured 2026-05-04T00:16:53Z
| Model | Family | Tier | Roles | Pass | Rate | Hard fail | Parse fail | Med elapsed |
|---|---|---|---|---|---|---|---|---|
| granite3.3:8b | ibm-granite | free | draft | 2/3 | 67% | 1 | 0 | 555 ms |
| hf.co/unsloth/granite-4.1-8b-GGUF:Q4_K_M | ibm-granite | free | draft | 2/3 | 67% | 1 | 0 | 801 ms |
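A severity-weighted score for a table row could be computed along these lines. This is a minimal sketch, not the pipeline's actual formula: the 4× hard-failure weight is an assumed midpoint of the stated 3–5× band, and the penalty shape (weighted failures normalized by case count, subtracted from raw pass rate) is illustrative.

```python
def score_row(passes: int, cases: int, hard_fails: int, parse_fails: int,
              hard_weight: float = 4.0) -> float:
    """Severity-weighted score for one benchmark row (higher is better).

    Raw pass rate minus a failure penalty in which each hard failure
    (silent-wrong or loud-wrong) counts hard_weight times as much as a
    parse failure. Weight and formula are assumptions, not the real pipeline.
    """
    penalty = (hard_fails * hard_weight + parse_fails) / cases
    return passes / cases - penalty

# granite3.3:8b row above: 2/3 passes, 1 hard failure, 0 parse failures.
draft_score = score_row(passes=2, cases=3, hard_fails=1, parse_fails=0)
```

Under this toy formula a single hard failure in three cases outweighs the two passes, which is the intended behavior: silent-wrong output is the expensive failure mode for a draft model.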
A challenger may replace an incumbent only if it beats on: acceptance_rate (≥ incumbent), review_pain (≤ incumbent), hard_failure_rate (≤ incumbent), cost_tier (free unless incumbent is also paid), AND preserves family_diversity (promotion must NOT leave the role single-family).
Family diversity guard. If the proposed promotion would leave a role served by only one `family` value across incumbent + challengers, reject regardless of other metrics (memory #895).
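The promotion gate above can be sketched as a single predicate. The metric field names (`acceptance_rate`, `review_pain`, `hard_failure_rate`, `cost_tier`, `family`) come from the rule text; the `Candidate` record shape, the function name, and the assumption that a demoted incumbent stays in the role as a challenger are mine:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    family: str            # training family, e.g. "ibm-granite"
    cost_tier: str         # "free" or "paid"
    acceptance_rate: float
    review_pain: float
    hard_failure_rate: float

def may_promote(challenger: Candidate, incumbent: Candidate,
                challengers: list[Candidate]) -> bool:
    """True only if the challenger clears every gate in the promotion rule."""
    if challenger.acceptance_rate < incumbent.acceptance_rate:
        return False
    if challenger.review_pain > incumbent.review_pain:
        return False
    if challenger.hard_failure_rate > incumbent.hard_failure_rate:
        return False
    # Cost gate: a paid challenger is allowed only against a paid incumbent.
    if challenger.cost_tier != "free" and incumbent.cost_tier == "free":
        return False
    # Family diversity guard (memory #895): after promotion the role is the
    # new incumbent plus the demoted incumbent and remaining challengers
    # (that the old incumbent stays on as a challenger is an assumption).
    remaining = [incumbent] + [c for c in challengers if c is not challenger]
    families = {challenger.family} | {c.family for c in remaining}
    return len(families) > 1
```

Note the guard rejects regardless of metrics: a challenger that beats the incumbent on every number still fails if promoting it would leave the role single-family.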
Updated after every benchmark run; summary in reports/local-model-benchmark-YYYY-MM-DD.md.
Every full and partial run is archived under reports/benchmarks/. Newest first.
Methodology + severity weights: /leaderboards/methodology. JSON API: /leaderboards/local-models.json.