Leaderboards · Local models

Evidence-driven local routing.

I run six to nine local AI models on real work pulled from this repo's commit history, then promote or retire candidates based on acceptance rate, severity-weighted pass count, and family diversity. This page is the public face of that pipeline.

Scoring methodology: v1

Roles & incumbents

Every routing slot has one incumbent. Challengers must beat the incumbent on acceptance rate, review pain, and hard-failure rate without collapsing the role into a single training family.

draft

3 cases

Incumbent

gemma4:latest · google-gemma · free

Challengers

  • llama3.1:8b · meta-llama · free
  • granite3.3:8b · ibm-granite · free

reason

2 cases

Incumbent

deepseek-r1:14b · deepseek · free

No active challengers.

review

1 case

Incumbent

gemma4:26b · google-gemma · free

No active challengers.

coder

2 cases

Incumbent

qwen2.5-coder:14b · alibaba-qwen · free

No active challengers.

Latest run

Per-model pass count and median elapsed time from the most recent full run. Hard failures (silent-wrong + loud-wrong) are split out from parse failures because a silent-wrong failure weighs 3–5× more than a parse failure in aggregate scoring.

Source: reports/benchmarks/2026-05-03-granite-shakedown.json · Captured 2026-05-04T00:16:53Z

| Model | Family | Tier | Roles | Pass | Rate | Hard fail | Parse fail | Med elapsed |
|---|---|---|---|---|---|---|---|---|
| granite3.3:8b | ibm-granite | free | draft | 2/3 | 67% | 1 | 0 | 555 ms |
| hf.co/unsloth/granite-4.1-8b-GGUF:Q4_K_M | ibm-granite | free | draft | 2/3 | 67% | 1 | 0 | 801 ms |
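
The severity weighting described above can be sketched as a small scoring helper. The exact weights are assumptions: the page only says silent-wrong counts 3–5× more than a parse failure, so the numbers below (4.0 / 2.0 / 1.0) are placeholders, not the real v1 config.

```python
# Illustrative severity weights; the real methodology only pins
# silent-wrong at 3-5x a parse failure, so these are placeholders.
SEVERITY_WEIGHTS = {
    "silent_wrong": 4.0,  # worst: wrong answer that looks plausible
    "loud_wrong": 2.0,    # wrong, but obviously so
    "parse_fail": 1.0,    # output could not be parsed at all
}

def weighted_penalty(failures: dict[str, int]) -> float:
    """Sum failure counts scaled by their severity weight."""
    return sum(SEVERITY_WEIGHTS[kind] * n for kind, n in failures.items())

def score(passes: int, cases: int, failures: dict[str, int]) -> float:
    """Pass rate minus a penalty normalized to the worst severity."""
    penalty = weighted_penalty(failures) / (max(SEVERITY_WEIGHTS.values()) * cases)
    return passes / cases - penalty
```

Under these placeholder weights, a 2/3 run with one silent-wrong failure scores noticeably lower than the same pass rate with one parse failure, which is the whole point of splitting the columns in the table.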

Promotion policy

A challenger may replace an incumbent only if all of the following hold:

  • acceptance_rate ≥ the incumbent's
  • review_pain ≤ the incumbent's
  • hard_failure_rate ≤ the incumbent's
  • cost_tier is free (unless the incumbent is also paid)
  • family_diversity is preserved: promotion must not leave the role single-family

Family diversity guard. If the proposed promotion would leave a role served by only one `family` value across incumbent + challengers, reject regardless of other metrics (memory #895).
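
The gates above, including the diversity guard, can be sketched as a single predicate. This is a minimal sketch under two assumptions the page does not state: the `Candidate` record shape is invented, and a demoted incumbent is assumed to stay in the role's roster as a challenger.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    # Hypothetical record shape; field names mirror the policy's metrics.
    name: str
    family: str
    tier: str            # "free" or "paid"
    acceptance_rate: float
    review_pain: float
    hard_failure_rate: float

def may_promote(challenger: Candidate, incumbent: Candidate,
                other_challengers: list[Candidate]) -> bool:
    """All gates must pass; the family-diversity guard runs last."""
    if challenger.acceptance_rate < incumbent.acceptance_rate:
        return False
    if challenger.review_pain > incumbent.review_pain:
        return False
    if challenger.hard_failure_rate > incumbent.hard_failure_rate:
        return False
    if challenger.tier != "free" and incumbent.tier == "free":
        return False
    # Family diversity guard (memory #895): assume the demoted incumbent
    # remains as a challenger, then reject if the resulting roster is
    # served by a single family value.
    roster_families = {challenger.family, incumbent.family}
    roster_families |= {c.family for c in other_challengers if c is not challenger}
    return len(roster_families) > 1
```

For example, promoting a meta-llama challenger over a google-gemma incumbent passes the guard (two families remain), while promoting a second google-gemma model over the same incumbent is rejected regardless of its metrics.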

The board refreshes after every benchmark run; each run's summary lands in reports/local-model-benchmark-YYYY-MM-DD.md.

Run archive

Every full and partial run is archived under reports/benchmarks/. Newest first.

Methodology + severity weights: /leaderboards/methodology. JSON API: /leaderboards/local-models.json.