
Local Model Benchmarks

Evidence-driven local routing.

I run six to nine local AI models on real work pulled from this repo's commit history, then promote or retire candidates based on acceptance rate, severity-weighted pass count, and family diversity. This page is the public face of that pipeline.

The cases come from real commits, not synthetic prompts. The promotion policy is mechanical — same metrics every run, same gates on every model. This is what I trust to decide whether to spend a Claude call or burn local GPU time.
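For concreteness, here is a minimal sketch of two of the headline metrics, assuming hypothetical per-case result records; the `accepted`/`severity` field names and the severity weights are illustrative, not the real schema.

```python
# Severity weights are an assumption; the real pipeline may weight differently.
SEVERITY_WEIGHT = {"low": 1, "medium": 2, "high": 3}

def acceptance_rate(results: list[dict]) -> float:
    """Fraction of benchmark cases whose output was accepted as-is."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["accepted"]) / len(results)

def severity_weighted_passes(results: list[dict]) -> int:
    """Passes counted by case severity, so a high-severity pass
    counts more than a trivial one."""
    return sum(SEVERITY_WEIGHT[r["severity"]] for r in results if r["accepted"])
```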

Roles & incumbents

Every routing slot has one incumbent. Challengers must beat the incumbent on acceptance rate, review pain, and hard-failure rate without collapsing the role into a single training family.

draft (3 cases)

Incumbent: gemma4:latest · google-gemma · free

Challengers:

  • llama3.1:8b · meta-llama · free
  • granite3.3:8b · ibm-granite · free

reason (2 cases)

Incumbent: deepseek-r1:14b · deepseek · free

No active challengers.

review (1 case)

Incumbent: gemma4:26b · google-gemma · free

No active challengers.

coder (2 cases)

Incumbent: qwen2.5-coder:14b · alibaba-qwen · free

No active challengers.
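To show how a slot hangs together as data, here is a hypothetical slice of the routing registry; the field names are illustrative, and the real registry lives under benchmarks/ in this site's repo and may differ.

```python
# Hypothetical registry slice mirroring the roles above.
REGISTRY = {
    "draft": {
        "cases": 3,
        "incumbent": {"model": "gemma4:latest", "family": "google-gemma", "cost_tier": "free"},
        "challengers": [
            {"model": "llama3.1:8b", "family": "meta-llama", "cost_tier": "free"},
            {"model": "granite3.3:8b", "family": "ibm-granite", "cost_tier": "free"},
        ],
    },
    "reason": {
        "cases": 2,
        "incumbent": {"model": "deepseek-r1:14b", "family": "deepseek", "cost_tier": "free"},
        "challengers": [],
    },
    # "review" and "coder" follow the same shape.
}
```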

Latest run

No archived runs yet. Check back after the next benchmark.

Promotion policy

A challenger may replace an incumbent only if it clears every gate (a code sketch follows the diversity guard below):

  • acceptance_rate ≥ incumbent's
  • review_pain ≤ incumbent's
  • hard_failure_rate ≤ incumbent's
  • cost_tier: free, unless the incumbent is also paid
  • family_diversity: the promotion must not leave the role single-family

Family diversity guard. If the proposed promotion would leave a role served by only one `family` value across incumbent + challengers, reject regardless of other metrics (memory #895).
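Taken together, the gates reduce to a small reject-by-default predicate. A minimal sketch, reusing the hypothetical metric fields from the earlier snippet; `remaining` is assumed to be the challengers that would stay in the role after promotion.

```python
def may_promote(challenger: dict, incumbent: dict, remaining: list[dict]) -> bool:
    """Mechanical promotion gates; any failed gate rejects the promotion."""
    if challenger["acceptance_rate"] < incumbent["acceptance_rate"]:
        return False
    if challenger["review_pain"] > incumbent["review_pain"]:
        return False
    if challenger["hard_failure_rate"] > incumbent["hard_failure_rate"]:
        return False
    # Paid challengers are eligible only against an already-paid incumbent.
    if challenger["cost_tier"] != "free" and incumbent["cost_tier"] == "free":
        return False
    # Family diversity guard (memory #895): after promotion the role must
    # still span more than one family across incumbent + challengers.
    families = {challenger["family"]} | {m["family"] for m in remaining}
    return len(families) > 1
```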

The gates are applied after every benchmark run; each run's summary lands in reports/local-model-benchmark-YYYY-MM-DD.md.

Methodology

Turbo escalation. When the review-role incumbent (gemma4:26b) returns a verify-reject on a code-review or security-review task, the runner retries against gpt-oss:120b-cloud before falling back to a frontier-API call. These retries are telemetry-tagged turbo-escalation.
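As a sketch of that ladder, assuming hypothetical run_model() and frontier_call() helpers; only the model names and the turbo-escalation tag come from this page.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    status: str   # e.g. "accept" or "verify-reject"
    output: str

def run_model(model: str, task: str, tag: str = "") -> Verdict:
    """Hypothetical wrapper around the local runner; stubbed here."""
    raise NotImplementedError

def frontier_call(task: str, tag: str = "") -> Verdict:
    """Hypothetical wrapper around the paid frontier API; stubbed here."""
    raise NotImplementedError

def review_with_escalation(task: str) -> Verdict:
    """Escalation ladder: 26b incumbent -> gpt-oss:120b-cloud -> frontier API."""
    verdict = run_model("gemma4:26b", task)
    if verdict.status != "verify-reject":
        return verdict
    verdict = run_model("gpt-oss:120b-cloud", task, tag="turbo-escalation")
    if verdict.status != "verify-reject":
        return verdict
    return frontier_call(task, tag="turbo-escalation")
```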

Last updated —. Want the underlying scripts, registry, or case JSON? They live in this site's repo under benchmarks/ and scripts/benchmark_local_models.py.