
Local Model Benchmarks

Evidence-driven local routing.

I run six to nine local AI models on real work pulled from this repo's commit history, then promote or retire candidates based on acceptance rate, severity-weighted pass count, and family diversity. This page is the public face of that pipeline.

The cases come from real commits, not synthetic prompts. The promotion policy is mechanical — same metrics every run, same gates on every model. This is what I trust to decide whether to spend a Claude call or burn local GPU time.
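For concreteness, here is a minimal sketch of two of the headline metrics, assuming hypothetical per-case result records; the `accepted`/`severity` field names and the severity weights are illustrative, not the real schema.

```python
# Severity weights are an assumption; the real pipeline may weight differently.
SEVERITY_WEIGHT = {"low": 1, "medium": 2, "high": 3}

def acceptance_rate(results: list[dict]) -> float:
    """Fraction of benchmark cases whose output was accepted as-is."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["accepted"]) / len(results)

def severity_weighted_passes(results: list[dict]) -> int:
    """Passes counted by case severity, so a high-severity pass
    counts more than a trivial one."""
    return sum(SEVERITY_WEIGHT[r["severity"]] for r in results if r["accepted"])
```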

Roles & incumbents

Every routing slot has one incumbent. Challengers must beat the incumbent on acceptance rate, review pain, and hard-failure rate without collapsing the role into a single training family.

draft (3 cases)

Incumbent: gemma4:latest · google-gemma · free

Challengers:

  • llama3.1:8b · meta-llama · free
  • granite3.3:8b · ibm-granite · free

reason (2 cases)

Incumbent: deepseek-r1:14b · deepseek · free

No active challengers.

review (1 case)

Incumbent: gemma4:26b · google-gemma · free

No active challengers.

coder (2 cases)

Incumbent: qwen2.5-coder:14b · alibaba-qwen · free

No active challengers.
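To show how a slot hangs together as data, here is a hypothetical slice of the routing registry; the field names are illustrative, and the real registry lives under benchmarks/ in this site's repo and may differ.

```python
# Hypothetical registry slice mirroring the roles above.
REGISTRY = {
    "draft": {
        "cases": 3,
        "incumbent": {"model": "gemma4:latest", "family": "google-gemma", "cost_tier": "free"},
        "challengers": [
            {"model": "llama3.1:8b", "family": "meta-llama", "cost_tier": "free"},
            {"model": "granite3.3:8b", "family": "ibm-granite", "cost_tier": "free"},
        ],
    },
    "reason": {
        "cases": 2,
        "incumbent": {"model": "deepseek-r1:14b", "family": "deepseek", "cost_tier": "free"},
        "challengers": [],
    },
    # "review" and "coder" follow the same shape.
}
```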

Latest run

No archived runs yet. Check back after the next benchmark.

Promotion policy

A challenger may replace an incumbent only if it clears every gate (a code sketch follows the diversity guard below):

  • acceptance_rate ≥ incumbent's
  • review_pain ≤ incumbent's
  • hard_failure_rate ≤ incumbent's
  • cost_tier: free, unless the incumbent is also paid
  • family_diversity: the promotion must not leave the role single-family

Family diversity guard. If the proposed promotion would leave a role served by only one `family` value across incumbent + challengers, reject regardless of other metrics (memory #895).
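Taken together, the gates reduce to a small reject-by-default predicate. A minimal sketch, reusing the hypothetical metric fields from the earlier snippet; `remaining` is assumed to be the challengers that would stay in the role after promotion.

```python
def may_promote(challenger: dict, incumbent: dict, remaining: list[dict]) -> bool:
    """Mechanical promotion gates; any failed gate rejects the promotion."""
    if challenger["acceptance_rate"] < incumbent["acceptance_rate"]:
        return False
    if challenger["review_pain"] > incumbent["review_pain"]:
        return False
    if challenger["hard_failure_rate"] > incumbent["hard_failure_rate"]:
        return False
    # Paid challengers are eligible only against an already-paid incumbent.
    if challenger["cost_tier"] != "free" and incumbent["cost_tier"] == "free":
        return False
    # Family diversity guard (memory #895): after promotion the role must
    # still span more than one family across incumbent + challengers.
    families = {challenger["family"]} | {m["family"] for m in remaining}
    return len(families) > 1
```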

The gates are applied after every benchmark run; each run's summary lands in reports/local-model-benchmark-YYYY-MM-DD.md.

Methodology

Turbo escalation. When the review-role incumbent (gemma4:26b) returns a verify-reject on a code-review or security-review task, the runner retries against gpt-oss:120b-cloud before falling back to a frontier-API call. These retries are telemetry-tagged turbo-escalation.
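As a sketch of that ladder, assuming hypothetical run_model() and frontier_call() helpers; only the model names and the turbo-escalation tag come from this page.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    status: str   # e.g. "accept" or "verify-reject"
    output: str

def run_model(model: str, task: str, tag: str = "") -> Verdict:
    """Hypothetical wrapper around the local runner; stubbed here."""
    raise NotImplementedError

def frontier_call(task: str, tag: str = "") -> Verdict:
    """Hypothetical wrapper around the paid frontier API; stubbed here."""
    raise NotImplementedError

def review_with_escalation(task: str) -> Verdict:
    """Escalation ladder: 26b incumbent -> gpt-oss:120b-cloud -> frontier API."""
    verdict = run_model("gemma4:26b", task)
    if verdict.status != "verify-reject":
        return verdict
    verdict = run_model("gpt-oss:120b-cloud", task, tag="turbo-escalation")
    if verdict.status != "verify-reject":
        return verdict
    return frontier_call(task, tag="turbo-escalation")
```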

Last updated —. Want the underlying scripts, registry, or case JSON? They live in this site's repo under benchmarks/ and scripts/benchmark_local_models.py.