Local Model Benchmarks

I run six to nine local AI models on real work pulled from this repo's commit history, then promote or retire candidates based on acceptance rate, severity-weighted pass count, and family diversity. This page is the public face of that pipeline.

The cases come from real commits, not synthetic prompts. The promotion policy is mechanical: the same metrics on every run, the same gates for every model. This is what I trust to decide whether to spend a Claude call or burn local GPU time.

Every routing slot has one incumbent. Challengers must beat the incumbent on acceptance rate, review pain, and hard-failure rate without collapsing the role into a single training family.

Incumbent
gemma4:latest · google-gemma · free · 3 cases
Challengers
- llama3.1:8b · meta-llama · free
- granite3.3:8b · ibm-granite · free
Incumbent
deepseek-r1:14b · deepseek · free
No active challengers.
Incumbent
gemma4:26b · google-gemma · free
No active challengers.
Incumbent
qwen2.5-coder:14b · alibaba-qwen · free
No active challengers.
No archived runs yet. Check back after the next benchmark.
A challenger may replace an incumbent only if it clears every gate:

- acceptance_rate ≥ incumbent's
- review_pain ≤ incumbent's
- hard_failure_rate ≤ incumbent's
- cost_tier: free, unless the incumbent is also paid
- family_diversity preserved: the promotion must not leave the role single-family

Family diversity guard. If the proposed promotion would leave a role served by only one `family` value across incumbent + challengers, reject regardless of other metrics (memory #895).
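The gates above can be sketched as a single predicate. The metric names follow the gate list and the `family` field from the slot listings, but the exact registry schema (plain dicts here) is an assumption, not the real pipeline's data model:

```python
def may_promote(challenger: dict, incumbent: dict, roster_families: set) -> bool:
    """Return True only if the challenger clears every promotion gate.

    `challenger` / `incumbent` are assumed dicts with keys acceptance_rate,
    review_pain, hard_failure_rate, cost_tier. `roster_families` is the set
    of `family` values that would serve the role after promotion
    (hypothetical shape, not the real registry).
    """
    if challenger["acceptance_rate"] < incumbent["acceptance_rate"]:
        return False
    if challenger["review_pain"] > incumbent["review_pain"]:
        return False
    if challenger["hard_failure_rate"] > incumbent["hard_failure_rate"]:
        return False
    # cost gate: challenger must be free unless the incumbent is already paid
    if challenger["cost_tier"] != "free" and incumbent["cost_tier"] == "free":
        return False
    # family-diversity guard (memory #895): never leave the role single-family
    if len(roster_families) < 2:
        return False
    return True

# Example: a free challenger beating the incumbent on all three metrics
incumbent = {"acceptance_rate": 0.71, "review_pain": 2.4,
             "hard_failure_rate": 0.05, "cost_tier": "free"}
challenger = {"acceptance_rate": 0.75, "review_pain": 2.1,
              "hard_failure_rate": 0.04, "cost_tier": "free"}
print(may_promote(challenger, incumbent, {"meta-llama", "ibm-granite"}))  # True
print(may_promote(challenger, incumbent, {"meta-llama"}))  # False: single-family
```

Note that the diversity guard is checked last but rejects unconditionally, matching the rule that it overrides all other metrics.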
Updated after every benchmark run; the summary lands in reports/local-model-benchmark-YYYY-MM-DD.md.
Each case in benchmarks/cases/*.json references a source commit so the task shape stays grounded.
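A minimal case file might look like the following. Only the source-commit grounding is stated on this page, so every field name below is an illustrative assumption rather than the real schema:

```python
import json

# Hypothetical case-file shape: the page only guarantees that each case
# references a source commit; id, role, prompt, and expected are assumed.
case = json.loads("""
{
  "id": "example-case",
  "source_commit": "abcdef0",
  "role": "code-review",
  "prompt": "Review this diff for correctness.",
  "expected": {"verdict": "accept"}
}
""")

# The one invariant the page does state: the case is grounded in a commit.
assert case["source_commit"]
```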
Turbo escalation. When the review-role incumbent (gemma4:26b) returns a verify-reject on a code-review or security-review task, the runner retries against gpt-oss:120b-cloud before falling back to a frontier-API call. These retries are telemetry-tagged turbo-escalation.
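The escalation path amounts to a small retry ladder. This sketch injects `run_model` and `log_telemetry` as placeholders for whatever the real runner uses; the verdict strings and the "frontier-api" fallback name are assumptions:

```python
def review_with_escalation(task, run_model, log_telemetry):
    """Retry ladder for code-review / security-review tasks (sketch).

    run_model(model_name, task) is assumed to return "accept" or
    "verify-reject"; log_telemetry(tag, task) records the escalation.
    Returns (verdict, model_that_produced_it).
    """
    # First pass: the review-role incumbent
    verdict = run_model("gemma4:26b", task)
    if verdict != "verify-reject":
        return verdict, "gemma4:26b"

    # 26b verify-reject: retry against the cloud tier, tagged for telemetry
    log_telemetry("turbo-escalation", task)
    verdict = run_model("gpt-oss:120b-cloud", task)
    if verdict != "verify-reject":
        return verdict, "gpt-oss:120b-cloud"

    # Last resort: frontier-API call (placeholder name)
    return run_model("frontier-api", task), "frontier-api"
```

The ordering matters: telemetry is tagged before the cloud retry, so every escalation is counted even if the cloud tier also rejects.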
Last updated —. Want the underlying scripts, registry, or case JSON? They live in this site's repo under benchmarks/ and scripts/benchmark_local_models.py.