Reproducible benchmarks comparing open-source AI code review tools against curated PRs with known issues
Real-world PRs with known issues spanning security, bugs, and performance
Each AI reviewer analyzes the same PR diffs under controlled conditions
Measure precision, recall, and F1 against ground truth issues
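The metrics above follow the standard definitions; a minimal sketch (the function name and counts are illustrative, not part of the benchmark's code):

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from issue counts:
    tp = findings matched to a ground-truth issue,
    fp = findings with no matching issue,
    fn = ground-truth issues no tool finding matched."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, a tool that reports 10 findings, 8 of which match real issues, against a PR seeded with 10 issues scores 0.8 precision and 0.8 recall.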
How we measure tool performance: transparent, reproducible, and open source.
Each tool's findings are matched against ground truth issues using two signals: a heuristic similarity score and an LLM judge's confidence.
Final score = 40% heuristic + 60% LLM confidence. One-to-one matching prevents inflated scores.
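The scoring and matching rules above can be sketched as follows. The weights come straight from the stated formula; the greedy matching strategy, threshold, and function names are assumptions for illustration (the benchmark may use a different matching algorithm):

```python
W_HEURISTIC, W_LLM = 0.4, 0.6  # weights from the stated scoring formula

def match_score(heuristic: float, llm_confidence: float) -> float:
    """Blend heuristic similarity with LLM judge confidence (both in [0, 1])."""
    return W_HEURISTIC * heuristic + W_LLM * llm_confidence

def one_to_one_matches(pairs, threshold=0.5):
    """Greedy one-to-one matching: each finding and each ground-truth issue
    is used at most once, so duplicate findings cannot inflate the score.
    `pairs` is a list of (finding_id, issue_id, score) tuples; the 0.5
    threshold is a placeholder, not the benchmark's actual cutoff."""
    matched_findings, matched_issues, matches = set(), set(), []
    for finding, issue, score in sorted(pairs, key=lambda p: -p[2]):
        if score < threshold:
            break  # remaining pairs score even lower
        if finding in matched_findings or issue in matched_issues:
            continue  # one side already consumed by a better match
        matched_findings.add(finding)
        matched_issues.add(issue)
        matches.append((finding, issue, score))
    return matches
```

Note how a tool that reports the same issue twice gets credit only once: the second finding cannot claim an already-matched ground-truth issue.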
Review Agents (PR-Agent, Shippie) are dedicated tools with their own prompts, parsing, and review logic.
Pure Model Baselines (OpenAI, Claude, Gemini) are general-purpose LLMs called directly with a shared prompt. They measure what a raw LLM can do without specialized tooling.
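The baseline setup can be sketched as below. The prompt text, JSON field names, and `call_model` hook are all hypothetical; the point is that every provider receives the identical prompt, so the comparison isolates the model rather than the tooling around it:

```python
# Hypothetical shared prompt; the benchmark's real prompt may differ.
SHARED_PROMPT = (
    "You are a code reviewer. Review the following PR diff and list each "
    "issue you find as a JSON object with file, line, category, and "
    "description fields.\n\nDiff:\n{diff}"
)

def baseline_review(call_model, diff: str) -> str:
    """Run a pure-model baseline. `call_model` wraps whichever provider SDK
    is in use (OpenAI, Claude, Gemini); the prompt is the only input, with
    no tool-specific parsing or review logic in front of the model."""
    return call_model(SHARED_PROMPT.format(diff=diff))
```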
Add your favorite AI code reviewer or contribute new challenge scenarios