Reproducible benchmarks comparing open-source AI code review tools against curated PRs with known issues
Real-world PRs with known issues spanning security, bugs, and performance
Each AI reviewer analyzes the same PR diffs under controlled conditions
Measure precision, recall, and F1 against ground truth issues
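The metrics above follow the standard definitions; a minimal sketch (the function name and counts are illustrative, not part of the benchmark's code):

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from issue counts:
    tp = findings matched to a ground-truth issue,
    fp = findings with no matching issue,
    fn = ground-truth issues no tool finding matched."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, a tool that reports 10 findings, 8 of which match real issues, against a PR seeded with 10 issues scores 0.8 precision and 0.8 recall.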
How we measure tool performance: transparent, reproducible, and open source.
Each tool's findings are matched against ground truth issues using two signals: a heuristic similarity score and an LLM judge's confidence.
Final score = 40% heuristic + 60% LLM confidence. One-to-one matching prevents inflated scores.
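The scoring and matching rules above can be sketched as follows. The weights come straight from the stated formula; the greedy matching strategy, threshold, and function names are assumptions for illustration (the benchmark may use a different matching algorithm):

```python
W_HEURISTIC, W_LLM = 0.4, 0.6  # weights from the stated scoring formula

def match_score(heuristic: float, llm_confidence: float) -> float:
    """Blend heuristic similarity with LLM judge confidence (both in [0, 1])."""
    return W_HEURISTIC * heuristic + W_LLM * llm_confidence

def one_to_one_matches(pairs, threshold=0.5):
    """Greedy one-to-one matching: each finding and each ground-truth issue
    is used at most once, so duplicate findings cannot inflate the score.
    `pairs` is a list of (finding_id, issue_id, score) tuples; the 0.5
    threshold is a placeholder, not the benchmark's actual cutoff."""
    matched_findings, matched_issues, matches = set(), set(), []
    for finding, issue, score in sorted(pairs, key=lambda p: -p[2]):
        if score < threshold:
            break  # remaining pairs score even lower
        if finding in matched_findings or issue in matched_issues:
            continue  # one side already consumed by a better match
        matched_findings.add(finding)
        matched_issues.add(issue)
        matches.append((finding, issue, score))
    return matches
```

Note how a tool that reports the same issue twice gets credit only once: the second finding cannot claim an already-matched ground-truth issue.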
Review Agents (PR-Agent, Shippie) are dedicated tools with their own prompts, parsing, and review logic.
Pure Model Baselines (OpenAI, Claude, Gemini) are general-purpose LLMs called directly with a shared prompt. They measure what a raw LLM can do without specialized tooling.
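The baseline setup can be sketched as below. The prompt text, JSON field names, and `call_model` hook are all hypothetical; the point is that every provider receives the identical prompt, so the comparison isolates the model rather than the tooling around it:

```python
# Hypothetical shared prompt; the benchmark's real prompt may differ.
SHARED_PROMPT = (
    "You are a code reviewer. Review the following PR diff and list each "
    "issue you find as a JSON object with file, line, category, and "
    "description fields.\n\nDiff:\n{diff}"
)

def baseline_review(call_model, diff: str) -> str:
    """Run a pure-model baseline. `call_model` wraps whichever provider SDK
    is in use (OpenAI, Claude, Gemini); the prompt is the only input, with
    no tool-specific parsing or review logic in front of the model."""
    return call_model(SHARED_PROMPT.format(diff=diff))
```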
Add your favorite AI code reviewer or contribute new challenge scenarios