I spend most of my day writing code with AI assistants. I got tired of reading comparisons that clearly never ran the tools they ranked. So I built this: side-by-side evaluations on the same codebase, same tasks, measured results. The leaderboard updates when the numbers change.
Every tool runs the same 42 tasks on a frozen snapshot of a real codebase. Same prompts, same models where available, same acceptance criteria. Click any column to re-sort.
| # | Tool | Score ▼ |
|---|------|---------|
Each card is the short version of a 3,000-word review. Honest pros, honest cons, and the specific case each tool is actually best for — not a generic “it depends.”
The full set is 42 tasks across refactoring, bug fixing, greenfield features, test writing, and migrations. These four illustrate where tools diverge hardest.
Anyone can publish a vibes-based ranking. I think the only comparison worth reading is one you could reproduce. So the methodology is open, the task suite is open, and every score has a log file behind it.
Read the full methodology →
A real 84,000-line TypeScript + Python monorepo at a pinned commit. Not a toy. Not something I wrote yesterday to flatter the tools I like.
Every tool gets the exact same prompt text. No per-tool prompt engineering. If a tool needs different prompting to work well, that’s a finding, not a handicap.
Each task has a list of checks written before any tool is run. Tests pass or they don’t. Lint passes or it doesn’t. “Looks fine” is not a criterion.
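To make "pass or they don't" concrete, here is a minimal sketch of how a task's checks could be evaluated. It is an illustration, not the site's actual harness: the `Check` class, the `run_checks` function, and the stand-in commands are all hypothetical, and the only assumption is that each check is a command whose exit code decides the result.

```python
from __future__ import annotations

import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """One acceptance check, fixed before any tool runs (hypothetical shape)."""
    name: str
    command: list[str]  # e.g. a test or lint command; exit code 0 means pass


def run_checks(checks, cwd=None):
    """Run every check and return (all_passed, per-check results).

    A check passes only if its command exits with status 0.
    There is no partial credit and no "looks fine" override.
    """
    results = {}
    for check in checks:
        completed = subprocess.run(check.command, cwd=cwd, capture_output=True)
        results[check.name] = completed.returncode == 0
    return all(results.values()), results


if __name__ == "__main__":
    # Stand-in commands so the sketch runs anywhere; a real task would
    # invoke its own test runner and linter here.
    demo = [
        Check("tests", [sys.executable, "-c", "raise SystemExit(0)"]),
        Check("lint", [sys.executable, "-c", "raise SystemExit(1)"]),
    ]
    passed, detail = run_checks(demo)
    print(passed, detail)
```

The design choice worth noting is that the task result is the conjunction of all checks, so one failing lint run fails the task even when every test passes.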
AI tools are nondeterministic. Every task is run three times. I report the median, and I flag when the runs disagree — because that’s a quality signal too.
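The three-run rule can be sketched in a few lines. This is an assumption-heavy illustration: the name `summarize_runs`, the numeric per-run score, and the `tolerance` threshold for "the runs disagree" are all hypothetical, since the exact disagreement rule isn't spelled out here.

```python
from statistics import median


def summarize_runs(scores, tolerance=0.0):
    """Summarize three runs of one task for one tool.

    scores: three numeric per-run scores (assumed shape).
    tolerance: largest max-min spread still counted as agreement
               (hypothetical; 0.0 means any difference is flagged).
    """
    if len(scores) != 3:
        raise ValueError("expected exactly three runs")
    spread = max(scores) - min(scores)
    return {
        "median": median(scores),
        "spread": spread,
        "runs_disagree": spread > tolerance,
    }


if __name__ == "__main__":
    # Two passes and one failure: the median is a pass, but the flag is set.
    print(summarize_runs([1.0, 0.0, 1.0]))
```

Reporting the median keeps one unlucky run from sinking a task, while the flag preserves the fact that the tool was not consistent.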
Every run’s transcript, diff, and cost breakdown is in the repo. If a score looks wrong, open the log and tell me. Several scores have changed because readers did exactly that.
Every tool on this site is accessed through a paid personal account. No free vendor seats, no press access, no “evaluation licenses” that come with an expectation.
Some links earn a commission, all of them disclosed. The ranking is finalized from the benchmark results before I check which programs I’m enrolled in.
New models drop constantly. I re-run the full suite monthly. When a tool moves up or down, the commit history shows exactly when and why.
A summary of the week’s re-run benchmarks, any score changes, and one short review. Four-minute read. No “sponsored pick.” Unsubscribe with one click.