Benchmarks re-run this morning · Apr 20, 2026 · 09:14 UTC
v4.2 methodology · 18 tools tracked · 42 tasks
Independent · No sponsored rankings

AI developer tools, actually tested, not just described.

I spend most of my day writing code with AI assistants. I got tired of reading comparisons that clearly never ran the tools they ranked. So I built this: side-by-side evaluations on the same codebase, same tasks, measured results. The leaderboard updates when the numbers change.

18 tools tracked · 42 real-world tasks · 612h test time logged · $0 sponsor $
The leaderboard · Apr 2026

Eighteen tools. Forty-two tasks. One scoreboard.

Every tool runs the same 42 tasks on a frozen snapshot of a real codebase. Same prompts, same models where available, same acceptance criteria. Click any column to re-sort.

LAST RUN: APR 20 · 09:14 UTC
NEXT RE-RUN: APR 27
Legend: passed · partial · failed
Top 10 · sorted by composite score · lower is better where noted
# Tool Score Passed Median time $/task Ctx Notable for Last run
Showing 10 of 18 tracked tools
Methodology v4.2 · composite = 0.45·pass + 0.25·time + 0.15·cost + 0.15·code-quality
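The composite formula above can be sketched as a plain weighted sum. The weights are the published v4.2 ones; everything else here is an assumption for illustration: the page does not say how each component is scaled, so this sketch assumes every sub-score has already been mapped to the range 0 to 1 with higher meaning better, and it uses a hypothetical min-max helper (`normalize_lower_is_better`) to flip lower-is-better metrics such as time and cost.

```python
# Weights as published in methodology v4.2.
WEIGHTS = {"pass": 0.45, "time": 0.25, "cost": 0.15, "quality": 0.15}


def normalize_lower_is_better(value: float, best: float, worst: float) -> float:
    """Map a lower-is-better metric (seconds, dollars) onto 0..1, where 1.0 is best.

    ASSUMPTION: min-max scaling against the best and worst observed values.
    The real suite may scale these metrics differently.
    """
    if worst == best:
        return 1.0  # every tool tied, so nobody is penalized
    score = (worst - value) / (worst - best)
    return max(0.0, min(1.0, score))  # clamp values outside the observed range


def composite(pass_rate: float, time_score: float,
              cost_score: float, quality_score: float) -> float:
    """Weighted sum of four sub-scores, each expected in 0..1 (higher is better)."""
    parts = {
        "pass": pass_rate,
        "time": time_score,
        "cost": cost_score,
        "quality": quality_score,
    }
    for name, value in parts.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} sub-score must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Made-up numbers, not results from the leaderboard.
    time_score = normalize_lower_is_better(value=30.0, best=20.0, worst=60.0)
    print(composite(pass_rate=0.5, time_score=time_score,
                    cost_score=1.0, quality_score=0.0))
```

Because the weights sum to 1, a tool that is best on every component gets a composite of 1 under these assumptions, and pass rate alone accounts for almost half of the total.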
In-depth · verdicts

The six I actually reach for.

Each card is the short version of a 3,000-word review. Honest pros, honest cons, and the specific case each tool is actually best for — not a generic “it depends.”

Task spotlight

Four of the tasks from the benchmark.

The full set is 42 tasks across refactoring, bug fixing, greenfield features, test writing, and migrations. These four illustrate where tools diverge hardest.

Every task is open source
github.com/ai-tools-compared/suite
How I test

Same codebase. Same tasks. Measured results.

Anyone can publish a vibes-based ranking. I think the only comparison worth reading is one you could reproduce. So the methodology is open, the task suite is open, and every score has a log file behind it.

Read the full methodology
  1. Frozen codebase

    A real 84,000-line TypeScript + Python monorepo at a pinned commit. Not a toy. Not something I wrote yesterday to flatter the tools I like.

  2. Identical prompts

    Every tool gets the exact same prompt text. No per-tool prompt engineering. If a tool needs different prompting to work well, that’s a finding, not a handicap.

  3. Acceptance criteria up front

    Each task has a list of checks written before any tool is run. Tests pass or they don’t. Lint passes or it doesn’t. “Looks fine” is not a criterion.

  4. Three runs per task

    AI tools are nondeterministic. Every task is run three times. I report the median, and I flag when the runs disagree — because that’s a quality signal too.

  5. Public logs

    Every run’s transcript, diff, and cost breakdown is in the repo. If a score looks wrong, open the log and tell me. Several scores have changed because readers did exactly that.
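Steps 3 and 4 above can be sketched in a few lines: a run passes only if every acceptance check passes, each task is run three times, the median time is reported, and disagreement between runs is flagged. The names `Run` and `summarize` and the check names used in the example are hypothetical; the actual suite's data model is not described on this page, and this sketch deliberately does not guess how a task-level verdict is derived from the three runs.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Run:
    """One execution of a task by one tool (hypothetical structure)."""
    checks: dict[str, bool]  # acceptance check name -> did it pass
    seconds: float           # wall-clock time for this run

    @property
    def passed(self) -> bool:
        # A run with no checks is not a pass: criteria are written up front.
        return bool(self.checks) and all(self.checks.values())


def summarize(runs: list[Run]) -> dict:
    """Summarize the three runs of a single task."""
    if len(runs) != 3:
        raise ValueError("each task is expected to have exactly three runs")
    outcomes = [run.passed for run in runs]
    return {
        "median_seconds": median(run.seconds for run in runs),
        "outcomes": outcomes,
        # True when the runs did not all end the same way.
        "runs_disagree": len(set(outcomes)) > 1,
    }


if __name__ == "__main__":
    # Made-up runs, not entries from the public logs.
    runs = [
        Run({"tests": True, "lint": True}, seconds=40.0),
        Run({"tests": True, "lint": False}, seconds=55.0),
        Run({"tests": True, "lint": True}, seconds=48.0),
    ]
    print(summarize(runs))
```

Keeping the per-run outcomes next to the median makes the disagreement visible in the summary itself, without needing to open the individual logs.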

This week’s deep dive

The latest review.

NEW REVIEW
EVERY TUESDAY

AI Assistants for Creating Security Architecture Review

Learn how AI assistants can help automate security architecture reviews by analyzing your codebase and generating documentation.

40h test time each · 12 real tasks · $84 vs $61 cost
Read the full comparison
~/monorepo $ claude › refactor auth/ for testability
Read server.ts (2,341 lines)
Read auth/*.ts (6 files)
Traced 14 call sites
Proposing plan...
  1. Extract AuthMiddleware class
  2. Split session + token logic
  3. Add unit test harness
  4. Preserve all existing exports
? Shall I proceed?
Recently updated

Latest articles.

Browse all
Browse by topic

Find the right comparison.

“I started this site because every ‘best AI coding tool’ article I read was clearly written by someone who’d opened the product once. If you want numbers instead of adjectives, welcome.”
The Author · writes every review on this site · day job: staff engineer
How I stay honest

No sponsored rankings. Ever.

I pay for my own subscriptions

Every tool on this site is accessed through a paid personal account. No free vendor seats, no press access, no “evaluation licenses” that come with an expectation.

The rank is set before affiliate links

Some links earn a commission, all of them disclosed. The ranking is finalized from the benchmark results before I check which programs I’m enrolled in.

Re-ranked every month

New models drop constantly. I re-run the full suite monthly. When a tool moves up or down, the commit history shows exactly when and why.

One email. Every Tuesday. Numbers, not hype.

A summary of the week’s re-run benchmarks, any score changes, and one short review. Four-minute read. No “sponsored pick.” Unsubscribe with one click.

24,300 subscribers · 58% open rate · 0 sponsored sends