Independent · No sponsored rankings

AI developer tools, actually tested — not just described.

I spend most of my day writing code with AI assistants. I got tired of reading comparisons that clearly never ran the tools they ranked. So I built this: side-by-side evaluations on the same codebase, same tasks, measured results. The leaderboard updates when the numbers change.

See the leaderboard → How I test

18 tools tracked

42 real-world tasks

612h test time logged

sponsor $

bench.sh — ai-tools-compared

# task: refactor 2,341-line legacy auth module

./bench run task-17 --all-tools

Running task-17 on 6 tools · same repo · same prompt

───────────────────────────────────────────────────

claude-code ✓ passed (14m 02s · 3,214 tok)

cursor-composer ✓ passed (18m 44s · 4,801 tok)

copilot-agent ◐ partial (22m 08s · 5,602 tok)

windsurf-cascade ✓ passed (16m 51s · 3,988 tok)

aider ✓ passed (12m 33s · 2,104 tok)

cline ✗ failed (timeout at 30m)

───────────────────────────────────────────────────

✓ 4/6 passed · 1 partial · 1 failed · avg 16m 52s over 5 completed runs · logs in ./runs/0420-1714/

The leaderboard · Apr 2026

Eighteen tools. Forty-two tasks. One scoreboard.

Every tool runs the same 42 tasks on a frozen snapshot of a real codebase. Same prompts, same models where available, same acceptance criteria. Click any column to re-sort.

LAST RUN: APR 20 · 09:14 UTC
NEXT RE-RUN: APR 27

Legend: passed · partial · failed

Top 10 · sorted by composite score · lower is better where noted

#	Tool	Score ▼	Passed	Median time	$/task	Ctx	Notable for	Last run

Showing 10 of 18 tracked tools · View all → Methodology v4.2 · composite = 0.45·pass + 0.25·time + 0.15·cost + 0.15·code-quality
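To make the composite formula concrete, here is a minimal sketch of how the four weighted terms could be combined. Only the weights come from the methodology line above. Everything else is an assumption for illustration: the `ToolResult` fields, the function names, the idea that a higher composite is better, and the min-max scaling that turns the lower-is-better columns (time and cost) into 0 to 1 scores. The published methodology may normalize differently.

```python
from dataclasses import dataclass

# Weights as stated in Methodology v4.2.
WEIGHTS = {"pass": 0.45, "time": 0.25, "cost": 0.15, "quality": 0.15}


@dataclass
class ToolResult:
    # Hypothetical record layout, not the suite's actual schema.
    name: str
    pass_rate: float       # fraction of tasks passed, 0..1
    median_minutes: float  # lower is better
    cost_per_task: float   # dollars, lower is better
    quality: float         # code-quality score, assumed already on 0..1


def inverted_min_max(value: float, values: list[float]) -> float:
    """Scale a lower-is-better metric to 0..1, where 1.0 is the lowest value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 1.0  # every tool tied on this metric
    return (hi - value) / (hi - lo)


def composite_scores(results: list[ToolResult]) -> dict[str, float]:
    """Return a composite score per tool name (assumed higher is better)."""
    times = [r.median_minutes for r in results]
    costs = [r.cost_per_task for r in results]
    scores = {}
    for r in results:
        scores[r.name] = (
            WEIGHTS["pass"] * r.pass_rate
            + WEIGHTS["time"] * inverted_min_max(r.median_minutes, times)
            + WEIGHTS["cost"] * inverted_min_max(r.cost_per_task, costs)
            + WEIGHTS["quality"] * r.quality
        )
    return scores
```

Because the time and cost terms are scaled relative to the other tools in the same run, a tool's composite under this sketch can shift when a competitor gets faster or cheaper, even if its own numbers stay the same.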

Task spotlight

Four of the tasks from the benchmark.

The full set is 42 tasks across refactoring, bug fixing, greenfield features, test writing, and migrations. These four illustrate where tools diverge hardest.

Every task is open source

github.com/ai-tools-compared/suite

How I test

Same codebase. Same tasks. Measured results.

Anyone can publish a vibes-based ranking. I think the only comparison worth reading is one you could reproduce. So the methodology is open, the task suite is open, and every score has a log file behind it.

Read the full methodology →

Frozen codebase

A real 84,000-line TypeScript + Python monorepo at a pinned commit. Not a toy. Not something I wrote yesterday to flatter the tools I like.

Identical prompts

Every tool gets the exact same prompt text. No per-tool prompt engineering. If a tool needs different prompting to work well, that’s a finding, not a handicap.

Acceptance criteria up front

Each task has a list of checks written before any tool is run. Tests pass or they don’t. Lint passes or it doesn’t. “Looks fine” is not a criterion.

Three runs per task

AI tools are nondeterministic. Every task is run three times. I report the median, and I flag when the runs disagree, because that’s a quality signal too.

Public logs

Every run’s transcript, diff, and cost breakdown is in the repo. If a score looks wrong, open the log and tell me. Several scores have changed because readers did exactly that.
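The three-runs rule can be sketched in a few lines. This is an illustration under stated assumptions, not the suite's actual code: the outcome labels mirror the passed/partial/failed legend, while `summarize_runs`, its return keys, and the choice to take the median over both outcome and wall-clock minutes are hypothetical.

```python
from statistics import median

PASS, PARTIAL, FAIL = "passed", "partial", "failed"
# Ordering used to pick a "median" outcome: failed < partial < passed.
_RANK = {FAIL: 0, PARTIAL: 1, PASS: 2}


def summarize_runs(outcomes: list[str], minutes: list[float]) -> dict:
    """Collapse repeated runs of one task into a single reported result."""
    if len(outcomes) != len(minutes) or not outcomes:
        raise ValueError("need one duration per outcome, and at least one run")
    # Sort outcomes from worst to best and take the middle one.
    # With three runs this is the true median outcome.
    ranked = sorted(outcomes, key=_RANK.__getitem__)
    middle = ranked[len(ranked) // 2]
    return {
        "outcome": middle,
        "median_minutes": median(minutes),
        # Any mix of outcomes across runs is flagged for the reader.
        "runs_disagree": len(set(outcomes)) > 1,
    }
```

For example, `summarize_runs(["passed", "failed", "passed"], [14.0, 30.0, 12.5])` reports a passed outcome with a 14.0-minute median and sets `runs_disagree` to `True`, which is the case the flag exists for: the tool can do the task, just not reliably.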

This week’s deep dive

The latest review.

NEW REVIEW
EVERY TUESDAY

AI Assistants for Creating Security Architecture Review

Learn how AI assistants can help automate security architecture reviews by analyzing your codebase and generating documentation.

40h test time each · 12 real tasks · $84 vs $61 cost

Read the full comparison →

~/monorepo $ claude › refactor auth/ for testability

⏺ Read server.ts (2,341 lines)

⏺ Read auth/*.ts (6 files)

⏺ Traced 14 call sites

⏺ Proposing plan...

1. Extract AuthMiddleware class

2. Split session + token logic

3. Add unit test harness

4. Preserve all existing exports

? Shall I proceed?

“I started this site because every ‘best AI coding tool’ article I read was clearly written by someone who’d opened the product once. If you want numbers instead of adjectives, welcome.”

The Author · writes every review on this site · day job: staff engineer

RSS feed → Read my story →

How I stay honest

No sponsored rankings. Ever.

I pay for my own subscriptions

Every tool on this site is accessed through a paid personal account. No free vendor seats, no press access, no “evaluation licenses” that come with an expectation.

The rank is set before affiliate links

Some links earn a commission, all of them disclosed. The ranking is finalized from the benchmark results before I check which programs I’m enrolled in.

Re-ranked every month

New models drop constantly. I re-run the full suite monthly. When a tool moves up or down, the commit history shows exactly when and why.

One email. Every Tuesday. Numbers, not hype.

A summary of the week’s re-run benchmarks, any score changes, and one short review. Four-minute read. No “sponsored pick.” Unsubscribe with one click.

24,300 subscribers · 58% open rate · 0 sponsored sends

The six I actually reach for.

Latest articles.

Find the right comparison.

AI Coding Assistants

Coding Tools

AI Code Review

AI Testing Tools

Writing Tools

Design Tools

Productivity

Security & Privacy
