Start here
The big picture 🗺️
Every AI here is trying to fix real bugs in real open-source projects. Two things matter: how many bugs it fixes (capability) and how much that costs (price).
Most leaderboards only show the first. This page shows both, and combines them into one number — the Value Score — so you can see who gives the most fixing-power per dollar.
The test
What's a "resolve"? 🐛→✅
SWE-bench gives the AI a real GitHub issue (a bug report) and the project's code. The AI writes a patch. The issue counts as "resolved" only if, after the patch is applied, the project's hidden tests pass.
Read the bug ticket + the code.
Write a code change (a patch).
Run the project's tests. All green → resolved ✅. Any red → not resolved.
SWE-bench Lite = 300 bugs · Verified = 500. The score is "% of bugs resolved".
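The rule above fits in a few lines of Python. This is only a sketch, not the official SWE-bench harness: the function names and the input shape (a list of pass/fail booleans per bug) are assumptions made for illustration, and the four bugs are made up.

```python
from typing import Dict, List


def is_resolved(test_outcomes: List[bool]) -> bool:
    """A bug is resolved only if every test passed after the patch.

    An empty outcome list counts as not resolved (an assumption made here).
    """
    return len(test_outcomes) > 0 and all(test_outcomes)


def resolve_percent(results: Dict[str, List[bool]]) -> float:
    """Share of bugs resolved, as a percentage of all bugs attempted."""
    if not results:
        return 0.0
    resolved = sum(is_resolved(outcomes) for outcomes in results.values())
    return 100.0 * resolved / len(results)


# Hypothetical outcomes for four bugs (True = test passed).
example = {
    "bug-1": [True, True, True],   # all green -> resolved
    "bug-2": [True, False, True],  # one red -> not resolved
    "bug-3": [True],               # resolved
    "bug-4": [False],              # not resolved
}
print(resolve_percent(example))  # 50.0
```

Two of the four made-up bugs are resolved, so the score is 50%.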
Axis 1
Capability = Resolve % 📊
The share of bugs the AI actually fixed. Higher is better. A model at 50% fixed half the bugs it was given.
But capability alone hides the catch: a model that resolves 79% might cost 500× more to run than a cheaper one. That's why we need price.
Axis 2
Price = Cost per instance 💵
What it costs to have the AI attempt one bug, which is mostly the price of the tokens the model reads and writes. Big "frontier" models charge a lot per token and use huge contexts; small/cheap models cost a few cents or less.
Between the two, that's a 400–600× difference for one bug. It compounds fast, as the next page shows.
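Since the cost is mostly tokens, a rough per-bug estimate is just token counts times token prices. The sketch below assumes the common "dollars per million tokens" pricing style; the function name, the token counts, and the prices are all hypothetical and do not describe any system on this page.

```python
def cost_per_instance(
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimated dollars to attempt one bug, from token usage alone."""
    total = (
        input_tokens * usd_per_million_input
        + output_tokens * usd_per_million_output
    )
    return total / 1_000_000


# Hypothetical attempt: 500k tokens read, 20k tokens written,
# at made-up prices of $3 and $15 per million tokens.
example_cost = cost_per_instance(500_000, 20_000, 3.0, 15.0)
print(f"${example_cost:.2f} per bug")  # $1.80 per bug
```

Tokens read usually dominate the bill, which is why huge contexts get expensive quickly.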
Why price really matters
The real bill = Run total 🧾
You rarely fix one bug. To run the whole benchmark you pay cost × number-of-bugs.
Darwin Lite: $0.005 × 300 = $1.50
Frontier Lite: $2.50 × 300 = $750
Same benchmark. One costs a coffee; the other costs a week of groceries — for a similar number of fixes.
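The same arithmetic as a small Python sketch, using the per-bug costs and benchmark size quoted above. The function name is an assumption for illustration.

```python
def run_total(cost_per_bug_usd: float, num_bugs: int) -> float:
    """Total bill for attempting every bug in a benchmark once."""
    return cost_per_bug_usd * num_bugs


LITE_BUGS = 300  # SWE-bench Lite size, from the text above

darwin_total = run_total(0.005, LITE_BUGS)
frontier_total = run_total(2.50, LITE_BUGS)

print(f"Darwin Lite:   ${darwin_total:.2f}")    # $1.50
print(f"Frontier Lite: ${frontier_total:.2f}")  # $750.00
print(f"Ratio: {frontier_total / darwin_total:.0f}x")  # 500x
```

The ratio between the two bills equals the ratio between the per-bug costs; running more bugs grows the gap in dollars.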
The calculus
The Value Score ✦
To rank by both at once, we turn each axis into a 0–100 number and blend them:
Capability = the resolve % directly (0–100).
Cheapness = price on a log scale, mapped so $5/bug → 0 and $0.005/bug → 100. (Log, because going $2→$1 matters as much as $0.02→$0.01.)
Blend them with a weight w you choose:
The slider above the table is w. Drag it toward capability and frontier models rise; drag toward cost and the cheap-but-capable systems win. At the middle (50/50), you're asking "who's the best all-rounder per dollar?"
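Here is one way the score could be computed, as a hedged sketch. The log mapping follows the two anchor points given above ($5/bug → 0, $0.005/bug → 100). Two things are assumptions, not taken from the page: the blend is a simple weighted average, Value = w × Capability + (1 − w) × Cheapness, and prices outside the $0.005–$5 range are clamped to 0–100. The function names are illustrative.

```python
import math

WORST_PRICE = 5.0    # $/bug that maps to cheapness 0 (from the text)
BEST_PRICE = 0.005   # $/bug that maps to cheapness 100 (from the text)


def cheapness(price_usd: float) -> float:
    """Map a per-bug price onto 0-100 using a log scale."""
    span = math.log10(WORST_PRICE / BEST_PRICE)  # 3 factors of ten
    score = 100.0 * math.log10(WORST_PRICE / price_usd) / span
    return max(0.0, min(100.0, score))  # clamping is an assumption


def value_score(resolve_pct: float, price_usd: float, w: float) -> float:
    """Weighted blend of capability and cheapness.

    w = 1 ranks by capability only, w = 0 by cheapness only.
    A linear blend is assumed here.
    """
    return w * resolve_pct + (1.0 - w) * cheapness(price_usd)


# Halving the price adds the same number of points anywhere on the scale.
print(round(cheapness(1.0) - cheapness(2.0), 2))
print(round(cheapness(0.01) - cheapness(0.02), 2))
```

Both print statements give the same number, which is the point of the log scale: $2→$1 counts as much as $0.02→$0.01.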
The picture
Reading the chart 📈
Each dot is one system. Left = cheaper, up = more capable. So the top-left is the dream (cheap and good).
The dashed green line is the cost-Pareto frontier: the highest resolve % achievable at or below each price. No system beats one sitting on the line without costing more. Darwin anchors the cheap end.
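A frontier like this can be found by sorting systems from cheapest to priciest and keeping each one that beats every cheaper system on resolve %. The sketch below is an illustration, not the page's actual chart code: the function name and the five systems (A to E) are hypothetical.

```python
from typing import List, Tuple

# (name, cost per bug in USD, resolve %)
System = Tuple[str, float, float]


def pareto_frontier(systems: List[System]) -> List[System]:
    """Keep systems that no cheaper-or-equal system matches or beats."""
    # Cheapest first; at equal cost, the higher resolve % comes first.
    ordered = sorted(systems, key=lambda s: (s[1], -s[2]))
    frontier: List[System] = []
    best_resolve = float("-inf")
    for system in ordered:
        # Paying more only earns a spot if it buys strictly more fixes.
        if system[2] > best_resolve:
            frontier.append(system)
            best_resolve = system[2]
    return frontier


# Made-up systems for illustration only.
systems = [
    ("A", 0.005, 40.0),
    ("B", 0.05, 35.0),   # costs more than A, fixes less -> off the line
    ("C", 0.50, 55.0),
    ("D", 2.50, 50.0),   # costs more than C, fixes less -> off the line
    ("E", 3.00, 70.0),
]
print([name for name, _, _ in pareto_frontier(systems)])  # ['A', 'C', 'E']
```

In this example, B and D sit below the line because a cheaper system already fixes more bugs.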
Trust
Honesty & proof 🔍
We separate what's measured from what's estimated:
Resolve % figures (all systems) are real: they come from the official SWE-bench results.
Competitor costs are estimated from each system's disclosed model and public token prices (marked est). Where no model is public, the cost is shown as "Undisclosed".
Darwin's numbers are measured: resolve via the official harness, cost from real API spend. Darwin is also conformant: it never sees the hidden evaluation tests while solving.
Sources: SWE-bench experiments repo · swebench.com. All linked at the bottom of the page.