SWE-bench Lite · 300 instances

The Cost–Performance Pareto Frontier

The cheapest system that reaches each level of capability — ranked by resolve-per-dollar, not raw score. One tunable Value Score blends capability and price; drag the slider to set your cost-vs-capability tradeoff. Benchmarks are tabs below. New here? Hit “How it works”.
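The page does not spell out how the Value Score is computed, so the following is only a sketch of one way a slider-weighted blend of capability and price could work. The function name `value_score`, the linear blend, the log-scaled cost term, and the two example systems are all assumptions made for illustration, not the leaderboard's actual formula or data.

```python
import math

def value_score(resolve_pct: float, cost_per_instance: float,
                weight: float, min_cost: float, max_cost: float) -> float:
    """Hypothetical blend: weight=0 ranks purely on cost, weight=1 purely on resolve %."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    capability = resolve_pct / 100.0
    # Costs span several orders of magnitude, so compare them on a log scale.
    span = math.log10(max_cost) - math.log10(min_cost)
    if span == 0:
        cheapness = 1.0
    else:
        cheapness = (math.log10(max_cost) - math.log10(cost_per_instance)) / span
    return weight * capability + (1.0 - weight) * cheapness

# Made-up systems (resolve %, $ per instance); not leaderboard data.
systems = {"cheap": (40.0, 0.01), "pricey": (60.0, 20.0)}
costs = [cost for _, cost in systems.values()]

def ranking(weight: float) -> list[str]:
    """Sort system names by the hypothetical Value Score, best first."""
    return sorted(
        systems,
        key=lambda name: value_score(*systems[name], weight, min(costs), max(costs)),
        reverse=True,
    )

print(ranking(0.0))  # slider fully toward cost
print(ranking(1.0))  # slider fully toward capability
```

Under this assumed blend, moving the weight from 0 to 1 flips the order of the two example systems, which is the behavior the slider describes.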

● real leaderboard scores ● official harness · Wilson 95% CI ● competitor costs: estimated from disclosed models
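For readers unfamiliar with the Wilson interval named above, here is a minimal sketch of the standard Wilson score interval for a resolve rate. The function name and the 150-of-300 count are illustrative assumptions; the page's own implementation may differ in detail.

```python
import math

def wilson_interval(resolved: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = resolved / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical count for illustration only: 150 of 300 instances resolved.
low, high = wilson_interval(150, 300)
print(f"resolve rate 95% CI: {low:.3f} to {high:.3f}")
```

With only 300 instances, the interval spans several percentage points on either side of the point estimate, which is why small differences in resolve % between systems should be read with care.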

The real story is the price, not the score

Darwin costs ~$0.005–$0.015 per instance. A full sweep of all 300 SWE-bench Lite problems costs roughly $1.50 in model spend for a single conformant trajectory, and ~$4.50 for the Best-of-3 + judge pipeline (measured OpenRouter spend).
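The sweep totals follow directly from the per-instance figures. The sketch below assumes the low end of the range ($0.005) corresponds to the single trajectory and the high end ($0.015) to the Best-of-3 + judge pipeline, which is what the quoted totals imply; the function name `run_total` is illustrative.

```python
INSTANCES = 300  # SWE-bench Lite size, from the text

def run_total(cost_per_instance: float, instances: int = INSTANCES) -> float:
    """Model spend for one full sweep at a given per-instance cost."""
    return cost_per_instance * instances

# Assumed mapping of the quoted range onto the two pipelines.
single = run_total(0.005)
best_of_3 = run_total(0.015)

print(f"single trajectory: ${single:.2f}")
print(f"Best-of-3 + judge: ${best_of_3:.2f}")
```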

The systems above Darwin run an estimated $0.10–$2.50+ per instance; the heaviest multi-agent frontier scaffolds are reported at $15–$30 per instance, which comes to $4,500–$9,000 to evaluate the same 300 problems once. That is an efficiency gap of two to three orders of magnitude at comparable resolve, and it holds regardless of minor percentage swings, because the lever here is the harness (a stateful interactive loop gated by the repo's own tests), not brute-force MoE compute. Cost is the moat; resolve % is the table stakes. (Darwin costs are measured; competitor costs are estimated from disclosed models. See notes below.)
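To see how wide the gap is for each tier, the per-instance ranges quoted above can be compared on a log scale. This is a rough sketch using only the figures in the text; the tier labels and the helper name `orders_of_magnitude` are illustrative.

```python
import math

def orders_of_magnitude(expensive: float, cheap: float) -> float:
    """Base-10 log of the cost ratio between two per-instance prices."""
    return math.log10(expensive / cheap)

# Per-instance cost ranges (low, high) in dollars, as quoted in the text.
darwin = (0.005, 0.015)
tiers = {
    "estimated competitors": (0.10, 2.50),
    "heaviest scaffolds": (15.0, 30.0),
}

for name, (low, high) in tiers.items():
    narrowest = orders_of_magnitude(low, darwin[1])   # their cheapest vs. Darwin's priciest
    widest = orders_of_magnitude(high, darwin[0])     # their priciest vs. Darwin's cheapest
    print(f"{name}: {narrowest:.1f} to {widest:.1f} orders of magnitude")
```

The spread depends on which ends of the ranges are compared, so the printed bounds bracket the headline figure and do not reduce to a single number.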

The frontier

Chart legend: Darwin (measured, conformant) · Darwin (pilot, full-300 pending) · official leaderboard (real %, estimated cost) · cost-Pareto frontier

Leaderboard — sorted by Value Score

cost ◀ priorities ▶ capability

Value	System	Scaffold	Resolve %	$ / inst	Run total	Resolve / $	Evidence

The Cost–Performance Pareto Frontier

Most leaderboards rank by score alone. This one ranks by resolve-per-dollar — the cheapest harness that reaches each level of capability.
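The page does not define the resolve-per-dollar metric precisely. One plausible reading, sketched here as an assumption, is resolved instances per dollar of model spend, which reduces to the resolve fraction divided by the per-instance cost and does not depend on benchmark size. The function name and the example values are hypothetical.

```python
def resolves_per_dollar(resolve_pct: float, cost_per_instance: float) -> float:
    """Resolved instances bought per dollar of model spend (assumed definition)."""
    if cost_per_instance <= 0:
        raise ValueError("cost_per_instance must be positive")
    return (resolve_pct / 100.0) / cost_per_instance

# Hypothetical systems, not leaderboard data: (resolve %, $ per instance).
cheap_system = resolves_per_dollar(40.0, 0.01)
heavy_system = resolves_per_dollar(60.0, 20.0)

print(f"cheap system: {cheap_system:.2f} resolves per dollar")
print(f"heavy system: {heavy_system:.2f} resolves per dollar")
```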

Each dot is a system; the green line is the frontier: no system is both cheaper and more capable than a point on it. Drag the Value Score slider to set your own cost-vs-capability tradeoff.
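A cost-Pareto frontier like the one drawn in the chart can be computed with a single sorted pass. The sketch below is a generic implementation under stated assumptions: the `System` type, the function name, and the five data points are made up for illustration and are not leaderboard entries.

```python
from typing import NamedTuple

class System(NamedTuple):
    name: str
    cost: float     # dollars per instance
    resolve: float  # resolve %

def pareto_frontier(systems: list[System]) -> list[System]:
    """Return systems that no other system beats on both cost and resolve %."""
    frontier = []
    best_resolve = float("-inf")
    # Walk from cheapest to priciest; on cost ties, consider the higher resolve first.
    for s in sorted(systems, key=lambda s: (s.cost, -s.resolve)):
        # A pricier system earns its place only by resolving strictly more.
        if s.resolve > best_resolve:
            frontier.append(s)
            best_resolve = s.resolve
    return frontier

# Made-up points for illustration only.
points = [
    System("A", 0.01, 40.0),
    System("B", 0.50, 35.0),   # costs more than A and resolves less
    System("C", 1.00, 55.0),
    System("D", 20.0, 50.0),   # costs more than C and resolves less
    System("E", 25.0, 60.0),
]

for s in pareto_frontier(points):
    print(f"{s.name}: ${s.cost}/instance, {s.resolve}% resolve")
```

Every system left off the returned list is dominated by some cheaper system with at least the same resolve %, which matches the role of the dots sitting off the green line.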


Start here · The big picture 🗺️
The test · What's a "resolve"? 🐛→✅
Axis 1 · Capability = Resolve % 📊
Axis 2 · Price = Cost per instance 💵
Why price really matters · The real bill = Run total 🧾
The calculus · The Value Score ✦
The picture · Reading the chart 📈
Trust · Honesty & proof 🔍

🧮 Cost / performance calculator