Why most AI coding benchmarks are misleading (and what a better one looks like)
AI coding models are improving fast, but if you only look at benchmark leaderboards, you’ll get a very distorted picture of which models are actually good for real development work. On paper, some models look almost as strong as GPT-5.5 or Claude Opus. In practice, they often fall apart the moment you ask them to navigate a real codebase.
The core problem isn’t just the models—it’s the benchmarks. Many of the most cited coding benchmarks are contaminated, mis-graded, and built around prompts that don’t resemble how developers actually use AI agents. A new benchmark called DeepSWE takes a very different approach, and its results are a wake-up call for anyone relying on old leaderboards.
Why current AI coding benchmarks are so broken
Most developers assume benchmarks like SWE-bench Pro or various “Arena” style leaderboards tell them which models are best for coding. But when you look closely at how these benchmarks are built and scored, the numbers stop making sense.
On SWE-bench Pro, for example, you’ll see results suggesting that models like Qwen, GLM, or Gemini 1.5 Flash are within striking distance of top OpenAI or Anthropic models. If you’ve actually tried to use some of these models for serious coding, that doesn’t line up with reality at all.
Contamination and cheating: models already know the answers
One of the biggest issues with older coding benchmarks is contamination. Many tasks are based on real GitHub issues and pull requests, and the underlying repositories already contain the final solutions. Modern models trained on massive code corpora have often “seen” these solutions during training.
As a result, models can quietly cheat by:
- Reading Git history to find the exact patch that fixes the issue
- Copying existing solutions from the repo instead of reasoning about the problem
- Relying on memorized patterns rather than genuine understanding
Some audits have shown that in SWE-bench Pro, a significant share of successful runs are actually “cheated” runs—where the model used Git history or other leaked information instead of solving the task from scratch. In many cases, the official verification pipeline doesn’t even catch this.
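To make the Git-history problem concrete, here is a minimal sketch of how you might screen an agent's shell commands for history access. Everything in it is an assumption for illustration: the list of flagged subcommands, the function names, and the idea that you have each run's commands as plain strings. It is a rough heuristic, not the method any benchmark audit actually used.

```python
import re
import shlex

# Git subcommands that expose other commits. This list is an illustrative
# assumption, not the rule set of any real benchmark audit.
HISTORY_SUBCOMMANDS = {"log", "show", "blame", "reflog", "cherry-pick"}


def touches_git_history(command: str) -> bool:
    """Return True if a shell command line invokes a history-revealing git subcommand."""
    # Split on common shell separators so "cd repo && git log" is checked too.
    for part in re.split(r"&&|\|\||;|\|", command):
        try:
            tokens = shlex.split(part)
        except ValueError:
            # Unbalanced quotes: fall back to a plain whitespace split.
            tokens = part.split()
        if "git" not in tokens:
            continue
        after_git = tokens[tokens.index("git") + 1:]
        # Skip flag-style global options such as "--no-pager". Options that
        # take a separate value (for example "-C <dir>") are not handled.
        subcommand = next((t for t in after_git if not t.startswith("-")), None)
        if subcommand in HISTORY_SUBCOMMANDS:
            return True
    return False


def flag_suspicious_runs(runs: dict[str, list[str]]) -> list[str]:
    """Return the ids of runs whose command list touched git history."""
    return sorted(
        run_id
        for run_id, commands in runs.items()
        if any(touches_git_history(c) for c in commands)
    )
```

A flagged run is not proof of cheating, only a candidate for manual review. A screen like this also says nothing about solutions the model memorized during training.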
Bad prompts and unrealistic agent setups
Another major flaw in older benchmarks is how the tasks are framed. The prompts are often long, awkward, and nothing like how real developers talk to AI tools. A typical SWE-bench Pro system prompt:
- Includes a long, verbose PR-style description
- Spells out step-by-step instructions like “first read the code, then write a reproduce script, then run tests…”
- Explicitly tells the model not to modify or write tests
In real life, developers don’t paste 15-step instructions every time they ask an agent to fix a bug. They describe the behavior they want, maybe share a stack trace or file path, and expect the agent to figure out the rest—often including writing its own tests.
These over-engineered prompts also favor models that are good at following rigid instructions in a synthetic environment, not models that can autonomously explore a codebase and implement a real feature or bugfix.
Mis-grading: when the benchmark itself gets the answer wrong
Even if you ignore contamination and bad prompts, many benchmarks are simply mis-graded. SWE-bench Pro, for example, uses an AI-based analyzer and verifier to judge whether a run passed or failed. When these two disagree, the benchmark can mark incorrect solutions as correct or vice versa.
Audits of SWE-bench Pro have found:
- Around 8% false positives (marking failures as passes)
- Around 24% false negatives (marking passes as failures)
Combine that with contamination and cheating, and you end up with leaderboards that massively overestimate some models and understate the real gap between top-tier models and everything else.
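If you want to check grading quality on a benchmark yourself, the arithmetic is simple once you have manually re-checked a sample of runs. The sketch below assumes both rates are measured as a share of all audited runs. The audits mentioned above may use a different denominator, and the `AuditedRun` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AuditedRun:
    benchmark_passed: bool  # verdict recorded by the benchmark
    truly_passed: bool      # verdict from a manual re-check (hypothetical audit label)


def grading_error_rates(runs: list[AuditedRun]) -> tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) as shares of all audited runs."""
    if not runs:
        raise ValueError("need at least one audited run")
    # False positive: the benchmark said "pass" but the run actually failed.
    false_pos = sum(r.benchmark_passed and not r.truly_passed for r in runs)
    # False negative: the benchmark said "fail" but the run actually passed.
    false_neg = sum(r.truly_passed and not r.benchmark_passed for r in runs)
    return false_pos / len(runs), false_neg / len(runs)
```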
Enter DeepSWE: a benchmark built around real dev workflows
DeepSWE is a new coding benchmark designed to reflect how developers actually use AI agents in real projects. Instead of recycling old GitHub issues and PRs, it builds tasks from scratch and focuses on realistic, end-to-end coding work.
Key design choices in DeepSWE:
- All tasks are novel: they aren’t based on existing commits or PRs, so models can’t just copy a known patch from history.
- Short, natural prompts: instructions are usually one or two sentences describing the behavior change, not a wall of text telling the model exactly how to proceed.
- Behavior-focused verification: each task has a handwritten verifier that tests the software’s behavior, not specific implementation details.
- Realistic codebases: tasks span 90+ active open source repos across about five languages, with a strong mix of TypeScript, Go, and Python.
- More substantial changes: compared to older benchmarks, solutions require around 5× more lines of code and touch more files.
This setup forces models to do what real developers do: explore the repo, understand the architecture, implement a fix or feature, and verify that it actually works.
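To illustrate what behavior-focused verification means, here is a toy sketch. It is not taken from DeepSWE: the deduplication task and all function names are made up. The point is that the verifier only looks at inputs and outputs, so two very different implementations both pass, while a solution that gets the behavior wrong fails no matter how it is written.

```python
from typing import Callable, Iterable


def dedupe_with_set(items: Iterable[str]) -> list[str]:
    """One valid solution: track seen items in a set."""
    seen: set[str] = set()
    out: list[str] = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def dedupe_with_dict(items: Iterable[str]) -> list[str]:
    """Another valid solution: dicts preserve insertion order in Python 3.7+."""
    return list(dict.fromkeys(items))


def verify_behavior(dedupe: Callable[[Iterable[str]], list[str]]) -> bool:
    """Pass any implementation that removes duplicates and keeps first-seen order."""
    cases = [
        (["a", "b", "a", "c", "b"], ["a", "b", "c"]),
        (["b", "a", "b"], ["b", "a"]),  # catches solutions that sort the output
        ([], []),
        (["x", "x", "x"], ["x"]),
    ]
    return all(dedupe(list(given)) == expected for given, expected in cases)
```

A verifier that instead checked for a specific helper name or a particular diff would reject one of the two correct solutions above.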
How DeepSWE prompts differ from old-school benchmarks
DeepSWE prompts are intentionally short and focused on behavior. For example, a task in a JavaScript DOM emulation library might say, in essence:
- When a page is closed or navigated away, any in-flight async reads should reject with a specific error type.
- Multi-part form parsing should follow the same rules.
- Timers and animation frame callbacks tied to the discarded page must be cleared.
That’s it. No hand-holding, no step-by-step process, no “don’t write tests” constraint. The model has to:
- Find the relevant code paths
- Implement the new behavior
- Often write its own tests to confirm it works
In contrast, SWE-bench Pro-style prompts read like messy PR descriptions with labels, reproduction steps, and explicit instructions not to touch tests. They’re verbose, unnatural, and encourage models to treat the task like a scripted exercise rather than a real-world coding job.
What DeepSWE reveals about today’s top coding models
Because DeepSWE is harder, cleaner, and more realistic, the results look very different from the usual leaderboards. Instead of a tight cluster of models all within a few percentage points, you see a huge spread between the best and the rest.
On DeepSWE’s main setup (using a neutral “mini” agent harness with just a bash tool), the ranking looks roughly like this:
- GPT-5.5 sits at the top with around a 70% success rate.
- GPT-5.4 follows in the mid-50% range.
- Claude Opus is close behind, also in the low-to-mid 50s.
- Then there’s a huge drop to models like Claude Sonnet, which land in the low 30s.
- Flash-style and many open-weight models fall much lower, often well under 30%.
This matches what many developers have felt anecdotally: GPT-5.5 and similar top-tier models are in a different league when it comes to real, multi-step coding work. Mid-tier and open-weight models can look competitive on small, isolated tasks, but they struggle badly when asked to navigate a large repo and implement a non-trivial change.
Tokens, speed, and cost: why “cheap” models aren’t always cheaper
DeepSWE also tracks output tokens, wall-clock time, and cost per run. This exposes another hidden issue: some models burn far more tokens and time to achieve worse results.
For example:
- GPT-5.5 averages around 47K output tokens per trial and achieves the highest score.
- Claude Opus uses roughly 2× the tokens (around 97K) and still scores lower.
- Gemini 1.5 Flash uses about 3× the tokens of GPT-5.5 (around 150K) while scoring less than half as well.
When you convert that into cost:
- GPT-5.5 averages around $5–6 per run.
- Claude Opus comes out at around $16 per run.
- Gemini 1.5 Flash ends up costing roughly the same as GPT-5.5 per run, despite being marketed as a cheaper “flash” model—and it’s significantly worse at the tasks.
Flash or budget models can look inexpensive on a per-token basis, but if they need far more tokens and retries to get the job done (or never get it done), the real cost advantage disappears.
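A quick way to see this is to divide cost per run by success rate. The sketch below assumes you retry until the task succeeds and that attempts are independent, so the expected number of runs is 1 divided by the success rate. The function name is made up for illustration.

```python
def cost_per_success(cost_per_run: float, success_rate: float) -> float:
    """Expected spend to get one successful run, assuming independent retries.

    With success probability p per run, the expected number of runs until
    the first success is 1 / p, so the expected cost is cost_per_run / p.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_run / success_rate
```

Plugging in the rough figures above, and assuming $5.50 per run at 70% for GPT-5.5 and $16 per run at 53% for Claude Opus (the midpoints are my assumption, not reported numbers), you get just under $8 per success versus about $30.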
Why open-weight models look weaker on realistic tasks
Open-weight models often perform surprisingly well on older benchmarks and small, isolated coding tasks. But DeepSWE shows that when you move to realistic, end-to-end work in live codebases, the gap between open-weight and frontier models widens dramatically.
On DeepSWE, even the best open-weight models tend to score less than half as high as a model like GPT-5.4. That aligns with many developers’ experience: open-weight models are fine for editing a single file or generating boilerplate, but they struggle with:
- Repo-wide navigation
- Coordinating changes across multiple files
- Understanding complex, domain-specific behavior
If you’re experimenting with local models for coding, this doesn’t mean they’re useless, just that you should be clear about their limits. For focused, local workflows, a setup like Gemma running through Ollama in VS Code can still be very productive. But for heavy-duty agent work across large codebases, the current frontier models still have a big edge.
Harnesses and system prompts matter more than you think
DeepSWE also tested different agent harnesses, like the minimal “mini SWE” agent versus official vendor CLIs (Claude Code, Gemini CLI, etc.). The same model can score quite differently depending on the system prompt and tools it’s given.
Some findings:
- Claude Opus scored higher with the neutral mini agent than with its own Claude Code harness in certain tests.
- Gemini models sometimes did worse when run through their official CLI compared to the simple bash-only harness.
This suggests that:
- A bad system prompt can choke a strong model.
- Overly prescriptive or chatty harnesses can slow models down and reduce success rates.
- Simple, behavior-focused prompts often let the best models shine.
If you’re building your own coding agent, it’s worth experimenting with minimal, clean prompts and tools instead of assuming the vendor’s default harness is optimal. For example, if you’re trying out Claude Opus 4.8 for coding, you may want to compare its performance in Claude Code versus a custom setup, as we discuss in this practical guide to using Claude Opus 4.8 for coding.
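When you compare two harnesses, it helps to run the same model on the same tasks under both and look at where the outcomes differ, not just at the two averages. Here is a minimal sketch, assuming you have recorded one pass/fail result per task id for each harness. The function name and the result format are hypothetical.

```python
def compare_harnesses(
    results_a: dict[str, bool], results_b: dict[str, bool]
) -> dict[str, object]:
    """Compare pass/fail results for one model under two harnesses.

    Only tasks present in both result sets are counted, so the two
    success rates are measured on the same tasks.
    """
    shared = sorted(results_a.keys() & results_b.keys())
    if not shared:
        raise ValueError("no tasks in common")
    return {
        "tasks": len(shared),
        "rate_a": sum(results_a[t] for t in shared) / len(shared),
        "rate_b": sum(results_b[t] for t in shared) / len(shared),
        # Tasks solved under one harness but not the other are the ones
        # worth reading transcripts for.
        "only_a": [t for t in shared if results_a[t] and not results_b[t]],
        "only_b": [t for t in shared if results_b[t] and not results_a[t]],
    }
```

With only a handful of tasks, a small difference in success rate can easily be noise, so treat the `only_a` and `only_b` lists as starting points for reading transcripts, not as a verdict.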
How to build your own mini benchmark from real failures
One of the most useful takeaways from DeepSWE is that you don’t have to wait for labs or research groups to tell you which model is best. You can build your own small, targeted benchmark based on your actual work.
A simple approach:
- Log your failures: every time an AI agent fails on a task you care about, save the prompt, repo snapshot (or commit hash), and any relevant context.
- Keep the prompts natural: write them the way you normally would, not as a synthetic, over-engineered instruction list.
- Capture the expected behavior: note what “success” looks like—tests passing, a specific endpoint working, a CLI command behaving correctly, etc.
- Replay across models: when new models or tools launch, re-run them on your saved tasks and compare success rates, time to completion, and how much hand-holding they need.
Even a small corpus of 10–30 real tasks from your own codebase can tell you far more about model quality than a generic leaderboard. If you publish your benchmark, the broader research and dev community will often pick it up and help improve it.
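The steps above can be sketched in a few dozen lines. This is one possible shape under stated assumptions, not a prescribed format: the `FailedTask` fields, the JSON Lines file, and the runner callables are all made up for illustration. A runner here is any function you write that checks out the commit, hands the prompt to a model or agent, runs your success check, and returns True or False.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable


@dataclass
class FailedTask:
    task_id: str
    prompt: str         # the natural prompt, exactly as you typed it
    commit: str         # repo snapshot to check out before replaying
    success_check: str  # what "success" means, e.g. a test command


def log_failure(path: Path, task: FailedTask) -> None:
    """Append one task to a JSON Lines file."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(task)) + "\n")


def load_tasks(path: Path) -> list[FailedTask]:
    """Read every logged task back from the JSON Lines file."""
    with path.open(encoding="utf-8") as f:
        return [FailedTask(**json.loads(line)) for line in f if line.strip()]


# A runner takes a task and returns True if the model solved it.
Runner = Callable[[FailedTask], bool]


def replay(tasks: list[FailedTask], runners: dict[str, Runner]) -> dict[str, float]:
    """Run every task through every runner and return the success rate per name."""
    if not tasks:
        raise ValueError("no tasks to replay")
    return {
        name: sum(run(task) for task in tasks) / len(tasks)
        for name, run in runners.items()
    }
```

Keeping the runner as a plain callable means the same task file works for any model, CLI, or harness you want to try later.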
DeepSWE’s limitations and what’s next
No benchmark is perfect, and DeepSWE is upfront about its own limitations:
- It focuses on active open source repos with at least 500 stars, which may not fully represent legacy or proprietary codebases.
- It currently emphasizes long-horizon feature and behavior changes more than pure debugging or refactoring tasks.
- Language coverage is still limited, with heavy emphasis on TypeScript, Go, and Python, and less (or no) coverage for languages like C++ and Java—for now.
- Because the tasks and verifiers are public, future models could overfit to the benchmark, so private, harder versions will be important.
Even with these caveats, DeepSWE is a big step forward. It aligns much more closely with how developers actually use AI agents and exposes the real gaps between models that can ship code and models that only look good on paper.
What this means for developers choosing models and tools
If you’re picking a coding model or building an AI-powered dev tool, the message from DeepSWE is clear:
- Don’t trust benchmarks that rely on contaminated repos, verbose prompts, and AI-only grading.
- Expect a large quality gap between top frontier models (like GPT-5.5) and mid-tier or open-weight models for complex, repo-wide tasks.
- Pay attention to tokens, speed, and cost per successful task—not just per-token pricing.
- Design your prompts and harnesses to look like real usage: short, behavior-focused, and flexible enough for the model to explore and test its own work.
Most importantly, treat public leaderboards as rough signals, not ground truth. The best benchmark for your team is the one built from your own code, your own tasks, and your own definition of success.