Date: 2026-03-11

A new study by the research organization METR suggests that the popular coding benchmark SWE-bench Verified significantly overestimates the real-world performance of AI agents. Roughly half of the solutions rated as "passed" would get rejected by actual project maintainers.

SWE-bench Verified is (or was) one of the most important benchmarks for AI-assisted software engineering. It measures whether AI agents can solve real programming problems from open-source projects, with an automated tester checking whether submitted code changes pass the associated tests. Companies like Anthropic and OpenAI regularly cite the results to show off their models' progress.

An investigation by METR now raises serious questions about the benchmark's validity. The research team—Parker Whitfill, Cheryl Wu, Joel Becker, and Nate Rush—had four experienced developers who actively maintain three SWE-bench projects (scikit-learn, Sphinx, and pytest) review a total of 296 AI-generated code contributions. About half of the solutions that passed the automated tester would never get merged into the actual codebase by the project maintainers.

Automated tests only capture part of what makes code good

The study used AI-generated solutions from five models: Claude 3.5 Sonnet (Old), Claude 3.7 Sonnet, Claude 4 Opus, Claude 4.5 Sonnet, and GPT-5. The test runs came from Epoch AI's Benchmarking Hub. The maintainers didn't know whether a given solution came from a human or an AI.

To account for noise in human decision-making, the researchers also had the maintainers evaluate the original human-written solutions that had actually been merged into the projects. The maintainers approved only 68 percent of these reference solutions on second review. All results were normalized to this baseline.
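As a rough illustration of what normalizing to the 68 percent baseline could look like, the sketch below divides a raw maintainer acceptance rate by the acceptance rate of the human reference solutions. The division itself is an assumption (the article does not spell out METR's exact procedure), and the function name and the 0.34 example rate are made up for illustration.

```python
# Hypothetical illustration only: METR's exact normalization is not described here.
HUMAN_BASELINE = 0.68  # share of human reference solutions the maintainers re-approved


def normalize_to_baseline(raw_accept_rate: float, baseline: float = HUMAN_BASELINE) -> float:
    """Rescale a raw acceptance rate so that the human reference solutions score 1.0."""
    if not 0.0 < baseline <= 1.0:
        raise ValueError("baseline must be in (0, 1]")
    if not 0.0 <= raw_accept_rate <= 1.0:
        raise ValueError("raw_accept_rate must be in [0, 1]")
    return raw_accept_rate / baseline


# Made-up example: maintainers accept 34% of a model's patches.
example_raw_rate = 0.34
print(f"normalized acceptance: {normalize_to_baseline(example_raw_rate):.2f}")
```

Under this assumed scheme, a model whose patches are accepted exactly as often as the human reference solutions would score 1.0, so the noise in maintainer judgment is not counted against the AI agents.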

On average, the maintainer adoption rate lands about 24 percentage points below the SWE-bench score. METR says this difference is statistically significant. The rate of improvement over time also comes in about 9.6 percentage points lower per year when measured by human assessments, though the researchers themselves note this trend result is statistically weaker.

Many rejections come down to basic functional errors, not just style

The maintainers sorted their rejections into three buckets: poor code quality (bad style or failure to follow project standards), damage to existing code, and basic functional errors. According to the METR report, "a meaningful chunk" of rejections came down to basic functional errors, meaning the AI agents didn't actually fix the underlying problem, even though the automated tests passed.

The jump from Claude 3.5 Sonnet to Claude 3.7 Sonnet produced significantly higher pass rates, but it also led to more cases where maintainers flagged basic functional errors. From Claude 3.7 to Claude 4 Opus, the issues shifted from "test failed" to "just poor code quality." Claude 4.5 Sonnet mainly improved code quality. GPT-5 performed significantly worse than the Anthropic models here, according to the study.

Time horizon analysis reveals a sevenfold overestimation

The researchers also ran a time horizon analysis that converts benchmark scores into the human completion time at which a model hits a 50 percent success rate. The gap is stark: Claude 4.5 Sonnet reaches a time horizon of about 50 minutes according to the automated checker, but only about 8 minutes when scored by maintainers. The study notes these estimates are less stable because SWE-bench Verified has a relatively narrow task-duration range, requiring extrapolation to estimate the 50% threshold.
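To make the conversion concrete, here is a minimal sketch of how a 50 percent time horizon can be read off a fitted curve. It assumes a logistic model of success probability against the base-2 logarithm of human completion time, which is an assumption about the method rather than something the article states. The coefficients are hypothetical and chosen only so that the two horizons reproduce the 50-minute and 8-minute figures.

```python
import math


def success_probability(minutes: float, intercept: float, slope: float) -> float:
    """Assumed logistic model: P(success) as a function of log2(human completion time)."""
    logit = intercept + slope * math.log2(minutes)
    return 1.0 / (1.0 + math.exp(-logit))


def time_horizon_50(intercept: float, slope: float) -> float:
    """Task length in minutes at which the modeled success rate is 50% (logit = 0)."""
    if slope >= 0:
        raise ValueError("slope must be negative: longer tasks should be harder")
    return 2.0 ** (-intercept / slope)


# Hypothetical coefficients, picked only to reproduce the article's two horizons.
slope = -1.0
checker_intercept = math.log2(50)     # automated checker: about 50 minutes
maintainer_intercept = math.log2(8)   # maintainer review: about 8 minutes

checker_horizon = time_horizon_50(checker_intercept, slope)
maintainer_horizon = time_horizon_50(maintainer_intercept, slope)
print(round(checker_horizon), round(maintainer_horizon))
print(round(checker_horizon / maintainer_horizon, 2))
```

Because the horizon is the point where the fitted curve crosses 50 percent, it has to be extrapolated whenever that crossing lies outside the range of task durations in the benchmark, which is why the study treats these estimates as less stable. With the rounded figures quoted above, the ratio between the two horizons is 6.25.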

What METR describes here is something the field already knows: benchmarks are, at best, a proxy for real-world AI performance, and practical feedback matters more. That feedback is mixed: plenty of developers say AI agents genuinely help with coding, while others flag serious concerns about the quality of AI-generated code. The AI coding hype of the last four months or so also centers on newer models like GPT-5.2-Thinking and Claude Opus 4.5. These models weren't part of the test; the study only looked at weaker predecessors.

That said, if the results hold up, the gap is still remarkably large. SWE-bench Verified is considered the gold standard for evaluating AI coding agents. If half of the solutions it marks as "passed" can't survive a real human review, that undermines individual model comparisons and chips away at the foundation on which investors, companies, and developers are building their expectations for AI-assisted software development.

Important findings, but with clear limitations

The researchers are upfront that the study doesn't prove a fundamental capability ceiling for these models. They say it's "plausible" that better prompting strategies and more targeted instructions could narrow the gap between automated evaluation and human judgment.

The comparison also wasn't apples to apples. Human developers can typically respond to feedback and revise their code, but the AI agents only got one shot. The review setup wasn't fully realistic either: the maintainers didn't have automated testing tools available, the problems came from older project states, and testing requirements were deliberately relaxed to give the AI agents a better chance.

So the study isn't about the maximum capability of these models. What it does show is that taking benchmark scores at face value overestimates how useful AI agents actually are without additional tweaks and human feedback. The researchers suspect similar distortions show up in other benchmarks too.

Co-author Joel Becker, a researcher at METR, also puts his own results in perspective on X. His main takeaway isn't that AI agents are fundamentally useless. With AI capabilities doubling every three to six months, even two- to tenfold performance gaps get closed fast.
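A quick back-of-the-envelope check of that claim, assuming steady exponential growth with a fixed doubling time (a simplification, and the function name is purely illustrative): closing a gap of factor g takes log2(g) doublings.

```python
import math


def months_to_close(gap_factor: float, doubling_months: float) -> float:
    """Months until capability grows by gap_factor, assuming a constant doubling time."""
    if gap_factor < 1:
        raise ValueError("gap_factor must be at least 1")
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return doubling_months * math.log2(gap_factor)


# Doubling times of 3 and 6 months, applied to twofold and tenfold gaps.
for gap in (2, 10):
    fast = months_to_close(gap, 3)
    slow = months_to_close(gap, 6)
    print(f"{gap}x gap: {fast:.1f} to {slow:.1f} months")
```

Under that assumption, a twofold gap closes in about 3 to 6 months and a tenfold gap in roughly 10 to 20 months.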