
Half of AI-written code that passes industry test would get rejected by real developers, new study finds

Image: Nano Banana Pro prompted by THE DECODER

Key Points

  • The research organization METR has analyzed the widely used coding benchmark SWE-bench Verified and concluded that it significantly overstates how well AI agents actually perform in real-world software development.
  • Four experienced open-source developers reviewed 296 AI-generated code contributions and found that roughly half of the solutions that passed automated tests would still be rejected from actual software projects.
  • Many of these rejections stem not from stylistic issues but from fundamental functional errors. The AI agents fail to fix the underlying problem, even when they manage to pass the automated test suite.

A new study by the research organization METR suggests that the popular coding benchmark SWE-bench Verified significantly overestimates the real-world performance of AI agents. Roughly half of the solutions rated as "passed" would get rejected by actual project maintainers.

SWE-bench Verified is (or was) one of the most important benchmarks for AI-assisted software engineering. It measures whether AI agents can solve real programming problems from open-source projects, with an automated tester checking whether submitted code changes pass the associated tests. Companies like Anthropic and OpenAI regularly cite the results to show off their models' progress.

An investigation by METR now raises serious questions about the benchmark's validity. The research team—Parker Whitfill, Cheryl Wu, Joel Becker, and Nate Rush—had four experienced developers who actively maintain three SWE-bench projects (scikit-learn, Sphinx, and pytest) review a total of 296 AI-generated code contributions. About half of the solutions that passed the automated tester would never get merged into the actual codebase by the project maintainers.

Automated tests only capture part of what makes code good

The study used AI-generated solutions from five models: Claude 3.5 Sonnet (Old), Claude 3.7 Sonnet, Claude 4 Opus, Claude 4.5 Sonnet, and GPT-5. The test runs came from Epoch AI's Benchmarking Hub. The maintainers didn't know whether a given solution came from a human or an AI.


To account for noise in human decision-making, the researchers also had the maintainers evaluate original human solutions that had actually shipped in the projects. Only 68 percent of these reference solutions got re-approved for adoption. All results were normalized to this baseline.
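The normalization idea can be illustrated with a toy calculation. The article gives the 68 percent human baseline but not METR's exact formula, so dividing a model's raw adoption rate by that baseline is an assumption made here for illustration:

```python
# Toy illustration of baseline normalization.
# Assumed formula: normalized rate = raw rate / human baseline.
HUMAN_BASELINE = 0.68  # fraction of shipped human solutions maintainers re-approved

def normalize(raw_adoption_rate: float, baseline: float = HUMAN_BASELINE) -> float:
    """Scale a raw maintainer-adoption rate so the human baseline maps to 1.0."""
    return raw_adoption_rate / baseline

# Hypothetical example: solutions adopted 34% of the time score 0.5
# relative to what human-authored code achieves under the same review.
print(round(normalize(0.34), 2))  # → 0.5
```

The point of the baseline is that even accepted human code fails re-review about a third of the time, so raw AI rejection rates would otherwise look worse than they are.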

On average, the maintainer adoption rate lands about 24 percentage points below the SWE-bench score. METR says this difference is statistically significant. The rate of improvement over time also comes in about 9.6 percentage points lower per year when measured by human assessments, though the researchers themselves note this trend result is statistically weaker.

Chart showing pass rates of five AI models over time, split by automated SWE-bench grader (orange) and maintainer adoption (blue). The orange line sits consistently well above the blue one, with an average gap of 24 percentage points.
Pass rates from the automated grader (orange) are consistently and significantly higher than actual acceptance rates by maintainers (blue) across all tested models. | Image: METR

Many rejections come down to basic functional errors, not just style

The maintainers sorted their rejections into three buckets: poor code quality (bad style or failure to follow project standards), damage to existing code, and basic functional errors. According to the METR report, "a meaningful chunk" of rejections came down to basic functional errors, meaning the AI agents didn't actually fix the underlying problem, even though the automated tests passed.

The jump from Claude 3.5 Sonnet to Claude 3.7 Sonnet produced significantly higher pass rates, but it also led to more cases where maintainers flagged basic functional errors. From Claude 3.7 to Claude 4 Opus, the issues shifted from "test failed" to "just poor code quality." Claude 4.5 Sonnet mainly improved code quality. GPT-5 performed significantly worse than the Anthropic models here, according to the study.


Time horizon analysis reveals a sevenfold overestimation

The researchers also ran a time horizon analysis that converts benchmark scores into the human completion time at which a model hits a 50 percent success rate. The gap is stark: Claude 4.5 Sonnet reaches a time horizon of about 50 minutes according to the automated checker, but only about 8 minutes when scored by maintainers. The study notes these estimates are less stable because SWE-bench Verified has a relatively narrow task-duration range, requiring extrapolation to estimate the 50 percent threshold.
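The time-horizon idea can be sketched as follows. A common approach (assumed here; the article doesn't spell out METR's exact model) is to fit a logistic curve of success probability against log task duration, then solve for the duration where it crosses 50 percent. The coefficients below are hypothetical, not from the study:

```python
import math

# Sketch of the 50% "time horizon" calculation.
# Assumed model: success probability is logistic in log task duration.
# The fitted coefficients a, b are hypothetical, not taken from METR.

def success_prob(minutes: float, a: float, b: float) -> float:
    """P(model solves a task) as a logistic function of log duration."""
    z = a + b * math.log(minutes)
    return 1.0 / (1.0 + math.exp(-z))

def time_horizon(a: float, b: float) -> float:
    """Duration (minutes) where success probability crosses 50%:
    solve a + b*log(t) = 0  =>  t = exp(-a/b)."""
    return math.exp(-a / b)

a, b = 3.0, -1.2  # hypothetical fit: success falls as tasks get longer
print(round(time_horizon(a, b), 1))  # → 12.2 minutes in this toy fit
```

Because SWE-bench tasks cluster in a narrow duration band, the 50 percent crossing often lies outside the observed data, which is why the study flags these extrapolated estimates as less stable.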

What METR describes here is something the field already knows: benchmarks are, at best, a proxy for real-world AI performance, and practical feedback matters more. The debate cuts both ways: plenty of developers say AI agents genuinely help with coding, while others flag serious concerns about the quality of AI-generated code. It's also worth noting that the AI coding hype of the last few months centers on newer models like GPT-5.2-Thinking and Claude Opus 4.5. These models weren't part of the test; the study only looked at their weaker predecessors.

That said, if the results hold up, the gap is still remarkably large. SWE-bench Verified is considered the gold standard for evaluating AI coding agents. If half of the solutions it marks as "passed" can't survive a real human review, that undermines individual model comparisons and chips away at the foundation on which investors, companies, and developers are building their expectations for AI-assisted software development.

Important findings, but with clear limitations

The researchers are upfront that the study doesn't prove a fundamental capability ceiling for these models. They say it's "plausible" that better prompting strategies and more targeted instructions could narrow the gap between automated evaluation and human judgment.

The comparison also wasn't apples to apples. Human developers can typically respond to feedback and revise their code, but the AI agents only got one shot. The review setup wasn't fully realistic either: the maintainers didn't have automated testing tools available, the problems came from older project states, and testing requirements were deliberately relaxed to give the AI agents a better chance.

So the study isn't about the maximum capability of these models. What it does show is that taking benchmark scores at face value overestimates how useful AI agents actually are without additional tweaks and human feedback. The researchers suspect similar distortions show up in other benchmarks too.

Co-author Joel Becker, a researcher at METR, also puts his own results in perspective on X. His main takeaway isn't that AI agents are fundamentally useless. With AI capabilities doubling every three to six months, even two- to tenfold performance gaps get closed fast.


Source: Becker via X | METR