AI benchmarks are broken and the industry keeps using them anyway, study finds
Key Points
- AI benchmarks are surprisingly unreliable, Epoch AI finds: scores shift substantially with prompt wording, temperature settings, and the evaluation library used.
- The choice of API provider causes the widest swings in evaluation results: the same model can score very differently depending on which provider serves it.
- These variables compound, the researchers warn, so independent evaluators shouldn't expect to reproduce the official benchmark scores that model makers publish.
Benchmarks are supposed to measure AI model performance objectively. But according to an analysis by Epoch AI, results depend heavily on how the test is run. The research organization identifies numerous variables that are rarely disclosed but significantly affect outcomes.
The researchers split the problem sources into two categories: benchmark setup (how the test is run) and model access (how the model being tested is called). Both areas contain significant wiggle room that can skew final results, Epoch AI says.

Different implementations produce different results
Even with established tests like GPQA-Diamond, different libraries use different prompt formulations and temperature settings. The researchers compared four popular benchmark libraries and found discrepancies across the board: EleutherAI's evaluation harness, for example, uses a temperature of 0.0, OpenAI's simple-evals uses 0.5, and OpenAI's gpt-oss defaults to 1.0. In tests, the same model's results varied between 74 and 80 percent depending on configuration.
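To see how easily this happens, consider the sketch below. It queries the same question under two configurations that differ only in prompt wording and temperature, assuming an OpenAI-compatible API; the model name, question, and prompt templates are placeholders, not the settings of any actual library.

```python
# Minimal sketch: one question, two hypothetical eval configs.
# Assumes an OpenAI-compatible API; model name, question, and prompts are placeholders.
from openai import OpenAI

client = OpenAI()

QUESTION = "Which of the following best explains ...?"  # placeholder GPQA-style item

CONFIGS = {
    "harness_a": {"temperature": 0.0,
                  "prompt": f"Answer with a single letter (A-D).\n\n{QUESTION}"},
    "harness_b": {"temperature": 1.0,
                  "prompt": f"Think step by step, then give the final letter.\n\n{QUESTION}"},
}

for name, cfg in CONFIGS.items():
    response = client.chat.completions.create(
        model="example-model",  # placeholder: any chat model served by the endpoint
        messages=[{"role": "user", "content": cfg["prompt"]}],
        temperature=cfg["temperature"],
    )
    # A real harness would parse and score the answer; across many items,
    # differences in prompt wording and temperature alone shift aggregate accuracy.
    print(name, response.choices[0].message.content)
```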
The effect is even more pronounced with complex agent benchmarks like SWE-bench Verified. Here the scaffold, the software that controls the AI agent and provides it with tools, plays a central role. Switching scaffolds alone shifts scores by up to 11 percentage points for GPT-5 and up to 15 percentage points for Kimi K2 Thinking, according to Epoch AI. The researchers say scaffold choice has the "single biggest impact on the overall performance."
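What a scaffold actually does can be sketched in a few lines. The loop below is a deliberately simplified placeholder, not any real SWE-bench scaffold, but it shows the knobs that differ from one scaffold to the next.

```python
# Deliberately simplified placeholder for a scaffold, not any real SWE-bench harness:
# it wraps the model in a loop and decides which tools exist and when the run ends.
def run_agent(model_call, tools: dict, task: str, max_steps: int = 20) -> str:
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        action = model_call("\n".join(transcript))   # model proposes the next step
        if action.startswith("SUBMIT"):
            return action                            # scaffold decides when a run counts as finished
        tool_name, _, arg = action.partition(" ")
        result = tools.get(tool_name, lambda a: "unknown tool")(arg)
        transcript.append(f"{action}\n{result}")     # scaffold controls what the model gets to see
    return "FAILED: step limit reached"
```

Every one of these choices, from the available tools to the step budget and the stop condition, varies between scaffolds, which helps explain why the same model can land far apart on the same benchmark.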

API provider choice causes the widest score swings
The API provider causes the widest swings in evaluation results. Epoch AI tested several open-source models across different providers and found significant score variations for the same model.

These discrepancies stem from multiple sources: rate limits, empty or truncated responses, lower token limits than advertised, and incorrectly passed parameters. MiniMax reports a 23 percentage point difference in tau-bench between its own API implementation and standard interfaces.
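Catching these failure modes requires defensive client code on the evaluator's side. The sketch below assumes the OpenAI Python SDK pointed at an OpenAI-compatible endpoint; the model name, token limit, and retry policy are illustrative placeholders, not Epoch AI's harness.

```python
# Illustrative sketch of provider-side failure modes an evaluator has to handle.
# Model name, token limit, and retry policy are placeholders.
import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def query_with_checks(prompt: str, retries: int = 3):
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model="example-open-model",   # placeholder model name
                messages=[{"role": "user", "content": prompt}],
                max_tokens=4096,
            )
        except RateLimitError:
            time.sleep(2 ** attempt)          # back off instead of dropping the sample
            continue

        choice = response.choices[0]
        text = choice.message.content or ""
        if not text.strip():
            continue                          # empty response: retry rather than score a blank
        if choice.finish_reason == "length":
            # Truncated output: the provider's real token limit may be lower than advertised.
            return None
        return text
    return None
```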
Newer models like GLM-4.6 tend to be served less reliably than established models like Qwen3, the researchers found. That makes quick evaluations right after a model's release difficult, exactly when interest is highest.
Test environments can be gamed
The execution environment also creates pitfalls. OpenAI could only run 477 out of 500 SWE-bench problems in its o3 and o4-mini evaluations due to "infrastructure challenges." Sometimes test environments contain critical bugs that let agents "hack the eval," Epoch AI says. Conversely, bugs can prevent agents from completing tasks at all.
Evaluations that give agents web access are especially vulnerable. In the worst case, the agent can find the original dataset or pages that republish parts of the problem.
The coding model IQuest-Coder offers a recent example: the 40-billion-parameter model beat significantly larger competitors on SWE-bench, a benchmark that tests whether AI models can fix real software bugs from GitHub repositories. But as developer Xeophon discovered on X, the test environment was apparently misconfigured and contained the complete Git history, including future commits.
The model exploited this bug and simply read existing solutions from the version history instead of solving problems on its own. Still, IQuest-Coder attracted significant attention in the first few days after release, showing how quickly impressive benchmark results can go viral before methodological flaws surface.
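A sanity check along the lines of the sketch below might have caught the problem before release. It is purely illustrative: it assumes a local clone of the evaluation repository, and the path and cutoff date are placeholders.

```python
# Illustrative check: does the benchmark repo contain commits newer than the
# task's cutoff date? If so, the "future" fix may be visible to the agent.
import subprocess
from datetime import datetime, timezone

CUTOFF = datetime(2024, 1, 1, tzinfo=timezone.utc)   # placeholder task cutoff date

commit_dates = subprocess.run(
    ["git", "log", "--all", "--format=%cI"],         # ISO committer dates, all refs
    capture_output=True, text=True, check=True, cwd="path/to/eval/repo",
).stdout.splitlines()

leaked = [d for d in commit_dates if datetime.fromisoformat(d) > CUTOFF]
if leaked:
    print(f"{len(leaked)} commits postdate the cutoff - possible solution leakage")
```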
Benchmark reliability problems run deep
The problems with AI benchmarks aren't new. An independent study previously showed that OpenAI's o1 achieved widely varying results in programming tests depending on the framework used. A comprehensive study of 445 benchmark papers also revealed fundamental methodological weaknesses: nearly all benchmarks examined had flaws in definitions, task selection, or statistical analysis.
The researchers warn that many small variables add up across the entire stack. The result is numbers that deviate considerably from what model developers report. For evaluators, this means laborious and costly experiments to replicate published results, one of the main reasons independent evaluations of open-source models take so long.
Transparency issues extend to benchmark funding too: OpenAI secretly funded the development of Epoch AI's major math benchmark, FrontierMath.