AI benchmarks are broken and the industry keeps using them anyway, study finds
Key Points
- AI benchmarks are surprisingly unreliable, Epoch AI finds: scores shift substantially with prompt wording, temperature settings, and the evaluation library used.
- The choice of API provider causes the widest swings in evaluation results: the same model can score very differently depending on which provider serves it.
- These variables compound, the researchers warn, so independent evaluators shouldn't expect to reproduce the official benchmark scores that model makers publish.
Benchmarks are supposed to measure AI model performance objectively. But according to an analysis by Epoch AI, results depend heavily on how the test is run. The research organization identifies numerous variables that are rarely disclosed but significantly affect outcomes.
The researchers split the problem sources into two categories: benchmark setup (how the test is run) and model access (how the model being tested is called). Both areas contain significant wiggle room that can skew final results, Epoch AI says.

Different implementations produce different results
Even with established tests like GPQA-Diamond, different libraries use different prompt formulations and temperature settings. The researchers compared four popular benchmark libraries and found discrepancies across the board: EleutherAI's evaluation harness, for example, uses a temperature of 0.0, OpenAI's simple-evals uses 0.5, and OpenAI's gpt-oss defaults to 1.0. In tests, the same model's results varied between 74 and 80 percent depending on configuration.
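To see how easily this happens, consider the sketch below. It queries the same question under two configurations that differ only in prompt wording and temperature, assuming an OpenAI-compatible API; the model name, question, and prompt templates are placeholders, not the settings of any actual library.

```python
# Minimal sketch: one question, two hypothetical eval configs.
# Assumes an OpenAI-compatible API; model name, question, and prompts are placeholders.
from openai import OpenAI

client = OpenAI()

QUESTION = "Which of the following best explains ...?"  # placeholder GPQA-style item

CONFIGS = {
    "harness_a": {"temperature": 0.0,
                  "prompt": f"Answer with a single letter (A-D).\n\n{QUESTION}"},
    "harness_b": {"temperature": 1.0,
                  "prompt": f"Think step by step, then give the final letter.\n\n{QUESTION}"},
}

for name, cfg in CONFIGS.items():
    response = client.chat.completions.create(
        model="example-model",  # placeholder: any chat model served by the endpoint
        messages=[{"role": "user", "content": cfg["prompt"]}],
        temperature=cfg["temperature"],
    )
    # A real harness would parse and score the answer; across many items,
    # differences in prompt wording and temperature alone shift aggregate accuracy.
    print(name, response.choices[0].message.content)
```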
The effect is even more pronounced with complex agent benchmarks like SWE-bench Verified. Here the scaffold, the software that controls the AI agent and provides it with tools, plays a central role. Switching scaffolds alone shifts scores by up to 11 percentage points for GPT-5 and up to 15 percentage points for Kimi K2 Thinking, according to Epoch AI. The researchers say scaffold choice has the "single biggest impact on the overall performance."
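What a scaffold actually does can be sketched in a few lines. The loop below is a deliberately simplified placeholder, not any real SWE-bench scaffold, but it shows the knobs that differ from one scaffold to the next.

```python
# Deliberately simplified placeholder for a scaffold, not any real SWE-bench harness:
# it wraps the model in a loop and decides which tools exist and when the run ends.
def run_agent(model_call, tools: dict, task: str, max_steps: int = 20) -> str:
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        action = model_call("\n".join(transcript))   # model proposes the next step
        if action.startswith("SUBMIT"):
            return action                            # scaffold decides when a run counts as finished
        tool_name, _, arg = action.partition(" ")
        result = tools.get(tool_name, lambda a: "unknown tool")(arg)
        transcript.append(f"{action}\n{result}")     # scaffold controls what the model gets to see
    return "FAILED: step limit reached"
```

Every one of these choices, from the available tools to the step budget and the stop condition, varies between scaffolds, which helps explain why the same model can land far apart on the same benchmark.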

API provider choice causes the widest score swings
The API provider causes the widest swings in evaluation results. Epoch AI tested several open-source models across different providers and found significant score variations for the same model.

These discrepancies stem from multiple sources: rate limits, empty or truncated responses, lower token limits than advertised, and incorrectly passed parameters. MiniMax reports a 23 percentage point difference in tau-bench between its own API implementation and standard interfaces.
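Catching these failure modes requires defensive client code on the evaluator's side. The sketch below assumes the OpenAI Python SDK pointed at an OpenAI-compatible endpoint; the model name, token limit, and retry policy are illustrative placeholders, not Epoch AI's harness.

```python
# Illustrative sketch of provider-side failure modes an evaluator has to handle.
# Model name, token limit, and retry policy are placeholders.
import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def query_with_checks(prompt: str, retries: int = 3):
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model="example-open-model",   # placeholder model name
                messages=[{"role": "user", "content": prompt}],
                max_tokens=4096,
            )
        except RateLimitError:
            time.sleep(2 ** attempt)          # back off instead of dropping the sample
            continue

        choice = response.choices[0]
        text = choice.message.content or ""
        if not text.strip():
            continue                          # empty response: retry rather than score a blank
        if choice.finish_reason == "length":
            # Truncated output: the provider's real token limit may be lower than advertised.
            return None
        return text
    return None
```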
Newer models like GLM-4.6 tend to be served less reliably than established models like Qwen3, the researchers found. That makes quick evaluations right after a model's release difficult, exactly when interest is highest.
Test environments can be gamed
The execution environment also creates pitfalls. OpenAI could only run 477 out of 500 SWE-bench problems in its o3 and o4-mini evaluations due to "infrastructure challenges." Sometimes test environments contain critical bugs that let agents "hack the eval," Epoch AI says. Conversely, bugs can prevent agents from completing tasks at all.
Evaluations that give agents web access are especially vulnerable. In the worst case, the agent can find the original dataset or pages that republish parts of the problem.
The coding model IQuest-Coder offers a recent example: the 40-billion-parameter model beat significantly larger competitors on SWE-bench, a benchmark that tests whether AI models can fix real software bugs from GitHub repositories. But as developer Xeophon discovered on X, the test environment was apparently misconfigured and contained the complete Git history, including future commits.
The model exploited this bug and simply read existing solutions from the version history instead of solving problems on its own. Still, IQuest-Coder attracted significant attention in the first few days after release, showing how quickly impressive benchmark results can go viral before methodological flaws surface.
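A sanity check along the lines of the sketch below might have caught the problem before release. It is purely illustrative: it assumes a local clone of the evaluation repository, and the path and cutoff date are placeholders.

```python
# Illustrative check: does the benchmark repo contain commits newer than the
# task's cutoff date? If so, the "future" fix may be visible to the agent.
import subprocess
from datetime import datetime, timezone

CUTOFF = datetime(2024, 1, 1, tzinfo=timezone.utc)   # placeholder task cutoff date

commit_dates = subprocess.run(
    ["git", "log", "--all", "--format=%cI"],         # ISO committer dates, all refs
    capture_output=True, text=True, check=True, cwd="path/to/eval/repo",
).stdout.splitlines()

leaked = [d for d in commit_dates if datetime.fromisoformat(d) > CUTOFF]
if leaked:
    print(f"{len(leaked)} commits postdate the cutoff - possible solution leakage")
```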
Benchmark reliability problems run deep
The problems with AI benchmarks aren't new. An independent study previously showed that OpenAI's o1 achieved widely varying results in programming tests depending on the framework used. A comprehensive study of 445 benchmark papers also revealed fundamental methodological weaknesses: nearly all benchmarks examined had flaws in definitions, task selection, or statistical analysis.
The researchers warn that many small variables add up across the entire stack. The result is numbers that deviate considerably from what model developers report. For evaluators, this means laborious and costly experiments to replicate published results, one of the main reasons independent evaluations of open-source models take so long.
Transparency issues extend to benchmark funding too: OpenAI secretly funded the development of Epoch AI's major math benchmark, FrontierMath.