
Independent testing has found that OpenAI's o1 model solves just 30 percent of programming tasks in benchmark tests - not the 48.9 percent the company claimed. The findings add to a growing debate about how to measure AI capabilities.


In a new study using OpenAI's own coding benchmark, SWE-Bench Verified, AI researcher Alejandro Cuadron found what he calls a "surprising" gap. While OpenAI reported that its model could handle nearly half of the real-world programming tasks drawn from GitHub, Cuadron's testing shows it solving less than a third.

Anthropic's Sonnet 3.5 came out ahead, solving 53 percent of the tasks - though this might be because the model helped develop the test procedure itself. Notably, the less expensive Deepseek v3 model performed about as well as OpenAI's o1 in Cuadron's testing.

Bar chart: AI model performance on SWE-Bench Verified. Sonnet 3.5 leads with just over 50 percent of problems solved, while OpenAI's o1 reaches only 30 percent - well below the claimed 48.9 percent. | Image: via X

Why the big difference?

The gap between OpenAI's claims and Cuadron's findings comes down to testing methods. OpenAI used "Agentless," a framework that gives AI very specific instructions for solving programming tasks. Cuadron, on the other hand, used "OpenHands," which gives the AI more freedom in how it approaches problems.
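
To make that methodological difference concrete, here is a minimal Python sketch of the two evaluation styles. It is an illustration only, not the real tooling: neither Agentless nor OpenHands exposes this API, and every name in it (Task, query_model, run_tests) is a hypothetical placeholder.

# Illustrative sketch only - hypothetical placeholders, not the real
# Agentless or OpenHands APIs.
from dataclasses import dataclass

@dataclass
class Task:
    repo: str          # repository the GitHub issue comes from
    issue: str         # natural-language bug report
    fail_tests: list   # tests that must pass after the fix

def query_model(prompt: str) -> str:
    """Placeholder for a call to the model under evaluation."""
    raise NotImplementedError

def run_tests(repo: str, patch: str, tests: list) -> bool:
    """Placeholder: apply the patch, run the failing tests, report success."""
    raise NotImplementedError

def pipeline_style(task: Task) -> bool:
    # Agentless-style: the harness fixes the plan; the model only fills
    # in each step (locate the file, locate the lines, write the patch).
    file = query_model(f"Which file causes this issue?\n{task.issue}")
    lines = query_model(f"Which lines in {file} are faulty?\n{task.issue}")
    patch = query_model(f"Write a patch for {file}, lines {lines}.")
    return run_tests(task.repo, patch, task.fail_tests)

def agent_style(task: Task, max_steps: int = 30) -> bool:
    # OpenHands-style: the model chooses its own actions (inspect files,
    # run commands, edit code) until it decides to submit a patch.
    history = [f"Issue:\n{task.issue}"]
    for _ in range(max_steps):
        action = query_model("\n".join(history) + "\nNext action?")
        if action.startswith("SUBMIT"):
            patch = action.removeprefix("SUBMIT").strip()
            return run_tests(task.repo, patch, task.fail_tests)
        history.append(f"Action: {action}")  # execute action, log observation
    return False  # ran out of steps without submitting a patch

The key difference: in the first version the harness supplies the plan, while in the second the model must plan for itself - and open-ended planning is exactly where Cuadron says o1 falls short.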


Cuadron says OpenHands was already considered the gold standard when OpenAI ran its tests, yet the company chose not to use it. He suspects the more rigid testing method might favor models that simply reproduce memorized solutions rather than working out new problems independently.

This isn't just academic nitpicking. OpenAI has touted o1's supposed strength in reasoning and ability to tackle novel problems. "But why does O1 [sic] struggle with true open-ended planning despite its reasoning capabilities?" Cuadron writes.

Other research has cast doubt on these claims before. A recent travel planning benchmark showed o1-preview struggling with planning tasks, and an Apple study found that slightly varied math problems caused much sharper drops in performance - a sign that the model generalizes poorly, which undermines claims about its reasoning abilities.

The bigger picture

This situation highlights a persistent problem in AI evaluation: benchmark results depend heavily on testing methods. When companies can optimize their models for specific test procedures, it becomes nearly impossible for outsiders to gauge an AI's true capabilities. This matters because these benchmark results drive PR campaigns and marketing efforts, which in turn influence investor funding.

Logan Kilpatrick, who recently moved from OpenAI to become Head of Product at Google AI Studio, emphasizes the need for better verification and openness in testing procedures. While he doesn't suspect OpenAI of any deliberate deception, he sees solving these evaluation problems as crucial for developing more advanced AI systems.


"The world needs [more, better, harder, etc] evals for AI," Kilpatrick argues. "This is one of the most important problems of our lifetime, and critical for continual progress."

Summary
  • An independent study by AI researcher Alejandro Cuadron shows that OpenAI's latest language model, o1, solves only about 30% of the programming tasks in the SWE-Bench Verified benchmark, significantly lower than the nearly 49% claimed by the company.
  • The discrepancy in test results is due to the different methodologies used: OpenAI used a framework called "Agentless," which gives the AI strict guidelines for solving programming tasks, while Cuadron used "OpenHands," which allows the AI more freedom in solving problems.
  • This case illustrates that the results of AI benchmarks are influenced by various factors, making it difficult for outsiders to assess the true performance of an AI system and its practical value.