
An unreleased OpenAI model has reportedly solved five out of six International Mathematical Olympiad (IMO) problems under competition conditions. But the story isn't just about what it solved - it's about how it got there.


OpenAI says an experimental language model scored 35 out of 42 possible points in an internal IMO-style test - enough for a gold medal. Three former IMO winners independently graded the model's natural language proofs, which were evaluated just like submissions from human contestants. According to the company, the test mirrored real IMO rules: two four-and-a-half-hour sessions, no internet, no external tools or code - just text.

OpenAI claims the model wasn't specifically trained on IMO tasks. Instead, it was developed as a general-purpose reasoning model, drawing on recent advances in reinforcement learning and using substantial compute during inference. Researcher Alexander Wei emphasized in an X post that this was not a task-specific system, but one capable of autonomously generating complex, multi-page proofs. There are hints it might even be a multi-agent system.

Sustained reasoning without tools

What makes this achievement stand out is that the model reasoned consistently for hours at a time without any symbolic tools like code interpreters or mathematical software. That sets it apart from other high-performing systems such as DeepMind's AlphaProof, which rely on hybrid neuro-symbolic approaches.


Until recently, it was widely believed that language models couldn't sustain consistent mathematical reasoning over long sessions. As recently as June, mathematician Terence Tao said on the Lex Fridman Podcast that IMO-level problems were too difficult for AI to solve in real time. "You can't hire enough humans to grade those," Tao said, referring to the labor-intensive verification of long proofs in reinforcement learning training.

The result came as a surprise, even to prediction markets, which put the odds of an AI winning IMO gold before the end of 2025 at under 20 percent. (These forecasts used slightly stricter criteria.)

Both the markets and Tao seemed to assume that a reasoning model like o3 would need to be trained explicitly on IMO proofs, with expert feedback at every step. OpenAI, however, appears to have found a more general method for eliciting this behavior. Wei reiterated that the model was a generalist reasoning system rather than one tailored to the task.

OpenAI researcher Jerry Tworek says the reinforcement learning system used here also helped train ChatGPT Agent and the model that recently took second place in the Heuristic division of the AtCoder World Tour Finals, where it generated code non-stop for nearly ten hours.

Transparency questions

As usual, OpenAI's claims have sparked criticism. Gary Marcus called the achievement impressive but raised a list of questions in an X post: How is the model architecturally different from its predecessors? What were the costs per problem? Was the model trained on raw text or preprocessed data? And how transferable are these results to other scientific domains? So far, OpenAI has kept all those details under wraps.


OpenAI has faced similar criticism before, notably over a lack of transparency around the ARC-AGI benchmark: the ARC Prize Foundation found that the final o3 model performed worse than a previously tested preview version. It also only came to light after the fact that OpenAI had funded the supposedly independent FrontierMath benchmark, shortly after the company posted a record result there.

A scalable approach to reasoning?

In a recent essay, "How o3 and Grok 4 accidentally vindicated neurosymbolic AI," Marcus argued that modern AI models are increasingly relying on symbolic tools like code interpreters to overcome the limits of pure language models.

OpenAI's IMO system, on the other hand, worked entirely in text - no tools - which, if the results hold up, would be a notable exception. If the model's ability to generalize is confirmed, it could call Marcus's thesis into question, at least in part. Still, his main criticism remains: without methodological transparency, it's hard to interpret these achievements.

For now, OpenAI seems to have built a language model that can reason consistently for hours - without any external tools. That would have been difficult to imagine just a short time ago. The generalist reasoning approach appears to scale, at least for now. According to OpenAI, the next step is reasoning sessions that last several days.

Summary
  • An experimental AI model from OpenAI reportedly solved five out of six International Mathematical Olympiad problems under official competition rules, earning a gold medal score with natural language proofs graded by former IMO winners.
  • Unlike previous systems that rely on code or mathematical software, this model sustained complex reasoning for hours using only text, and OpenAI claims it was not specifically trained on IMO tasks but developed as a general-purpose reasoning system using reinforcement learning.
  • The achievement has drawn criticism due to OpenAI's lack of transparency about the model's architecture, training data, and costs, with experts questioning how much these results can be generalized to other fields and calling for more methodological details.
Max is the managing editor of THE DECODER, bringing his background in philosophy to explore questions of consciousness and whether machines truly think or just pretend to.