Google DeepMind has claimed its first gold medal at the International Mathematical Olympiad (IMO) with an "advanced version" of its Gemini model running in Deep Think mode.
The system solved five out of six problems in algebra, combinatorics, geometry, and number theory, earning 35 out of 42 possible points—enough for a gold medal, which only about eight percent of human participants achieve, according to the IMO. DeepMind says the solutions (PDF download) were reviewed by official IMO judges and described as "clear, precise and most of them easy to follow."
What makes this win stand out is the method: last year, DeepMind relied on formal languages like Lean and needed days of computation with AlphaProof and AlphaGeometry, but this time, Gemini Deep Think worked entirely in natural language.
The model produced full proofs directly from the official IMO problems, all within the four-and-a-half-hour time limit per session and without external tools or symbolic aids. DeepMind notes that Gemini faced the same problems and time constraints as human competitors.
The IMO model runs on the new "Deep Think" mode of Gemini 2.5 Pro, which Google introduced in May for complex reasoning tasks. This mode lets the model follow multiple hypotheses in parallel before generating an answer and is currently being tested with select users. For comparison, the standard Gemini 2.5 Pro scored only 31.5 percent of the possible points on the Olympiad's problems.
Gemini Deep Think was trained with specialized reinforcement learning methods to encourage multi-step reasoning, problem-solving, and theorem-proving. The IMO version also had more "thinking time," access to a curated set of high-quality solutions from previous IMO tasks, and general guidance on tackling these kinds of problems. DeepMind says these methods helped the model follow and combine several solution paths in parallel before settling on a final answer.
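DeepMind hasn't published implementation details, so it is unclear exactly how this parallel exploration works under the hood. Conceptually, though, "follow several solution paths, then commit to one answer" resembles familiar test-time strategies such as self-consistency voting or best-of-n selection. The sketch below is a minimal, hypothetical illustration of that general pattern; generate_candidate, score_candidate, and deep_think_style_answer are made-up stand-ins, not DeepMind's API or method.

```python
import concurrent.futures
from collections import Counter

# Hypothetical stand-ins: a real system would sample full natural-language
# proofs from a reasoning model. These stubs only illustrate the pattern.
def generate_candidate(problem: str, seed: int) -> str:
    """Produce one candidate solution path (stubbed for illustration)."""
    return f"candidate proof #{seed % 3} for: {problem}"

def score_candidate(candidate: str) -> float:
    """Assign a plausibility score to a candidate (stubbed for illustration)."""
    # A real system might use a learned verifier or a self-evaluation pass.
    return float(len(candidate))

def deep_think_style_answer(problem: str, n_paths: int = 8) -> str:
    """Explore several solution paths in parallel, then pick one final answer."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        candidates = list(pool.map(lambda s: generate_candidate(problem, s),
                                   range(n_paths)))

    # Option A: if several paths agree on the same answer, take the majority
    # (self-consistency voting).
    most_common, votes = Counter(candidates).most_common(1)[0]
    if votes > 1:
        return most_common

    # Option B: otherwise fall back to the highest-scoring candidate (best-of-n).
    return max(candidates, key=score_candidate)

if __name__ == "__main__":
    print(deep_think_style_answer("IMO 2025, Problem 1"))
```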
OpenAI also claims math gold
OpenAI announced its own IMO gold medal last weekend. According to OpenAI, one of its internal language models also solved five out of six Olympiad problems under competition conditions, with proofs reviewed by three former IMO gold medalists.
OpenAI says its model worked through two four-and-a-half-hour sessions with no internet access, code, or external tools—relying entirely on natural language. Like DeepMind, OpenAI notes that its model is a generalist reasoning system, not one trained exclusively for the IMO.
Until recently, this kind of result was considered nearly impossible. Even mathematician Terence Tao doubted in June that a language model could solve IMO problems in real time. The fact that two systems crossed this milestone at the same time marks a major shift.
A new phase for reasoning AI—with open questions
Both results suggest that advanced AI models with strong reasoning and reinforcement learning can now tackle complex math problems for hours at a stretch—without relying on symbolic tools.
However, both announcements leave questions unanswered. OpenAI hasn't shared any details about the model's architecture, training data, or the compute involved. DeepMind, for its part, hasn't said how scalable its Deep Think approach is or whether it transfers to other tasks and scientific fields. It's also unclear how consistently either system would perform on longer proofs or in other branches of mathematics.
Still, the results show that the approach works in practice, and for now, the details may matter less than the outcome. Sustained, accurate reasoning over hours has long been seen as a major hurdle for language models. The race for reasoning-capable AI is entering a new phase, and, at least in math, machines are moving much closer to human-level performance.