New math benchmark reveals AI models confidently solve problems that have no solution
A consortium of 64 mathematicians built a new benchmark for AI models that exposes two weaknesses: research-level math and the ability to recognize unsolvable tasks.
With today's frontier models already hitting IMO Gold level, AI research needs new math benchmarks. SOOHAK, developed at Carnegie Mellon University, EleutherAI, and Seoul National University, among others, consists of 439 original tasks.
They're split into two sections: a "Challenge" set with 340 problems at the graduate and research level, and a "Refusal" set with 99 intentionally flawed problems that contain contradictions or don't allow a clear answer.
Unlike common collections, SOOHAK wasn't pulled from competitions or textbooks. Every problem was written from scratch by a team of 38 professors, 25 PhD students and postdocs, and five IMO medalists. Before submitting, each contributor had to confirm they worked without AI help. Anyone caught sneaking in LLM-generated tasks was kicked out.

Research-level math is still a wall
According to the authors, Google's Gemini 3 Pro scored highest on the challenge set at 30 percent, followed by GPT-5 (5.1, 5.2) at 26 percent. Claude Opus 4.5 drops to 10 percent. Open-weight models like Kimi-2.5, Qwen3-235B, and GPT-OSS-120B all stay below 15 percent. And 124 of the challenge tasks were not solved by any model tested.

On the easier companion set SOOHAK-Mini, which ranges from school Olympiad to early college level, scores are much higher, and the top models cluster closer together. The gap only opens up at research-level math, especially for open-weight models. The authors say this suggests open-weight systems transfer worse to unpublished material because they lack training coverage in niche areas.

When there's no solution, models guess anyway
The real break with earlier benchmarks is the refusal set. It contains problems that were flagged as unsolvable during quality control, because they're missing assumptions or contain contradictions. A model only gets credit if it spots and names the flaw instead of confidently producing a number.
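The scoring rule described above can be sketched in a few lines. This is only an illustration, not the authors' grader: the `Response` class, its fields, and the rule that a response earns credit only if it flags the flaw and commits to no answer are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    """Hypothetical container for one graded model output."""
    flagged_flaw: bool       # True if the model named a flaw in the problem
    answer: Optional[str]    # final answer the model committed to, if any


def score_refusal_item(response: Response) -> int:
    # Assumption: credit requires flagging the flaw AND not committing
    # to a final answer. Confident numbers earn nothing.
    return int(response.flagged_flaw and response.answer is None)


def refusal_rate(responses: list[Response]) -> float:
    # Share of refusal-set items on which the model earned credit.
    if not responses:
        return 0.0
    return sum(score_refusal_item(r) for r in responses) / len(responses)
```

Under this rule, a model that answers every flawed problem with a number scores zero on the refusal set, no matter how plausible its answers look.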

No model clears the 50 percent mark here. The open-weight GLM-5 performs best at just under 50 percent, beating both GPT-5 and Gemini 3 Pro. The Qwen3 family collapses to less than 3 percent, almost always failing to correctly flag a broken problem.
The authors describe detecting flawed problems as "a new optimization target that current models do not directly address." Solution rates climb almost linearly with bigger models and longer reasoning budgets. Refusal doesn't follow the same pattern. More compute makes models better at solving. It doesn't make them better at admitting a problem has no answer.

Olympiad experience beats research depth
For a human comparison, the team recruited 25 participants across five groups, from IMO medalists to PhD mathematicians. On a selection of 79 tasks, the groups together solved 51 percent. Only Gemini 3 Pro beat that combined human coverage, hitting 61 percent.
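One way to read "combined coverage" is as the share of tasks solved by at least one group. The sketch below assumes that reading; the function name, the group labels, and the toy data are hypothetical and are not taken from the paper.

```python
def combined_coverage(solved_by_group: dict[str, set[int]], num_tasks: int) -> float:
    """Fraction of tasks solved by at least one group (assumed definition)."""
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    # A task counts once, even if several groups solved it.
    solved_by_anyone = set().union(*solved_by_group.values())
    return len(solved_by_anyone) / num_tasks


# Toy data, not results from the study: task 3 is solved by both groups.
toy_groups = {"group_a": {1, 2, 3}, "group_b": {3, 4}}
toy_coverage = combined_coverage(toy_groups, num_tasks=10)
```

Under this definition, the combined figure can sit well above any single group's score, which matters when comparing it with one model's result.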

The PhD researchers actually did worse than students with Olympiad backgrounds. The authors chalk this up to format: the 4.5-hour time window rewards short solution paths trained in math competitions, while the benchmark's broad topic range doesn't help narrow research specialists. SOOHAK primarily measures competitive math under time pressure, not research depth.

Dataset locked until 2026, and the format has gaps
The full dataset won't be public until the end of 2026, a precaution against training data contamination. Until then, the team will evaluate models on request. The authors are open about SOOHAK's shortcomings: requiring clean numerical answers leaves out large swaths of higher math that would be better tested through proofs, constructions, or counterexamples. A future version would need richer formats, like formal proof assistants or expert review panels.
How far AI models actually get in research math is still an open question. Fields Medalist Timothy Gowers recently said ChatGPT 5.5 Pro produced a PhD-level result in number theory in under two hours, turning an exponential bound into a polynomial one. GPT-5.2 Pro came up with a new proof of Erdős problem #281 that mathematician Terence Tao called "rather different" from earlier proofs.
Tao is careful not to read too much into those wins, though. When he ran a systematic check across open Erdős problems, the models' real success rate was just one to two percent, and mostly on the easier ones. That gap between a few flashy results and actual broad research skill is what SOOHAK tries to pin down.