New math benchmark reveals AI models confidently solve problems that have no solution

A consortium of 64 mathematicians built a new benchmark for AI models that exposes two weaknesses: research-level math and the ability to recognize unsolvable tasks.

With today's frontier models already hitting IMO Gold level, AI research needs new math benchmarks. SOOHAK, developed at Carnegie Mellon University, EleutherAI, and Seoul National University, among others, consists of 439 original tasks.

They're split into two sections: a "Challenge" set with 340 problems at the graduate and research level, and a "Refusal" set with 99 intentionally flawed problems that contain contradictions or don't allow a clear answer.

Unlike common collections, SOOHAK wasn't pulled from competitions or textbooks. Every problem was written from scratch by a team of 38 professors, 25 PhD students and postdocs, and five IMO medalists. Before submitting, each contributor had to confirm they worked without AI help. Anyone caught sneaking in LLM-generated tasks was kicked out.

Flowchart of the SOOHAK data collection pipeline showing submission, automated LLM checks, manual moderation, revisions, and the final verified dataset.
The SOOHAK benchmark went through several collection and review stages: submission, automated LLM checks, manual moderation, revisions, and final inclusion in the dataset. | Image: Son et al.

Research-level math is still a wall

According to the authors, Google's Gemini 3 Pro scored highest on the challenge set at 30 percent, followed by the GPT-5 series (5.1 and 5.2) at 26 percent. Claude Opus 4.5 trails at 10 percent. Open-weight models like Kimi-2.5, Qwen3-235B, and GPT-OSS-120B all stay below 15 percent. And 124 of the challenge tasks went unsolved by every model tested.
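
To put that last figure in proportion, here is a quick calculation that uses only the numbers reported in this article (340 challenge problems, 124 of them solved by no model). The variable names are illustrative and do not come from the paper.

```python
# Figures taken from the article; variable names are illustrative.
CHALLENGE_TOTAL = 340   # problems in the Challenge set
UNSOLVED_BY_ALL = 124   # problems that no tested model solved

solved_by_any = CHALLENGE_TOTAL - UNSOLVED_BY_ALL
unsolved_share = UNSOLVED_BY_ALL / CHALLENGE_TOTAL
solved_share = solved_by_any / CHALLENGE_TOTAL

print(f"Solved by at least one model: {solved_by_any} ({solved_share:.1%})")
print(f"Solved by no model: {UNSOLVED_BY_ALL} ({unsolved_share:.1%})")
```

In other words, a bit more than a third of the challenge set is currently out of reach for every system tested, even when the models' results are pooled.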

Table of SOOHAK results for closed and open-weight AI models. Scores on SOOHAK-Mini are much higher than on research-level tasks and on recognizing flawed problems.
Most models still did well on the easier SOOHAK-Mini. On research-level tasks, scores drop sharply. On recognizing unsolvable problems, even the best model stays below 50 percent. | Image: Son et al.

On the easier companion set SOOHAK-Mini, which ranges from school olympiad to early college level, scores are much higher and the top models cluster closer together. The gap only opens up at research-level math, especially for open-weight models. The authors say this suggests open-weight systems transfer worse to unpublished material because they lack training coverage in niche areas.

When there's no solution, models guess anyway

The real break with earlier benchmarks is the refusal set. It contains problems that were flagged as unsolvable during quality control, because they're missing assumptions or contain contradictions. A model only gets credit if it spots and names the flaw instead of confidently producing a number.
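
The difference between the two scoring rules can be sketched in a few lines. This is a minimal illustration of the rule as described above, not the authors' grading code: the `Response` record, its fields, and the decision to give no credit when a model flags a flaw but still commits to an answer are all assumptions, and the article does not say how the benchmark actually detects that a flaw was named.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    """Hypothetical record of one graded model output."""
    final_answer: Optional[str]  # answer the model committed to, if any
    flagged_flaw: bool           # True if the model said the problem is flawed


def score_challenge(resp: Response, reference: str) -> bool:
    """Challenge item: credit only for matching the reference answer."""
    return resp.final_answer is not None and resp.final_answer.strip() == reference.strip()


def score_refusal(resp: Response) -> bool:
    """Refusal item: credit only if the flaw is flagged and no answer is asserted."""
    return resp.flagged_flaw and resp.final_answer is None


def refusal_rate(responses: list[Response]) -> float:
    """Share of refusal items on which the model earned credit."""
    if not responses:
        return 0.0
    return sum(score_refusal(r) for r in responses) / len(responses)


outputs = [
    Response(final_answer="42", flagged_flaw=False),  # confident guess: no credit
    Response(final_answer=None, flagged_flaw=True),   # names the flaw: credit
    Response(final_answer="7", flagged_flaw=True),    # hedges but still answers: no credit (assumption)
]
print(refusal_rate(outputs))
```

Under a rule like this, a model that always produces a number scores zero on the refusal set, no matter how strong it is on solvable problems.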

Line chart showing model rankings across SOOHAK subsets and carefulness-adjusted composite scores. Some models shift sharply in rank once refusal scoring is factored in.
The carefulness-adjusted rankings show that strong problem-solving and reliable rejection of unsolvable tasks don't go hand in hand. | Image: Son et al.

No model clears the 50 percent mark here. The open-weight GLM-5 performs best at just under 50 percent, beating both GPT-5 and Gemini 3 Pro. The Qwen3 family collapses to less than 3 percent, almost always failing to correctly flag a broken problem.

The authors describe detecting flawed problems as "a new optimization target that current models do not directly address." Solution rates climb almost linearly with bigger models and longer reasoning budgets. Refusal doesn't follow the same pattern. More compute makes models better at solving. It doesn't make them better at admitting a problem has no answer.

Three charts on SOOHAK: Qwen3 scaling by model size, test-time scaling with more compute budget, and share of unsolved tasks across the Mini, Challenge, and Refusal sets.
Bigger models and longer reasoning mainly boost solution rates on the challenge set. SOOHAK shows no comparable scaling for recognizing flawed tasks. | Image: Son et al.

Olympiad experience beats research depth

For a human comparison, the team recruited 25 participants across five groups, from IMO medalists to PhD mathematicians. On a selection of 79 tasks, the groups solved 51 percent between them. Only Gemini 3 Pro beat that combined human coverage, hitting 61 percent.

Bar chart comparing model and human accuracy on 79 SOOHAK tasks. Gemini-3-Pro reaches 60.8 percent while combined human coverage sits at 50.6 percent.
On the 79-task set, only Gemini 3 Pro exceeded the combined human coverage. Individual groups with Olympiad experience outperformed the PhD researchers. | Image: Son et al.
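
"Combined coverage" is read here as the share of tasks solved by at least one group, i.e. a set union. The sketch below shows that calculation on toy data (the group names and task ids are invented), then checks which whole-number task counts match the percentages reported for the 79-task set. Those counts are inferred from the percentages and are not stated in the article.

```python
def combined_coverage(solved_by_group: dict[str, set[int]], n_tasks: int) -> float:
    """Fraction of tasks solved by at least one group (set union)."""
    solved_by_anyone = set().union(*solved_by_group.values())
    return len(solved_by_anyone) / n_tasks


def counts_matching(pct: float, n_tasks: int) -> list[int]:
    """All integer solve counts that round to the reported percentage."""
    return [k for k in range(n_tasks + 1) if round(100 * k / n_tasks, 1) == pct]


# Toy data, not from the paper: task ids solved by each hypothetical group.
toy = {
    "group_a": {1, 2, 3},
    "group_b": {3, 4},
    "group_c": {2, 5},
}
print(combined_coverage(toy, n_tasks=10))  # union is {1, 2, 3, 4, 5} -> 0.5

N_TASKS = 79
print(counts_matching(50.6, N_TASKS))  # combined human coverage -> [40]
print(counts_matching(60.8, N_TASKS))  # Gemini 3 Pro accuracy -> [48]
```

Note that the comparison is lopsided in the humans' favor: the human figure pools five groups, while the model figure is a single system's accuracy.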

The PhD researchers actually did worse than students with Olympiad backgrounds. The authors chalk this up to format: the 4.5-hour time window rewards the short solution paths that math competitions train, while the benchmark's broad topic range doesn't help narrow research specialists. In this setup, SOOHAK primarily measures competitive math under time pressure, not research depth.

Dataset locked until 2026, and the format has gaps

The full dataset won't be public until the end of 2026, a precaution against training data contamination. Until then, the team will evaluate models on request. The authors are open about SOOHAK's shortcomings: requiring clean numerical answers leaves out large swaths of higher math that would be better tested through proofs, constructions, or counterexamples. A future version would need richer formats, like formal proof assistants or expert review panels.

How far AI models actually get in research math is still an open question. Fields Medalist Timothy Gowers recently said ChatGPT 5.5 Pro produced a PhD-level result in number theory in under two hours, turning an exponential bound into a polynomial one. GPT-5.2 Pro came up with a new proof of Erdős problem #281 that mathematician Terence Tao called "rather different" from earlier proofs.

Tao is careful not to read too much into those wins, though. When he ran a systematic check across open Erdős problems, the models' real success rate was just one to two percent, and mostly on the easier ones. That gap between a few flashy results and actual broad research skill is what SOOHAK tries to pin down.
