
A Stanford professor has spent the past year testing the same unsolved math problem on OpenAI's models, unintentionally tracking their progress in self-assessment along the way.


"I've actually been emailing with the Stanford mathematics professor. He emailed me about a year ago before we announced o1 and said, 'Hey, do you want to do a collaboration on solving hard math problems?' Basically, I told him I think we just have to advance general reasoning capabilities, and eventually they're going to be able to help you with your hard math problems. I think that's actually the most promising route to getting there. He was a little skeptical, but with every model release, every reasoning model release, he emails me with a follow-up and asks, 'Can it solve this problem now?' I plug them in and send him the output, and he says, 'Yeah, that's wrong,'" recalls Noam Brown of OpenAI.

But after OpenAI's recent breakthrough at the International Mathematical Olympiad, something important has changed: "He emailed me a follow-up this time with the same problem, asking, 'Hey, can it solve it now?' It still can't solve it, but at least this time it recognizes that it can't solve it, so I think that's a big step." Instead of hallucinating, the model simply said "no answer" to this year's hardest IMO problem. As Brown puts it, "I think it was good to see the model doesn't try to hallucinate or just make up some solution, but instead will say 'no answer.'"

This accidental long-term study reveals a kind of progress that standard benchmarks have missed: the models might be getting a little better at recognizing their own limitations, rather than generating confident but wrong answers.


A Spanish research team takes a similar view of the much-discussed results from Apple's reasoning study. There too, reasoning models like o3 stopped their output prematurely. While Apple's researchers saw this as a simple failure, the Spanish team argues it's evidence of a learned strategy: the models realize they've hit a wall and stop.

It will likely be some time before we're fully protected from AI-generated bullshit. OpenAI plans to make its IMO model available to mathematicians for testing, but the core improvements behind this progress are not expected to appear in commercial models for several more months.

Summary
  • A Stanford professor has been testing the same unsolved math problem on OpenAI's models for a year, and recently noticed that while the models still can't solve it, they now correctly admit when they don't have an answer instead of making up a solution.
  • This shift—from producing confident but wrong answers to acknowledging limitations—highlights progress that traditional benchmarks often overlook, as the models become better at self-assessment.
  • Similar findings have emerged from other research: a Spanish team suggests that when AI models stop their output early on difficult reasoning tasks, it may be a sign they have learned to recognize when they're stuck rather than simply failing. The core improvements behind these breakthroughs are not expected in commercial products for several more months.