
Reasoning models like OpenAI's o3 are making rapid progress, especially on math and coding tasks. But how much further can this new training approach scale, and where might the limits be? A new analysis from Epoch AI digs into these questions.


Reasoning models are seen as a major next step for large language models. After traditional pre-training, these systems go through an extra phase called "reasoning training," where they're tuned with reinforcement learning to solve complex problems. OpenAI's o3 is one of the earliest of these models. According to benchmarks, it shows clear gains over its predecessor o1. The big question: How long can this kind of progress continue just by throwing more compute at the problem?

Epoch AI set out to answer that. Data analyst Josh You looked into how much compute is currently being invested in reasoning training, and how much headroom might be left.

OpenAI's tenfold scaling

OpenAI says it trained o3 with ten times as much reasoning compute as o1—just four months after o1's release. One OpenAI chart shows a tight link between compute and performance on the AIME math benchmark. Epoch AI believes these numbers refer to the compute for the second training phase, not the full model training.


OpenAI hasn't published absolute numbers. To estimate them, Epoch AI looked at comparable models like DeepSeek-R1, which achieved benchmark results similar to o1 and was reportedly trained with about 6e23 FLOP, at a cost of roughly $1 million.

Different models, different approaches

Nvidia and Microsoft have also released reasoning models with publicly documented training data. Nvidia's Llama-Nemotron Ultra 253B used about 140,000 H100 GPU-hours—roughly 1e23 FLOP—for its reasoning phase, while Microsoft's Phi-4-reasoning used even less compute, under 1e20 FLOP. Both models relied heavily on synthetic training data generated by other AI systems. According to Epoch AI, this makes direct comparisons with models like o3 tricky.
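For readers who want to sanity-check the conversion from GPU-hours to FLOP, here is a rough back-of-envelope sketch. The H100 peak throughput and the utilization range are assumed ballpark values for large training runs, not figures from Nvidia or Epoch AI.

```python
# Back-of-envelope: convert the reported GPU-hours into a FLOP estimate.
# Assumed values (illustrative, not from the article): H100 dense BF16 peak
# of ~1e15 FLOP/s and hardware utilization between 20 and 40 percent.

GPU_HOURS = 140_000        # reported for Llama-Nemotron Ultra's reasoning phase
PEAK_FLOP_PER_S = 1e15     # assumed H100 peak throughput
SECONDS_PER_HOUR = 3_600

for utilization in (0.2, 0.3, 0.4):
    total_flop = GPU_HOURS * SECONDS_PER_HOUR * PEAK_FLOP_PER_S * utilization
    print(f"utilization {utilization:.0%}: ~{total_flop:.1e} FLOP")

# All three cases land on the order of 1e23 FLOP, matching the figure above.
```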

Another issue: "Reasoning training" isn't precisely defined. In addition to reinforcement learning, some models use methods like supervised fine-tuning. It's not always clear which parts are included in compute estimates.

Still room to grow—but not without limits

For now, the compute spent on reasoning training is still far below the totals of the largest AI training runs, such as Grok 3, which tops 1e26 FLOP. Today's reasoning phases typically fall between 1e23 and 1e24 FLOP.

Anthropic CEO Dario Amodei, quoted by Epoch AI, sees things similarly. He thinks investments of $1 million in reasoning training are enough for major progress right now. But companies are already working to push the cost of this second training phase into the hundreds of millions and beyond.
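To put those budget figures in perspective, here is a purely illustrative calculation that scales the DeepSeek-R1 data point mentioned above (about 6e23 FLOP for roughly $1 million) to larger budgets. It assumes cost per FLOP stays constant, which real training costs won't, so it is a sketch rather than a forecast.

```python
# Illustrative only: scale the DeepSeek-R1 data point (~6e23 FLOP for
# roughly $1 million) to larger budgets, assuming cost per FLOP stays flat.

R1_FLOP = 6e23             # reported reasoning-training compute for DeepSeek-R1
R1_COST_USD = 1e6          # reported cost, roughly $1 million
flop_per_dollar = R1_FLOP / R1_COST_USD

for budget_usd in (1e6, 1e8, 3e8):   # $1M today, then hundreds of millions
    print(f"${budget_usd:,.0f} -> ~{budget_usd * flop_per_dollar:.1e} FLOP")

# A few hundred million dollars would land near the ~1e26 FLOP scale of
# today's largest total training runs.
```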


If the current pace continues—roughly tenfold increases every three to five months—reasoning compute could catch up with the total training compute of leading models as soon as next year. After that, according to You, growth will slow to around a 4x increase per year, matching the broader industry trend.
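Here is a minimal sketch of that timeline, using assumed round numbers: reasoning training at about 1e24 FLOP today, a frontier of about 1e26 FLOP, and the midpoint of the three-to-five-month pace. It is back-of-envelope arithmetic for illustration, not Epoch AI's actual model.

```python
import math

# Minimal projection sketch (illustrative arithmetic, not Epoch AI's model).
# Assumptions: reasoning training at ~1e24 FLOP today, the largest total
# training runs at ~1e26 FLOP, and a tenfold increase every four months
# (midpoint of the three-to-five-month pace cited above).

start_flop = 1e24
frontier_flop = 1e26
months_per_10x = 4

tenfold_steps = math.log10(frontier_flop / start_flop)    # number of 10x jumps
months_to_parity = tenfold_steps * months_per_10x
print(f"Catch-up in roughly {months_to_parity:.0f} months")   # ~8 months

# After parity, growth is expected to slow to ~4x per year,
# in line with the broader industry trend:
print(f"Two years after parity: ~{frontier_flop * 4 ** 2:.0e} FLOP")
```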

Barriers: data, domains, and development cost

Epoch AI points out that compute isn't the only bottleneck. Reasoning training needs huge amounts of high-quality, challenging tasks, which are hard to come by and even harder to generate synthetically. It's also unclear how well the approach will work outside of highly structured fields like math and programming. Still, projects like "Deep Research" in ChatGPT, which uses a custom-tuned version of o3, suggest there may be some flexibility.

There's another challenge, too: much of the behind-the-scenes work—choosing the right tasks, designing reward functions, and developing training strategies—is labor-intensive. These development costs usually aren't included in compute estimates.

Still, OpenAI and other developers remain optimistic, according to Epoch AI. So far, the scaling curves for reasoning training look a lot like the classic log-linear progress seen in pre-training. And o3 isn't just ahead in math—it also shows major gains in agent-based software tasks.


How long this progress will last depends on how efficiently reasoning training can keep scaling—technically, economically, and in terms of content.

Summary
  • Reasoning models such as OpenAI's o3 have shown significant progress on math and programming tasks thanks to an additional training phase that relies heavily on extra computing power.
  • According to an analysis by Epoch AI, reasoning training is currently well below the computing limits of leading AI models but could grow significantly in the coming years.
  • According to Epoch AI, further scaling is primarily limited by the scarcity of high-quality tasks and the effort required to develop new training methods.