Reasoning models like OpenAI's o3 are making rapid progress, especially on math and coding tasks. But how much further can this new training approach scale, and where might the limits be? A new analysis from Epoch AI digs into these questions.
Reasoning models are seen as a major next step for large language models. After traditional pre-training, these systems go through an extra phase called "reasoning training," where they're tuned with reinforcement learning to solve complex problems. OpenAI's o3 is one of the earliest of these models. According to benchmarks, it shows clear gains over its predecessor o1. The big question: How long can this kind of progress continue just by throwing more compute at the problem?
Epoch AI set out to answer that. Data analyst Josh You looked into how much compute is currently being invested in reasoning training—and how much headroom might be left.
OpenAI's tenfold scaling
OpenAI says it trained o3 with ten times as much reasoning compute as o1—just four months after o1's release. One OpenAI chart shows a tight link between compute and performance on the AIME math benchmark. Epoch AI believes these numbers refer to the compute for the second training phase, not the full model training.
OpenAI hasn't published absolute numbers. To fill the gap, Epoch AI looked at comparable models such as DeepSeek-R1, which achieved benchmark results similar to o1 and was reportedly trained with about 6e23 FLOP—at a cost of roughly $1 million.
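The ~$1 million figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below is not Epoch AI's method; the peak throughput, utilization, and rental price are assumed round numbers for illustration.

```python
# Back-of-the-envelope cost check for a ~6e23 FLOP training run.
# All three constants are assumptions, not reported figures.
H100_PEAK_FLOPS = 1e15    # ~BF16 dense peak of an H100, rounded
UTILIZATION = 0.3         # assumed effective hardware utilization
PRICE_PER_GPU_HOUR = 2.0  # assumed cloud rental price, USD

def training_cost_usd(total_flop: float) -> float:
    """Estimate rental cost for a run of the given total FLOP."""
    gpu_seconds = total_flop / (H100_PEAK_FLOPS * UTILIZATION)
    gpu_hours = gpu_seconds / 3600
    return gpu_hours * PRICE_PER_GPU_HOUR

print(f"${training_cost_usd(6e23):,.0f}")  # ≈ $1.1 million
```

Under these assumptions, a 6e23 FLOP run lands close to the reported $1 million, which suggests the public estimates are at least internally consistent.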
Different models, different approaches
Nvidia and Microsoft have also released reasoning models with publicly documented training data. Nvidia's Llama-Nemotron Ultra 253B used about 140,000 H100 GPU-hours—roughly 1e23 FLOP—for its reasoning phase, while Microsoft's Phi-4-reasoning used even less compute, under 1e20 FLOP. Both models relied heavily on synthetic training data generated by other AI systems. According to Epoch AI, this makes direct comparisons with models like o3 tricky.
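The step from GPU-hours to FLOP works the same way in reverse. As a minimal sketch, assuming an H100 peak of ~1e15 FLOP/s and an effective utilization of roughly 20% (both assumptions, not Nvidia's reported numbers), 140,000 GPU-hours comes out near 1e23 FLOP:

```python
# Converting reported GPU-hours into an effective-FLOP estimate.
# Peak rate and utilization are illustrative assumptions.
H100_PEAK_FLOPS = 1e15  # ~BF16 dense peak of an H100, rounded
UTILIZATION = 0.2       # assumed effective hardware utilization

def gpu_hours_to_flop(gpu_hours: float) -> float:
    """Effective FLOP delivered over the given number of GPU-hours."""
    return gpu_hours * 3600 * H100_PEAK_FLOPS * UTILIZATION

print(f"{gpu_hours_to_flop(140_000):.1e}")  # ~1.0e+23 FLOP
```

The utilization factor is the soft spot in any such conversion: real training runs rarely sustain peak throughput, so two analysts using the same GPU-hour figure can differ by several-fold in FLOP.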
Another issue: "Reasoning training" isn't precisely defined. In addition to reinforcement learning, some models use methods like supervised fine-tuning. It's not always clear which parts are included in compute estimates.
Still room to grow—but not without limits
For now, the compute spent on reasoning training is still far below the compute used in the largest full training runs, like Grok 3, which tops 1e26 FLOP. Today's reasoning phases typically fall between 1e23 and 1e24 FLOP.
Anthropic CEO Dario Amodei, quoted by Epoch AI, sees things similarly. He thinks investments of $1 million in reasoning training are enough for major progress right now. But companies are already working to push the cost of this second training phase into the hundreds of millions and beyond.
If the current pace continues—roughly tenfold increases every three to five months—reasoning compute could catch up with the total training compute of leading models as soon as next year. After that, according to You, growth will slow to around a 4x increase per year, matching the broader industry trend.
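The catch-up timeline follows directly from the growth rate. A minimal sketch, taking the upper end of today's reasoning compute (1e24 FLOP), a frontier run of ~1e26 FLOP, and a tenfold increase every four months (the midpoint of the three-to-five-month range):

```python
import math

REASONING_FLOP_NOW = 1e24  # assumed upper end of today's reasoning phases
FRONTIER_FLOP = 1e26       # ~Grok 3 scale total training compute
MONTHS_PER_10X = 4         # midpoint of the reported 3-5 month cadence

def months_to_catch_up(start: float, target: float,
                       months_per_10x: float) -> float:
    """Months until exponential growth at 10x per period reaches target."""
    return months_per_10x * math.log10(target / start)

print(months_to_catch_up(REASONING_FLOP_NOW, FRONTIER_FLOP,
                         MONTHS_PER_10X))  # 8.0
```

Two orders of magnitude at one order per four months is about eight months, which matches the "as soon as next year" projection; starting from 1e23 instead would add another four months.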
Barriers: data, domains, and development cost
Epoch AI points out that compute isn't the only bottleneck. Reasoning training needs huge amounts of high-quality, challenging tasks, which are hard to come by and even harder to generate synthetically. It's also unclear how well the approach will work outside of highly structured fields like math and programming. Still, projects like "Deep Research" in ChatGPT, which uses a custom-tuned version of o3, suggest there may be some flexibility.
There's another challenge, too: much of the behind-the-scenes work—choosing the right tasks, designing reward functions, and developing training strategies—is labor-intensive. These development costs usually aren't included in compute estimates.
Still, OpenAI and other developers remain optimistic, according to Epoch AI. So far, the scaling curves for reasoning training look a lot like the classic log-linear progress seen in pre-training. And o3 isn't just ahead in math—it also shows major gains in agent-based software tasks.
How long this progress will last depends on how efficiently reasoning training can keep scaling—technically, economically, and in the range of tasks it can cover.