Reasoning models like OpenAI's o3 are making rapid progress, especially on math and coding tasks. But how much further can this new training approach scale, and where might the limits be? A new analysis from Epoch AI digs into these questions.
Reasoning models are seen as a major next step for large language models. After traditional pre-training, these systems go through an extra phase called "reasoning training," where they're tuned with reinforcement learning to solve complex problems. OpenAI's o3 is one of the earliest of these models. According to benchmarks, it shows clear gains over its predecessor o1. The big question: How long can this kind of progress continue just by throwing more compute at the problem?
Epoch AI set out to answer that. Data analyst Josh You looked into how much compute is currently being invested in reasoning training—and how much headroom might be left.
OpenAI's tenfold scaling
OpenAI says it trained o3 with ten times as much reasoning compute as o1—just four months after o1's release. One OpenAI chart shows a tight link between compute and performance on the AIME math benchmark. Epoch AI believes these numbers refer to the compute for the second training phase, not the full model training.
OpenAI hasn't published absolute numbers. To fill the gap, Epoch AI looked at comparable models such as DeepSeek-R1, which achieved benchmark results similar to o1 and was reportedly trained with about 6e23 FLOP—at a cost of roughly $1 million.
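The ~$1 million figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below is not Epoch AI's method; the peak throughput, utilization, and rental price are assumed round numbers for illustration.

```python
# Back-of-the-envelope cost check for a ~6e23 FLOP training run.
# All three constants are assumptions, not reported figures.
H100_PEAK_FLOPS = 1e15    # ~BF16 dense peak of an H100, rounded
UTILIZATION = 0.3         # assumed effective hardware utilization
PRICE_PER_GPU_HOUR = 2.0  # assumed cloud rental price, USD

def training_cost_usd(total_flop: float) -> float:
    """Estimate rental cost for a run of the given total FLOP."""
    gpu_seconds = total_flop / (H100_PEAK_FLOPS * UTILIZATION)
    gpu_hours = gpu_seconds / 3600
    return gpu_hours * PRICE_PER_GPU_HOUR

print(f"${training_cost_usd(6e23):,.0f}")  # ≈ $1.1 million
```

Under these assumptions, a 6e23 FLOP run lands close to the reported $1 million, which suggests the public estimates are at least internally consistent.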
Different models, different approaches
Nvidia and Microsoft have also released reasoning models with publicly documented training data. Nvidia's Llama-Nemotron Ultra 253B used about 140,000 H100 GPU-hours—roughly 1e23 FLOP—for its reasoning phase, while Microsoft's Phi-4-reasoning used even less compute, under 1e20 FLOP. Both models relied heavily on synthetic training data generated by other AI systems. According to Epoch AI, this makes direct comparisons with models like o3 tricky.
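The step from GPU-hours to FLOP works the same way in reverse. As a minimal sketch, assuming an H100 peak of ~1e15 FLOP/s and an effective utilization of roughly 20% (both assumptions, not Nvidia's reported numbers), 140,000 GPU-hours comes out near 1e23 FLOP:

```python
# Converting reported GPU-hours into an effective-FLOP estimate.
# Peak rate and utilization are illustrative assumptions.
H100_PEAK_FLOPS = 1e15  # ~BF16 dense peak of an H100, rounded
UTILIZATION = 0.2       # assumed effective hardware utilization

def gpu_hours_to_flop(gpu_hours: float) -> float:
    """Effective FLOP delivered over the given number of GPU-hours."""
    return gpu_hours * 3600 * H100_PEAK_FLOPS * UTILIZATION

print(f"{gpu_hours_to_flop(140_000):.1e}")  # ~1.0e+23 FLOP
```

The utilization factor is the soft spot in any such conversion: real training runs rarely sustain peak throughput, so two analysts using the same GPU-hour figure can differ by several-fold in FLOP.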
Another issue: "Reasoning training" isn't precisely defined. In addition to reinforcement learning, some models use methods like supervised fine-tuning. It's not always clear which parts are included in compute estimates.
Still room to grow—but not without limits
For now, the compute spent on reasoning training is still far below the compute used in the largest full training runs, like Grok 3, which tops 1e26 FLOP. Today's reasoning phases typically fall between 1e23 and 1e24 FLOP.
Anthropic CEO Dario Amodei, quoted by Epoch AI, sees things similarly. He thinks investments of $1 million in reasoning training are enough for major progress right now. But companies are already working to push the cost of this second training phase into the hundreds of millions and beyond.
If the current pace continues—roughly tenfold increases every three to five months—reasoning compute could catch up with the total training compute of leading models as soon as next year. After that, according to You, growth will slow to around a 4x increase per year, matching the broader industry trend.
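The catch-up timeline follows directly from the growth rate. A minimal sketch, taking the upper end of today's reasoning compute (1e24 FLOP), a frontier run of ~1e26 FLOP, and a tenfold increase every four months (the midpoint of the three-to-five-month range):

```python
import math

REASONING_FLOP_NOW = 1e24  # assumed upper end of today's reasoning phases
FRONTIER_FLOP = 1e26       # ~Grok 3 scale total training compute
MONTHS_PER_10X = 4         # midpoint of the reported 3-5 month cadence

def months_to_catch_up(start: float, target: float,
                       months_per_10x: float) -> float:
    """Months until exponential growth at 10x per period reaches target."""
    return months_per_10x * math.log10(target / start)

print(months_to_catch_up(REASONING_FLOP_NOW, FRONTIER_FLOP,
                         MONTHS_PER_10X))  # 8.0
```

Two orders of magnitude at one order per four months is about eight months, which matches the "as soon as next year" projection; starting from 1e23 instead would add another four months.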
Barriers: data, domains, and development cost
Epoch AI points out that compute isn't the only bottleneck. Reasoning training needs huge amounts of high-quality, challenging tasks, which are hard to come by and even harder to generate synthetically. It's also unclear how well the approach will work outside of highly structured fields like math and programming. Still, projects like "Deep Research" in ChatGPT, which uses a custom-tuned version of o3, suggest there may be some flexibility.
There's another challenge, too: much of the behind-the-scenes work—choosing the right tasks, designing reward functions, and developing training strategies—is labor-intensive. These development costs usually aren't included in compute estimates.
Still, OpenAI and other developers remain optimistic, according to Epoch AI. So far, the scaling curves for reasoning training look a lot like the classic log-linear progress seen in pre-training. And o3 isn't just ahead in math—it also shows major gains in agent-based software tasks.
How long this progress will last depends on how efficiently reasoning training can keep scaling—technically, economically, and in the range of tasks it can cover.