OpenAI reports breakthrough performance with its new o3 reasoning model. The company attributes much of this progress to reinforcement learning, a well-established AI training method that offers significant advantages for certain tasks but has clear limitations for others.

OpenAI researcher Nat McAleese explains that while the o models are still "just" large language models, they incorporate true reinforcement learning (RL) - similar to the approach used in DeepMind's AlphaGo.

Unlike traditional language models, which rely primarily on reinforcement learning from human feedback (RLHF), the o models learn through well-defined goals and scenarios. This mirrors AlphaGo's training process, where the system had a clear goal - to win the game - and was able to refine its strategy through countless simulated matches until it achieved superhuman performance.

The approach proves especially effective for programming and mathematics, where solutions can be clearly verified as right or wrong. Rather than simply predicting the next word in a sequence, o3 learns to construct chains of thought that lead to correct solutions, which explains its exceptional performance on mathematical and coding benchmarks.
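
To make the contrast with RLHF concrete, here is a minimal Python sketch of a "verifiable reward": the training signal comes from automatically checking the model's final answer rather than from a human preference label. The model object, its generate_chain_of_thought method, and the extract_final_answer helper are hypothetical placeholders for illustration, not OpenAI APIs or OpenAI's actual training code.

```python
# Sketch of a verifiable reward signal for RL on math or coding problems.
# Chains of thought that end in a verified answer get reinforced; those that
# don't are discouraged. Unlike RLHF, no human preference label is needed,
# only an automatic correctness check.

def extract_final_answer(chain_of_thought: str) -> str:
    """Naive parser: take whatever follows the last 'Answer:' marker."""
    return chain_of_thought.rsplit("Answer:", 1)[-1].strip()

def verifiable_reward(problem: str, reference_answer: str, model) -> float:
    """Return 1.0 if the model's final answer matches the reference, else 0.0."""
    chain_of_thought = model.generate_chain_of_thought(problem)  # free-form reasoning text
    final_answer = extract_final_answer(chain_of_thought)
    return 1.0 if final_answer == reference_answer.strip() else 0.0
```

For coding tasks, the same check can run unit tests against the generated code instead of comparing answer strings.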

Scaling up compute power

OpenAI divides this approach into two phases. The first, which they call "train-time compute," happens during the initial training. According to McAleese, it's this increased scale of reinforcement learning that explains why o3 performs so much better than o1. Then, when the model is actually running, they add extra computing power - what they term "test-time compute" - to help it better predict sequences of thoughts.
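
One common way to spend extra compute at inference time is to sample several independent chains of thought for the same problem and keep the most frequent final answer (often called self-consistency or majority voting). The sketch below illustrates that general idea using the same hypothetical model interface as above; OpenAI has not published the exact test-time procedure o1 and o3 use, so treat this as an illustration of the principle, not their method.

```python
from collections import Counter

def extract_final_answer(chain_of_thought: str) -> str:
    """Naive parser: take whatever follows the last 'Answer:' marker."""
    return chain_of_thought.rsplit("Answer:", 1)[-1].strip()

def solve_with_test_time_compute(model, problem: str, num_samples: int = 16) -> str:
    """Sample num_samples chains of thought and return the majority answer.

    Raising num_samples spends more inference compute per query in exchange
    for a better chance that the most common answer is the correct one.
    """
    answers = [
        extract_final_answer(model.generate_chain_of_thought(problem, temperature=0.8))
        for _ in range(num_samples)
    ]
    return Counter(answers).most_common(1)[0][0]
```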

This vision of combining reinforcement learning with language models isn't unique to OpenAI. DeepMind CEO Demis Hassabis described a similar future last summer, suggesting AI would merge "the strengths of AlphaGo-type systems with the amazing language capabilities of the large models." The company recently put this idea into practice with Gemini 2.0 Flash Thinking, which likely uses comparable training methods.

Facing real-world challenges

While researcher Noam Brown sees this scaling trend continuing, there's a catch: the approach requires significant computing resources. That's why OpenAI is already working on o3-mini, a smaller version that aims to maintain strong performance while using fewer resources. The company plans to release this streamlined model in late January.

Former OpenAI researcher and Tesla AI chief Andrej Karpathy recently highlighted some key limitations of using reinforcement learning in language models. When it comes to more subjective tasks like writing style or summarizing text - where success depends more on nuance than clear right or wrong answers - the earlier o1 model doesn't outperform GPT-4o and sometimes falls short. We don't yet have benchmark data showing how o3 handles these more open, "vibe"-based tasks.

Both models also face a greater test: proving themselves in complex real-world scenarios where problems are not neatly defined, may contain contradictions, and require extensive planning, an area that remains a weakness of o1.

Despite these challenges, o3's benchmark results tell an impressive story. Tamay Besiroglu, who helped develop the rigorous Frontier Math benchmark, says o3 has blown past expectations. While today's best models typically solve just two percent of these problems, o3 manages to crack about 25 percent - a level of performance Besiroglu thought was at least a year away.

Summary
  • OpenAI's o3 language model has made significant progress in benchmark results, particularly in programming and math tasks with clear right/wrong criteria, thanks to the use of reinforcement learning (RL) during training.
  • The performance of o3 is further enhanced by increased computational power during model inference, allowing it to excel at tasks with well-defined solutions.
  • However, the RL approach has limitations when it comes to more open-ended tasks without clear solutions, and the models still need to prove themselves in complex, real-world scenarios where problems are less clearly formulated than in benchmarks.