OpenAI reports breakthrough performance with its new o3 reasoning model. The company attributes much of this progress to reinforcement learning, a well-established AI training method that offers significant advantages for certain tasks but has clear limitations for others.

OpenAI researcher Nat McAleese explains that while the o models are still "just" large language models, they incorporate true reinforcement learning (RL) - similar to the approach used in DeepMind's AlphaGo.

Unlike traditional language models, which rely primarily on reinforcement learning from human feedback (RLHF), the o models learn through well-defined goals and scenarios. This mirrors AlphaGo's training process, where the system had a clear goal - to win the game - and was able to refine its strategy through countless simulated matches until it achieved superhuman performance.

The approach proves especially effective for programming and mathematics, where solutions can be clearly verified as right or wrong. Rather than simply predicting the next word in a sequence, o3 learns to construct chains of thought that lead to correct solutions, which explains its exceptional performance on mathematical and coding benchmarks.
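
To make the contrast with RLHF concrete, here is a minimal Python sketch of a "verifiable reward": the training signal comes from automatically checking the model's final answer rather than from a human preference label. The model object, its generate_chain_of_thought method, and the extract_final_answer helper are hypothetical placeholders for illustration, not OpenAI APIs or OpenAI's actual training code.

```python
# Sketch of a verifiable reward signal for RL on math or coding problems.
# Chains of thought that end in a verified answer get reinforced; those that
# don't are discouraged. Unlike RLHF, no human preference label is needed,
# only an automatic correctness check.

def extract_final_answer(chain_of_thought: str) -> str:
    """Naive parser: take whatever follows the last 'Answer:' marker."""
    return chain_of_thought.rsplit("Answer:", 1)[-1].strip()

def verifiable_reward(problem: str, reference_answer: str, model) -> float:
    """Return 1.0 if the model's final answer matches the reference, else 0.0."""
    chain_of_thought = model.generate_chain_of_thought(problem)  # free-form reasoning text
    final_answer = extract_final_answer(chain_of_thought)
    return 1.0 if final_answer == reference_answer.strip() else 0.0
```

For coding tasks, the same check can run unit tests against the generated code instead of comparing answer strings.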

Scaling up compute power

OpenAI divides this approach into two phases. The first, which they call "train-time compute," happens during the initial training. According to McAleese, it's this increased scale of reinforcement learning that explains why o3 performs so much better than o1. Then, when the model is actually running, they add extra computing power - what they term "test-time compute" - to help it better predict sequences of thoughts.
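
One common way to spend extra compute at inference time is to sample several independent chains of thought for the same problem and keep the most frequent final answer (often called self-consistency or majority voting). The sketch below illustrates that general idea using the same hypothetical model interface as above; OpenAI has not published the exact test-time procedure o1 and o3 use, so treat this as an illustration of the principle, not their method.

```python
from collections import Counter

def extract_final_answer(chain_of_thought: str) -> str:
    """Naive parser: take whatever follows the last 'Answer:' marker."""
    return chain_of_thought.rsplit("Answer:", 1)[-1].strip()

def solve_with_test_time_compute(model, problem: str, num_samples: int = 16) -> str:
    """Sample num_samples chains of thought and return the majority answer.

    Raising num_samples spends more inference compute per query in exchange
    for a better chance that the most common answer is the correct one.
    """
    answers = [
        extract_final_answer(model.generate_chain_of_thought(problem, temperature=0.8))
        for _ in range(num_samples)
    ]
    return Counter(answers).most_common(1)[0][0]
```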

This vision of combining reinforcement learning with language models isn't unique to OpenAI. DeepMind CEO Demis Hassabis described a similar future last summer, suggesting AI would merge "the strengths of AlphaGo-type systems with the amazing language capabilities of the large models." The company recently put this idea into practice with Gemini 2.0 Flash Thinking, which likely uses comparable training methods.

Facing real-world challenges

While researcher Noam Brown sees this scaling trend continuing, there's a catch: the approach requires significant computing resources. That's why OpenAI is already working on o3-mini, a smaller version that aims to maintain strong performance while using fewer resources. The company plans to release this streamlined model in late January.

Former OpenAI researcher and Tesla AI chief Andrej Karpathy recently highlighted some key limitations of using reinforcement learning in language models. When it comes to more subjective tasks like writing style or summarizing text - where success depends more on nuance than clear right or wrong answers - the earlier o1 model doesn't outperform GPT-4o and sometimes falls short. We don't yet have benchmark data showing how o3 handles these more open, "vibe"-based tasks.

Both models also face a greater test: proving themselves in complex real-world scenarios where problems are not neatly defined, may contain contradictions, and require extensive planning, an area that remains a weakness of o1.

Despite these challenges, o3's benchmark results tell an impressive story. Tamay Besiroglu, who helped develop the rigorous Frontier Math benchmark, says o3 has blown past expectations. While today's best models typically solve just two percent of these problems, o3 manages to crack about 25 percent - a level of performance Besiroglu thought was at least a year away.

Summary
  • OpenAI's o3 language model has made significant progress in benchmark results, particularly in programming and math tasks with clear right/wrong criteria, thanks to the use of reinforcement learning (RL) during training.
  • The performance of o3 is further enhanced by increased computational power during model inference, allowing it to excel at tasks with well-defined solutions.
  • However, the RL approach has limitations when it comes to more open-ended tasks without clear solutions, and the models still need to prove themselves in complex, real-world scenarios where problems are less clearly formulated than in benchmarks.