
A team of researchers from MIT, Cornell University, the University of Washington and Microsoft Research developed a framework called "Reinforcement Learning via Self-Play" (RLSP) that teaches large language models to spend more time working through problems. The approach mirrors techniques used in successful AI models like OpenAI's o1, o3, Deepseek's R1 and Google's Gemini.


RLSP works in three stages: first, the model learns from examples of human or AI reasoning (supervised fine-tuning, SFT). Then it is rewarded via reinforcement learning (RL) for exploring different approaches to problems. Finally, a verifier checks answers to ensure accuracy and prevent shortcuts.
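A rough idea of how these stages could fit together is sketched below. This is a minimal Python illustration of the setup as described in the article; every function name and stub is a hypothetical placeholder, not the authors' published code.

```python
# Illustrative sketch of the three-stage RLSP setup described above.
# All names and stubs are hypothetical; this is not the paper's actual code.
from typing import Callable, List, Tuple

def sft_stage(train_step: Callable[[str, str], None],
              demos: List[Tuple[str, str]]) -> None:
    """Stage 1 (SFT): imitate human or AI step-by-step reasoning traces."""
    for prompt, target in demos:
        train_step(prompt, target)  # plain next-token supervision on the full trace

def rl_stage(generate: Callable[[str], Tuple[str, str]],
             reinforce: Callable[[str, str, float], None],
             reward_fn: Callable[[str, str, str], float],
             problems: List[str]) -> None:
    """Stages 2 and 3: sample reasoning, score it with a verifier-plus-exploration
    reward, and push the policy toward higher-reward traces."""
    for problem in problems:
        trace, answer = generate(problem)           # model writes out its reasoning
        reward = reward_fn(problem, trace, answer)  # verified correctness + exploration bonus
        reinforce(problem, trace + answer, reward)  # policy-gradient-style update

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    log = []
    sft_stage(lambda p, t: log.append(("sft", p)), [("2+2?", "2 + 2 = 4. Answer: 4")])
    rl_stage(generate=lambda p: ("Try 2 + 2 = 4. Double-check by counting. ", "4"),
             reinforce=lambda p, t, r: log.append(("rl", p, round(r, 2))),
             reward_fn=lambda p, tr, a: (1.0 if a == "4" else 0.0) + 0.1,
             problems=["2+2?"])
    print(log)
```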

Testing shows promising results. When applied to Llama models, RLSP improved scores on the MATH 500 dataset by 23%. The Qwen2.5-32B-Instruct model saw a 10% boost on AIME 2024 math problems. Even with basic rewards for showing work, the models developed interesting behaviors like backtracking, exploring multiple solutions, and double-checking their answers.

These results are largely in line with findings reported by the team behind Deepseek R1 and R1-Zero, as well as recently by researchers from IN.AI, Tsinghua University and Carnegie Mellon University.

The team also notes, however, that RLSP does not yet lead to higher forms of reasoning in their experiments.

The most notable finding isn't just better test scores, but how the models learn to solve problems. Even without specific training examples, the models developed several useful behaviors across different types of problems when given small rewards for exploration.

Why RLSP works

The researchers think they know why this works: Recent studies show that "chain-of-thought" reasoning - where models write out their thinking step by step - gives them more computational power to solve problems. RLSP encourages models to create new reasoning paths through "self-play," similar to how AI learned to master games like chess and Go.

The reward system encourages models to show all their work, even when some approaches don't lead to the right answer. When a model finds the correct solution through a longer reasoning process, it gets full credit. This generates new examples of step-by-step reasoning that help the model improve.
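To make that concrete, a reward along these lines could give full credit for a verified answer regardless of how long the reasoning took, plus a small capped bonus for the visible work, so dead-end explorations are not penalized entirely. The exact shape of the bonus below is an assumption for illustration, not taken from the paper.

```python
def shaped_reward(answer_correct: bool, reasoning_steps: int,
                  bonus_weight: float = 0.1, max_steps: int = 50) -> float:
    """Hypothetical RLSP-style reward: full credit for a verified answer,
    plus a small capped bonus for showing work, even on failed approaches."""
    correctness = 1.0 if answer_correct else 0.0
    exploration = bonus_weight * min(reasoning_steps, max_steps) / max_steps
    return correctness + exploration

# A long but correct derivation still gets full credit plus the bonus ...
print(shaped_reward(answer_correct=True, reasoning_steps=40))   # ~1.08
# ... while a wrong attempt that showed its work is not zeroed out.
print(shaped_reward(answer_correct=False, reasoning_steps=40))  # ~0.08
```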

Questions for future research

The team notes several remaining challenges. They want to know how models could adjust their thinking time based on problem difficulty - spending less time on simple math and more on complex proofs. They're also curious about how context length affects reasoning and whether these behaviors truly go beyond what models see in training data.

Other open questions include whether pure reinforcement learning without exploration rewards could improve reasoning in larger models, and what additional training methods might help models develop higher-level thinking skills like forming theorems and tackling open-ended problems.

Summary
  • Researchers from MIT, Cornell University, the University of Washington, and Microsoft Research have developed a framework called "Reinforcement Learning via Self-Play" (RLSP) that teaches large language models to spend more time working through problems, using techniques similar to those in successful AI models like OpenAI's o1, o3, Deepseek's R1, and Google's Gemini.
  • RLSP works in three stages: learning from examples of human or AI reasoning, being rewarded for exploring different problem-solving approaches, and checking answers for accuracy and to prevent shortcuts. When applied to Llama models and the Qwen2.5-32B-Instruct model, RLSP improved scores on math datasets by 23% and 10%, respectively.
  • The most notable finding is not just better test scores, but how the models learn to solve problems. Even without specific training examples, the models developed useful behaviors like backtracking, exploring multiple solutions, and double-checking their answers when given small rewards for exploration. The researchers believe this works because "chain-of-thought" reasoning gives models more computational power to solve problems, and the reward system encourages them to show their work and generate new examples of step-by-step reasoning.