A systematic investigation reveals how reasoning models generate their long chains of thought. The results provide practical tips for optimizing training strategies.

The team at IN.AI, along with researchers from Tsinghua University and Carnegie Mellon University, has mapped out how AI models develop their ability to work through long chains of thought. Their systematic study used supervised fine-tuning (SFT) and reinforcement learning (RL) to identify the key factors behind this capability.

The research yielded four key insights. First, while SFT makes training more efficient and straightforward, it isn't essential, which supports what DeepSeek found with its R1-Zero model. The team tested this using Llama-3.1-8B and Qwen2.5-7B math models, training them on both long and short reasoning chains. They found that SFT on longer chains of thought not only performed better but also made subsequent RL improvements more effective.
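
To make the SFT step concrete, here is a minimal sketch of what fine-tuning a base model on long chain-of-thought traces can look like, using a plain next-token prediction loss. The model name matches one the study used, but the dataset fields, the example trace, and the hyperparameters are assumptions for illustration, not the researchers' actual setup.

```python
# Minimal SFT sketch: fine-tuning a base model on long chain-of-thought traces
# with a standard causal language modeling loss. Dataset fields, the example
# trace, and hyperparameters are illustrative assumptions.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B"  # one of the base models used in the study

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.train()

# Hypothetical training examples: a math problem paired with a long reasoning trace.
examples = [
    {
        "question": "What is 17 * 24?",
        "long_cot": "First compute 17 * 20 = 340, then 17 * 4 = 68. "
                    "Check: 340 + 68 = 408. The answer is 408.",
    },
]

def collate(batch):
    # Concatenate question and reasoning trace; train with next-token prediction.
    texts = [ex["question"] + "\n" + ex["long_cot"] + tokenizer.eos_token for ex in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=4096)
    enc["labels"] = enc["input_ids"].clone()
    return enc

loader = DataLoader(examples, batch_size=1, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for batch in loader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```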

Second, while more computing power during RL training tends to improve reasoning abilities, it's not guaranteed. The length of reasoning chains doesn't always grow steadily during RL training, making the right reward design crucial for consistent improvement.
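
The article doesn't spell out the exact reward the researchers settled on, so the following is only an illustrative sketch of the general idea: the signal has to reflect correctness while keeping chain-of-thought length within a budget so it neither collapses nor grows without bound. The shaped_reward function, the target_len budget, and the penalty weight are all assumptions, not the authors' formulation.

```python
# Illustrative reward shaping for RL on reasoning traces. This is NOT the study's
# reward function; it only sketches how correctness and response length can both
# enter the signal so chain-of-thought length stays under control during training.

def shaped_reward(is_correct: bool, num_tokens: int,
                  target_len: int = 4096, length_penalty: float = 0.25) -> float:
    """Scalar reward for one sampled response (hypothetical design)."""
    base = 1.0 if is_correct else 0.0
    # Penalize responses that exceed the token budget, scaled by the overshoot.
    overshoot = max(0, num_tokens - target_len) / target_len
    return base - length_penalty * min(overshoot, 1.0)

# Example: a correct 6,000-token answer against a 4,096-token budget.
print(shaped_reward(True, 6000))   # roughly 0.88
```

The broader point from the study is that without a carefully chosen reward, chain length and performance don't reliably improve as RL compute increases.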

Third, getting reliable reward signals at scale is key to successful RL training. The team explored using web-scraped data with imperfect solutions to scale up these signals. Testing with the WebInstruct dataset, they compared different verification methods and found that rule-based verification worked best when the data was filtered for shorter responses. Models trained on diverse, if somewhat noisy, data handled unusual cases especially well compared to models trained only on carefully verified data.
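
As a rough illustration of what rule-based verification over noisy web data can look like, the sketch below extracts a final numeric answer from a model response, compares it against a reference, and keeps only samples with short reference solutions. The extraction rule, the length threshold, and the function names are assumptions for illustration, not the study's implementation.

```python
# Sketch of a rule-based verifier for noisy, web-scraped QA data. Extraction rules,
# threshold, and function names are illustrative assumptions, not the study's code.
import re

def extract_final_answer(text: str) -> str | None:
    """Take the last number-like token in the response as the final answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def rule_based_reward(response: str, reference_answer: str) -> float:
    """1.0 if the extracted answer matches the reference exactly, otherwise 0.0."""
    predicted = extract_final_answer(response)
    return 1.0 if predicted is not None and predicted == reference_answer.strip() else 0.0

def keep_for_training(reference_solution: str, max_words: int = 64) -> bool:
    """Filter: keep only samples whose reference solution is short enough to match reliably."""
    return len(reference_solution.split()) <= max_words

print(rule_based_reward("340 + 68 = 408, so the answer is 408.", "408"))  # 1.0
```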

Fourth, while base models already contain core capabilities like error correction, using RL to apply these skills to complex tasks can require significant computing resources.

Larger models still seem to be important

The research suggests that some behaviors, like double-checking solutions, might be learned during pre-training, possibly from human discussions in online forums. RL seems to mainly help models recombine skills they already picked up during pre-training.

The team believes that model size remains the main constraint on developing more sophisticated reasoning abilities in smaller models. They're considering testing RL with larger base models in the future, though the necessary open-source infrastructure for such experiments is still developing.

Summary
  • Researchers from IN.AI, Tsinghua University, and Carnegie Mellon University have conducted experiments with supervised fine-tuning (SFT) and reinforcement learning (RL) to identify the key factors behind long chains of thought in reasoning models.
  • The study shows that SFT facilitates training and improves efficiency but is not essential. Reasoning abilities tend to grow with more compute during RL, but this growth is not guaranteed, and scaling verifiable reward signals is central to successful RL training.
  • The researchers hypothesize that core skills for complex reasoning, such as error correction, may be partially learned during pre-training and that RL primarily teaches the model to recombine them. They believe model size is the main factor limiting the development of complex reasoning skills in smaller models.