A systematic investigation reveals how reasoning models generate their long chains of thought. The results provide practical tips for optimizing training strategies.

The team at IN.AI, along with researchers from Tsinghua University and Carnegie Mellon University, has mapped out how AI models develop their ability to work through long chains of thought. Their systematic study used supervised fine-tuning (SFT) and reinforcement learning (RL) to identify the key factors behind this capability.

The research yielded four key insights. First, while SFT makes training more efficient and straightforward, it isn't essential, which supports what DeepSeek found with its R1-Zero model. The team tested this using Llama-3.1-8B and Qwen2.5-7B math models, training them on both long and short reasoning chains. They found that SFT on longer chains of thought not only performed better but also made subsequent RL improvements more effective.
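
To make the SFT step concrete, here is a minimal sketch of what fine-tuning a base model on long chain-of-thought traces can look like, using a plain next-token prediction loss. The model name matches one the study used, but the dataset fields, the example trace, and the hyperparameters are assumptions for illustration, not the researchers' actual setup.

```python
# Minimal SFT sketch: fine-tuning a base model on long chain-of-thought traces
# with a standard causal language modeling loss. Dataset fields, the example
# trace, and hyperparameters are illustrative assumptions.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B"  # one of the base models used in the study

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.train()

# Hypothetical training examples: a math problem paired with a long reasoning trace.
examples = [
    {
        "question": "What is 17 * 24?",
        "long_cot": "First compute 17 * 20 = 340, then 17 * 4 = 68. "
                    "Check: 340 + 68 = 408. The answer is 408.",
    },
]

def collate(batch):
    # Concatenate question and reasoning trace; train with next-token prediction.
    texts = [ex["question"] + "\n" + ex["long_cot"] + tokenizer.eos_token for ex in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=4096)
    enc["labels"] = enc["input_ids"].clone()
    return enc

loader = DataLoader(examples, batch_size=1, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for batch in loader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```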

Second, while more computing power during RL training tends to improve reasoning abilities, it's not guaranteed. The length of reasoning chains doesn't always grow steadily during RL training, making the right reward design crucial for consistent improvement.
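
The article doesn't spell out the exact reward the researchers settled on, so the following is only an illustrative sketch of the general idea: the signal has to reflect correctness while keeping chain-of-thought length within a budget so it neither collapses nor grows without bound. The shaped_reward function, the target_len budget, and the penalty weight are all assumptions, not the authors' formulation.

```python
# Illustrative reward shaping for RL on reasoning traces. This is NOT the study's
# reward function; it only sketches how correctness and response length can both
# enter the signal so chain-of-thought length stays under control during training.

def shaped_reward(is_correct: bool, num_tokens: int,
                  target_len: int = 4096, length_penalty: float = 0.25) -> float:
    """Scalar reward for one sampled response (hypothetical design)."""
    base = 1.0 if is_correct else 0.0
    # Penalize responses that exceed the token budget, scaled by the overshoot.
    overshoot = max(0, num_tokens - target_len) / target_len
    return base - length_penalty * min(overshoot, 1.0)

# Example: a correct 6,000-token answer against a 4,096-token budget.
print(shaped_reward(True, 6000))   # roughly 0.88
```

The broader point from the study is that without a carefully chosen reward, chain length and performance don't reliably improve as RL compute increases.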

Third, getting reliable reward signals at scale is key to successful RL training. The team explored using web-scraped data with imperfect solutions to scale up these signals. Testing with the WebInstruct dataset, they compared different verification methods and found that rule-based verification worked best when the data was filtered for shorter responses. Models trained on diverse, if somewhat noisy, data handled unusual cases especially well compared to models trained only on carefully verified data.
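
As a rough illustration of what rule-based verification over noisy web data can look like, the sketch below extracts a final numeric answer from a model response, compares it against a reference, and keeps only samples with short reference solutions. The extraction rule, the length threshold, and the function names are assumptions for illustration, not the study's implementation.

```python
# Sketch of a rule-based verifier for noisy, web-scraped QA data. Extraction rules,
# threshold, and function names are illustrative assumptions, not the study's code.
import re

def extract_final_answer(text: str) -> str | None:
    """Take the last number-like token in the response as the final answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def rule_based_reward(response: str, reference_answer: str) -> float:
    """1.0 if the extracted answer matches the reference exactly, otherwise 0.0."""
    predicted = extract_final_answer(response)
    return 1.0 if predicted is not None and predicted == reference_answer.strip() else 0.0

def keep_for_training(reference_solution: str, max_words: int = 64) -> bool:
    """Filter: keep only samples whose reference solution is short enough to match reliably."""
    return len(reference_solution.split()) <= max_words

print(rule_based_reward("340 + 68 = 408, so the answer is 408.", "408"))  # 1.0
```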

Fourth, while base models already contain core capabilities like error correction, using RL to apply these skills to complex tasks can require significant computing resources.

Larger models still seem to be important

The research suggests that some behaviors, like double-checking solutions, might be learned during pre-training, possibly from human discussions in online forums. RL seems to mainly help models recombine skills they already picked up during pre-training.

The team believes that model size remains the main constraint on developing more sophisticated reasoning abilities in smaller models. They're considering testing RL with larger base models in the future, though the necessary open-source infrastructure for such experiments is still developing.

Summary
  • Researchers from IN.AI, Tsinghua University, and Carnegie Mellon University have conducted experiments with supervised fine-tuning (SFT) and reinforcement learning (RL) to identify the key factors behind long chains of thought in reasoning models.
  • The study shows that SFT facilitates training and improves efficiency but is not essential. Reasoning abilities tend to grow with more compute during RL, but this growth is not guaranteed, and scaling verifiable reward signals is central to successful RL training.
  • The researchers hypothesize that core skills for complex reasoning, such as error correction, may be partially learned during pre-training and that RL primarily teaches the model to recombine them. They believe model size is the main factor limiting the development of complex reasoning skills in smaller models.