
Meta researchers have investigated whether reinforcement learning can improve the reasoning ability of large language models.

The researchers compared different algorithms, including Proximal Policy Optimization (PPO) and Expert Iteration (EI), to see how effectively each improves the reasoning ability of language models.

The core idea is that models can generate their own training data through RL fine-tuning. The rewards guide the models towards correct answers, while the exploration enabled by RL is meant to ensure that the models don't just learn the most obvious solutions but also develop creative and diverse approaches - at least that's the idea. Projects like DeepMind's AlphaZero or Meta's CICERO have shown that RL can be a powerful tool for this.
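To make the mechanism concrete, the following Python sketch shows what such an RL fine-tuning loop can look like in schematic form. The callables sample_fn, reward_fn, and update_fn are hypothetical placeholders for generation, answer checking, and a policy-gradient update (PPO adds clipping and a KL penalty on top of this); the sketch is an illustration, not the paper's implementation.

```python
# Schematic RL fine-tuning loop: the model samples its own solutions and is
# nudged towards the ones the reward function approves of.
# sample_fn, reward_fn and update_fn are hypothetical placeholders.

def rl_finetune(sample_fn, reward_fn, update_fn, prompts, steps=1000):
    for step in range(steps):
        prompt = prompts[step % len(prompts)]
        # Exploration: sample a reasoning trace instead of greedy decoding,
        # so the model can stumble onto non-obvious solution paths.
        trace = sample_fn(prompt)
        # Sparse reward: typically 1.0 if the final answer is correct, else 0.0.
        reward = reward_fn(prompt, trace)
        # Reinforce the sampled trace in proportion to its reward
        # (PPO would add clipping and a KL penalty here).
        update_fn(prompt, trace, reward)
```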

Expert Iteration proved particularly effective in Meta's experiments. In this method, an initial expert model is applied repeatedly to a training set to generate a series of outputs, which are then used to further train the model. Surprisingly, Expert Iteration was almost as efficient as more complex algorithms such as PPO.
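The procedure can be sketched in a few lines of Python. Here, sample_fn, is_correct, and finetune_fn are hypothetical placeholders for generating candidate solutions, checking answers, and supervised fine-tuning; this is a sketch of the general recipe, not Meta's actual code.

```python
# Expert Iteration sketch: sample solutions, keep the correct ones,
# fine-tune on them, and repeat with the improved model.
# sample_fn, is_correct and finetune_fn are hypothetical placeholders.

def expert_iteration(model, prompts, sample_fn, is_correct, finetune_fn,
                     n_samples=8, n_rounds=4):
    for _ in range(n_rounds):
        accepted = []
        for prompt in prompts:
            # The current "expert" is applied several times to each problem.
            candidates = [sample_fn(model, prompt) for _ in range(n_samples)]
            # The reward acts as a filter: keep only outputs whose final answer checks out.
            accepted += [(prompt, c) for c in candidates if is_correct(prompt, c)]
        # Fine-tune on the model's own filtered outputs to obtain the next expert.
        model = finetune_fn(model, accepted)
    return model
```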


Reinforcement learning helps - but has its limits

A key finding of the work is that RL fine-tuning shrank the performance gap between plain pre-trained models and models that had additionally been fine-tuned on reasoning data (SFT data). After a few training iterations, the RL-trained models outperformed the fine-tuned models by almost 10 percent.

Interestingly, none of the RL algorithms benefited significantly from denser rewards, i.e., feedback on individual reasoning steps. The team concludes that focusing too strongly on specific rewards can limit the variety of solutions the model explores.
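The difference between the two reward schemes can be illustrated with a small, hypothetical example; check_answer and check_step are placeholder verifiers introduced only for this sketch.

```python
# Sparse (outcome-only) vs. dense (step-level) rewards for a reasoning trace.
# check_answer and check_step are hypothetical verifiers.

def sparse_reward(problem, steps, final_answer, check_answer):
    # One scalar for the whole solution: was the final answer right?
    return 1.0 if check_answer(problem, final_answer) else 0.0

def dense_rewards(problem, steps, final_answer, check_step):
    # Feedback on each intermediate reasoning step, not just the outcome.
    return [1.0 if check_step(problem, s) else 0.0 for s in steps]
```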

After a certain number of RL training iterations, the models' performance stopped improving. The team concludes that while pre-trained models provide a good starting point for exploration, the RL methods tested do not enable significant exploration beyond the pre-training and SFT data.

Thus, one of the main limitations to further improving the logical capabilities of language models is the lack of exploration. Since the models do not explore significantly beyond what they already know from the pre-training phase during RL training, the discovery of new techniques is crucial for progress in the reasoning ability of language models. Some candidate approaches already exist, such as Tree of Thoughts, XOT, or combining language models with evolutionary algorithms. OpenAI is also likely to explore such methods with Q*.

Summary
  • Meta researchers have studied reinforcement learning (RL) to improve the reasoning ability of large language models. They compared algorithms such as Proximal Policy Optimization (PPO) and Expert Iteration (EI).
  • Expert Iteration proved to be particularly effective. After several training iterations, the models trained with the RL methods outperformed the fine-tuned models by almost 10 percent, after which performance stopped improving, marking the limit of the tested methods.
  • According to the team, one of the main limitations for further improving the logical capabilities of language models is the lack of exploration. New techniques such as Tree of Thoughts, XOT, or the combination of language models with evolutionary algorithms could be crucial for progress in the reasoning capabilities of language models.