Meta researchers have investigated whether reinforcement learning can improve the reasoning ability of large language models.
They compared several algorithms, including Proximal Policy Optimization (PPO) and Expert Iteration (EI), to see which is most effective at this task.
The core idea is that models can generate their own training data through RL fine-tuning. Rewards guide the models toward correct answers, while the exploration enabled by RL is meant to ensure that they don't just learn the most obvious solutions but also develop creative and diverse approaches - at least that's the idea. Projects like DeepMind's AlphaZero or Meta's CICERO have shown how powerful RL can be for this.
Expert Iteration proved particularly effective in Meta's experiments. In this method, an initial expert model is applied repeatedly to a training set to generate a series of outputs; the outputs are filtered - typically keeping only those that reach a correct answer - and then used to further train the model, after which the process repeats. Surprisingly, Expert Iteration performed nearly as well as more complex algorithms such as PPO.
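To make the procedure concrete, here is a minimal, hypothetical sketch of such an Expert Iteration loop. The `generate`, `is_correct`, and `fine_tune` callables are placeholders standing in for an LLM sampling call, an answer checker, and a supervised fine-tuning step; this illustrates the general idea, not Meta's actual training code.

```python
# A minimal, hypothetical sketch of an Expert Iteration loop.
# The callables `generate`, `is_correct`, and `fine_tune` are assumptions
# supplied by the caller (LLM sampling, answer checking, supervised training);
# this is not Meta's actual implementation.
from typing import Callable, List, Tuple


def expert_iteration(
    model,
    problems: List[Tuple[str, str]],   # (prompt, reference answer) pairs
    generate: Callable,                # generate(model, prompt) -> candidate solution
    is_correct: Callable,              # is_correct(solution, reference) -> bool
    fine_tune: Callable,               # fine_tune(model, examples) -> updated model
    rounds: int = 4,
    samples_per_problem: int = 8,
):
    for _ in range(rounds):
        accepted = []
        for prompt, reference in problems:
            # Sample several candidate solutions from the current model.
            candidates = [generate(model, prompt) for _ in range(samples_per_problem)]
            # Keep only the candidates that reach a correct final answer
            # (a sparse, outcome-level reward acting as the filter).
            accepted += [(prompt, c) for c in candidates if is_correct(c, reference)]
        # Fine-tune the model on its own filtered outputs, then repeat.
        model = fine_tune(model, accepted)
    return model
```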
Reinforcement learning helps - but has its limits
A key finding of the work is that RL fine-tuning narrowed the performance gap between plain pre-trained models and models that had additionally been supervised fine-tuned on reasoning data (SFT data). After a few training iterations, the RL-trained models even outperformed the fine-tuned models by almost 10 percent.
Interestingly, none of the RL algorithms benefited significantly from denser rewards, i.e., feedback on individual reasoning steps rather than only on the final answer. The team concludes that overly specific reward signals can narrow the range of solutions the model explores.
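To illustrate the distinction, the sketch below contrasts a sparse, outcome-level reward with a denser, per-step reward of the kind described above. The scoring scheme is an assumption for illustration, not the reward models used in the study.

```python
# Illustrative contrast between sparse (outcome-only) and dense (per-step)
# rewards for a chain of reasoning steps. The scoring is an assumption for
# illustration, not the reward models used in the study.
from typing import List


def sparse_rewards(steps: List[str], final_answer: str, reference: str) -> List[float]:
    # Only the final step receives feedback: 1.0 for a correct answer, else 0.0.
    rewards = [0.0] * len(steps)
    if rewards:
        rewards[-1] = 1.0 if final_answer.strip() == reference.strip() else 0.0
    return rewards


def dense_rewards(steps: List[str], step_scores: List[float],
                  final_answer: str, reference: str) -> List[float]:
    # Every intermediate step gets its own score (e.g. from a process reward
    # model), plus a bonus on the final step for a correct answer.
    rewards = list(step_scores)
    if rewards:
        rewards[-1] += 1.0 if final_answer.strip() == reference.strip() else 0.0
    return rewards
```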
After a certain number of RL training iterations, the models' performance stopped improving. The team concludes that while pre-trained models provide a good starting point for exploration, the RL methods tested do not explore significantly beyond the pre-training/SFT data.
Thus, one of the main obstacles to further improving the reasoning capabilities of language models is this lack of exploration. Since the models do not venture significantly beyond what they already know from pre-training during the RL phase, discovering techniques that enable genuinely new exploration is crucial for progress. Some ideas already exist, such as Tree of Thoughts, XOT, or combining language models with evolutionary algorithms. OpenAI is also likely to explore such methods with Q*.
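As a rough illustration of what such a method might look like, the sketch below implements a much-simplified Tree-of-Thoughts-style search: each partial chain of reasoning is expanded into several continuations, the continuations are scored, and only the most promising branches are kept. The `propose` and `score` callables stand in for LLM calls and are assumptions; this is not the published Tree of Thoughts implementation.

```python
# A much-simplified sketch of a Tree-of-Thoughts-style search. The `propose`
# and `score` callables are placeholders for LLM calls; this is not the
# published Tree of Thoughts code.
from typing import Callable, List


def tree_of_thoughts(
    prompt: str,
    propose: Callable[[str], List[str]],  # candidate next reasoning steps for a partial solution
    score: Callable[[str], float],        # how promising a partial solution looks
    depth: int = 3,
    beam_width: int = 3,
) -> str:
    frontier = [prompt]
    for _ in range(depth):
        # Expand every partial solution in the frontier into several branches.
        candidates = [partial + "\n" + step
                      for partial in frontier
                      for step in propose(partial)]
        if not candidates:
            break
        # Keep only the highest-scoring branches (a beam search over thoughts).
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    # Return the single most promising reasoning chain found.
    return max(frontier, key=score)
```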