Researchers investigate whether large language models can effectively exhibit exploratory behavior, considered a key element for useful AI agents.
A team of researchers from Microsoft Research and Carnegie Mellon University investigated the ability of large language models to perform exploration, a key aspect of reinforcement learning and decision-making. They found that common models such as GPT-3.5, GPT-4, and Llama 2 lack robust exploration capabilities without significant external intervention.
In this work, the language models act as decision-making agents in simple multi-armed bandit (MAB) environments that are presented entirely within their context window, i.e. in context. Their task in these scenarios is to balance exploration and exploitation. Exploration means gathering information to evaluate the alternatives and reduce uncertainty, by making decisions that may be suboptimal in the short term but yield valuable data in the long run. Exploitation means choosing the option that currently looks best, based on the information gathered so far, in order to maximize immediate reward. Both capabilities are important for the practical use of language-model-based AI agents.
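How this trade-off plays out can be illustrated with a classic epsilon-greedy agent: with a small probability it explores a random option, otherwise it exploits the option with the best observed average reward. The following Python sketch is purely illustrative and not taken from the paper; all names and parameters are made up.

```python
import random

def epsilon_greedy_bandit(true_means, rounds=1000, epsilon=0.1, seed=0):
    """Illustrative epsilon-greedy agent on a Bernoulli multi-armed bandit."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    pulls = [0] * n_arms           # how often each arm has been chosen
    total_reward = [0.0] * n_arms  # cumulative reward observed per arm

    for t in range(rounds):
        if t < n_arms:
            arm = t                                   # pull each arm once to initialize
        elif rng.random() < epsilon:
            arm = rng.randrange(n_arms)               # explore: try a random arm
        else:
            averages = [total_reward[a] / pulls[a] for a in range(n_arms)]
            arm = max(range(n_arms), key=averages.__getitem__)  # exploit: best arm so far
        reward = 1.0 if rng.random() < true_means[arm] else 0.0  # Bernoulli reward draw
        pulls[arm] += 1
        total_reward[arm] += reward

    return pulls, total_reward

if __name__ == "__main__":
    pulls, rewards = epsilon_greedy_bandit([0.5, 0.4, 0.6])
    print("pulls per arm:", pulls)  # the agent should favor the arm with mean 0.6
```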
Specifically, the researchers examined whether and how well the models could balance these two core aspects of reinforcement learning in an environment that is described entirely within the prompt. The experiments covered various prompt configurations and evaluated the models' ability to navigate MAB environments without any additional training or intervention.
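The general shape of such an in-context experiment can be sketched as follows. This is not the paper's code; `query_llm` is a placeholder for whatever chat-completion API is used, and the prompt wording is assumed for illustration.

```python
import random

def render_prompt(n_arms, history):
    """Render the full interaction history into one prompt (illustrative wording)."""
    lines = [
        f"You are choosing between {n_arms} slot machines (arms 0 to {n_arms - 1}).",
        "Your goal is to maximize the total reward. Past rounds:",
    ]
    for t, (arm, reward) in enumerate(history):
        lines.append(f"Round {t}: you chose arm {arm} and received reward {reward}.")
    lines.append("Reply with only the number of the arm you choose next.")
    return "\n".join(lines)

def run_episode(query_llm, true_means, rounds=100, seed=0):
    """Let a language model act as the bandit agent, one prompt per round."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    history = []
    for _ in range(rounds):
        answer = query_llm(render_prompt(n_arms, history))    # e.g. a chat-completion call
        try:
            arm = int(answer.strip()) % n_arms                # crude parsing of the reply
        except ValueError:
            arm = rng.randrange(n_arms)                       # fall back to a random arm
        reward = 1 if rng.random() < true_means[arm] else 0   # Bernoulli reward draw
        history.append((arm, reward))
    return history
```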
GPT-4 best with cheat sheet - new methods needed, says team
In most cases, however, the models did not show robust exploration behavior: either they locked onto a suboptimal choice early on and never picked the best option again, or they spread their choices almost evenly across all options without ever ruling out the worst ones.
Only a single configuration of GPT-4 with a special prompt design showed successful exploration behavior comparable to two reference algorithms. This prompt provided the model with additional exploration cues, summarized the interaction history, and used chain-of-thought reasoning.
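A rough approximation of these three ingredients, per-option summary statistics instead of raw history, an explicit nudge toward exploration, and a request to reason step by step, might look like the sketch below; the wording is assumed and not the paper's exact prompt.

```python
def render_summarized_prompt(n_arms, pulls, mean_rewards):
    """Illustrative prompt with summarized history, an exploration hint, and a
    chain-of-thought instruction (not the paper's exact wording)."""
    lines = [
        f"You are choosing between {n_arms} slot machines to maximize total reward.",
        "Summary of what you have observed so far:",
    ]
    for arm in range(n_arms):
        lines.append(
            f"Arm {arm}: chosen {pulls[arm]} times, average reward {mean_rewards[arm]:.2f}."
        )
    lines += [
        # exploration hint
        "Keep in mind that arms you have tried only rarely may still turn out to be better.",
        # chain-of-thought instruction
        "Think step by step about which arm to pick, then give your final choice",
        "as a single arm number on the last line.",
    ]
    return "\n".join(lines)

# Example with made-up statistics:
print(render_summarized_prompt(3, pulls=[10, 2, 8], mean_rewards=[0.40, 0.50, 0.55]))
```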
According to the team, however, the results indicate that current language models lack the capabilities needed for complex decision-making without significant intervention and are therefore not yet well suited to power AI agents on their own. Simpler problems such as the multi-armed bandits tested here can be partially solved, but more demanding applications will likely require additional fine-tuning or specially curated datasets.
The team thus provides empirical backing for a phenomenon that can already be observed in practice: AI agent frameworks such as AutoGPT drew a great deal of attention at the start of the current AI wave, but such agents have rarely made it into productive use.
Companies like OpenAI have been working on better AI agents for some time now, and reinforcement learning, reportedly at the heart of the rumored Q* project, is likely to play an important role.