
Researchers investigate whether large language models can exhibit effective exploratory behavior, considered a key requirement for useful AI agents.

A team of researchers from Microsoft Research and Carnegie Mellon University investigated the ability of large language models to perform exploration, a key aspect of reinforcement learning and decision-making. They found that common models such as GPT-3.5, GPT-4, and Llama 2 lack robust exploration capabilities without significant external intervention.

In this work, the language models act as decision-making agents in simple multi-armed bandit (MAB) environments presented entirely within their attention window, i.e. in context. Their core task in these scenarios is to balance exploration and exploitation. Exploration here means gathering information to evaluate alternatives and reduce uncertainty, through decisions that may be suboptimal in the short term but yield valuable data in the long term. Exploitation means choosing the option that looks best given the information gathered so far, in order to maximize immediate reward. Both capabilities are important for the practical use of language-model-based AI agents.
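
To make the trade-off concrete, here is a minimal sketch, not taken from the paper, of a Bernoulli multi-armed bandit played by a classic epsilon-greedy agent: with probability epsilon it explores a random arm, otherwise it exploits the arm that currently looks best. All names and parameters are illustrative.

```python
import random

def run_epsilon_greedy(arm_probs, horizon=500, epsilon=0.1, seed=0):
    """Play a Bernoulli multi-armed bandit with an epsilon-greedy agent.

    With probability `epsilon` the agent explores (picks a random arm);
    otherwise it exploits the arm with the highest empirical mean reward.
    """
    rng = random.Random(seed)
    n_arms = len(arm_probs)
    counts = [0] * n_arms     # how often each arm was pulled
    sums = [0.0] * n_arms     # total reward observed per arm
    total_reward = 0.0
    for _ in range(horizon):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: random arm
        else:
            # exploit: arm with best empirical mean (unpulled arms first)
            means = [s / c if c else float("inf") for s, c in zip(sums, counts)]
            arm = max(range(n_arms), key=means.__getitem__)
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
    return total_reward, counts

reward, pulls = run_epsilon_greedy([0.4, 0.5, 0.6])
print(f"total reward: {reward}, pulls per arm: {pulls}")
```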

Specifically, the researchers examined whether and how well the models could balance these two core aspects of reinforcement learning in an environment that is fully described within the model prompt. The experiments covered different prompt configurations and evaluated the models' ability to navigate MAB environments without additional training or intervention.
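
"Fully described within the model prompt" means the environment's rules and the round-by-round history are rendered as text and resent to the model each turn. A rough sketch of such a prompt builder, assuming a simple bandit with 0/1 rewards (the study's actual templates differ):

```python
def bandit_prompt(history, n_arms):
    """Render a bandit episode as an in-context prompt.

    `history` is a list of (arm, reward) pairs. The wording is
    illustrative only, not the study's actual prompt.
    """
    lines = [
        f"You are choosing between {n_arms} slot machines (arms 0 to {n_arms - 1}).",
        "Each round you pick one arm and observe a reward of 0 or 1.",
        "Your goal is to maximize the total reward.",
        "History so far:",
    ]
    for t, (arm, reward) in enumerate(history, start=1):
        lines.append(f"Round {t}: arm {arm} -> reward {reward}")
    lines.append("Which arm do you pick next? Answer with the arm number only.")
    return "\n".join(lines)

print(bandit_prompt([(0, 1), (2, 0), (0, 1)], n_arms=3))
```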


GPT-4 best with cheat sheet - new methods needed, says team

In most cases, however, the models did not show robust exploration behavior: either they locked onto a single option and never picked the best one again, or they spread their choices almost uniformly across all options without ever ruling out the worst ones.
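
Both failure modes can be operationalized as simple checks over the sequence of choices. The hypothetical helpers below sketch the idea; the names and the `cutoff`/`slack` thresholds are illustrative, and the paper's own definitions differ in detail.

```python
def suffix_failure(choices, best_arm, cutoff=0.5):
    """Lock-in failure: the best arm is never chosen in the final part
    of the run. `choices` is the round-by-round list of arms picked."""
    start = int(len(choices) * cutoff)
    return best_arm not in choices[start:]

def uniform_failure(choices, n_arms, slack=0.25):
    """Uniform failure: every arm, including the worst, keeps being
    played at roughly the uniform rate of 1/n_arms."""
    share = 1 / n_arms
    return all(
        abs(choices.count(arm) / len(choices) - share) <= slack * share
        for arm in range(n_arms)
    )
```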

Only a single GPT-4 configuration with a specially designed prompt showed successful exploration behavior, comparable to that of two reference algorithms. This prompt gave the model an explicit hint that exploration was needed, summarized the interaction history instead of replaying it in full, and elicited chain-of-thought reasoning.
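
A loose sketch of what a prompt combining these three ingredients could look like; the function name and wording are assumptions, not the study's actual template:

```python
def summarized_history_prompt(counts, sums, n_arms):
    """Combine the three described ingredients: per-arm summary
    statistics instead of a raw transcript, an explicit exploration
    hint, and a chain-of-thought instruction."""
    lines = [
        f"You are choosing between {n_arms} arms to maximize total reward.",
        "Per-arm statistics so far:",
    ]
    for arm in range(n_arms):
        n = counts[arm]
        mean = sums[arm] / n if n else 0.0
        lines.append(f"Arm {arm}: pulled {n} times, average reward {mean:.2f}")
    lines += [
        "Arms with few pulls are still uncertain, so some exploration "
        "now may pay off later.",                           # exploration hint
        "Think step by step, then state the arm you pick.",  # chain of thought
    ]
    return "\n".join(lines)

print(summarized_history_prompt(counts=[5, 1, 0], sums=[3.0, 1.0, 0.0], n_arms=3))
```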

However, according to the team, the results indicate that language models lack the capabilities needed for complex decision-making without significant intervention, and are therefore not yet suitable as autonomous AI agents. Simpler problems, such as the multi-armed bandits tested here, can be partially solved, but more sophisticated applications will likely require additional fine-tuning or specialized datasets.

The team thus provides a theoretical underpinning for a phenomenon that can already be observed in practice: AI agent frameworks such as AutoGPT drew a great deal of attention at the start of the latest AI wave, but such agents have rarely made it into productive use.

Companies like OpenAI have been working on better AI agents for some time now, and the implementation of reinforcement learning in the Q* project is likely to play an important role.

Summary
  • Researchers at Microsoft Research and Carnegie Mellon University investigated whether common language models such as GPT-3.5, GPT-4, and Llama 2 are capable of effective exploratory behavior. This is important for reinforcement learning and thus for language model-based AI agents.
  • In most cases, the models did not show robust exploratory behavior. Only GPT-4, with a special prompt design that included additional exploration cues, a summary of interaction history, and chain-of-thought reasoning, showed successful exploration behavior.
  • The results suggest that language models lack the necessary capabilities for complex decision-making without significant intervention. More sophisticated applications would require additional fine-tuning or specialized data sets.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.