
Large language models (LLMs) can make good decisions in theory, but in practice, they often fall short.

A recent preprint study from JKU Linz and Google DeepMind looks at why LLMs struggle with decision-making tasks like multi-armed bandits and tic-tac-toe. In experiments using text-based versions of these problems, the researchers found that models often made poor decisions, even when they had the information needed to choose correctly.

The team ran 50-step decision-making tests using different versions of these problems, with setups that included five, ten, or twenty possible actions and varying levels of randomness. They identified three core issues: greediness, frequency bias, and a disconnect between what the model knows and what it actually does, a phenomenon referred to as the "knowing-doing gap."
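
To make the setup concrete, here is a minimal sketch of what such a text-based bandit task could look like. The prompt wording, Gaussian rewards, and default values are illustrative assumptions, not the paper's exact protocol.

```python
import random

class TextBandit:
    """Minimal sketch of a multi-armed bandit rendered as text for an LLM.
    The reward distribution and prompt wording are assumptions for
    illustration, not the setup used in the paper."""

    def __init__(self, n_arms=10, noise=1.0, seed=0):
        self.rng = random.Random(seed)
        self.means = [self.rng.uniform(0, 1) for _ in range(n_arms)]
        self.noise = noise
        self.history = []  # (action, reward) pairs shown back to the model

    def prompt(self):
        n = len(self.means)
        lines = [f"You can press one of {n} buttons, numbered 0 to {n - 1}."]
        for action, reward in self.history:
            lines.append(f"You pressed button {action} and received reward {reward:.2f}.")
        lines.append("Which button do you press next? Answer with a single number.")
        return "\n".join(lines)

    def step(self, action):
        reward = self.rng.gauss(self.means[action], self.noise)
        self.history.append((action, reward))
        return reward
```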

The study focused on Google's Gemma 2 model family in three sizes: 2B, 9B, and 27B parameters. Whether the findings apply to larger "frontier models" remains an open question.

Stuck in routines

LLMs tend to latch onto early actions that seem promising and then stop exploring other options. In tests with ten possible actions, even the largest models only tried about two-thirds of them. Without chain-of-thought (CoT) reasoning, that number dropped even further.

Smaller models showed another problem: they mostly chose whichever action had appeared most often in the context so far, even when those actions didn't lead to success. This frequency bias showed up in 96% of cases for the 2-billion-parameter model once an action had been repeated several times. Larger models were less likely to make this mistake, but they were even more prone to greedy behavior.
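
The paper's exact metric definitions aren't reproduced here, but rough proxies for both failure modes can be computed from an interaction log, for example:

```python
from collections import Counter

def greedy_rate(actions, rewards):
    """Fraction of steps where the model picked the action with the highest
    observed mean reward so far. A rough greediness proxy; the paper's exact
    definition may differ."""
    totals, counts, greedy = Counter(), Counter(), 0
    for t, (a, r) in enumerate(zip(actions, rewards)):
        if t > 0:
            means = {x: totals[x] / counts[x] for x in counts}
            if a == max(means, key=means.get):
                greedy += 1
        totals[a] += r
        counts[a] += 1
    return greedy / max(len(actions) - 1, 1)

def frequency_bias_rate(actions):
    """Fraction of steps where the model simply repeated the action that had
    appeared most often so far, regardless of reward."""
    counts, repeats = Counter(), 0
    for t, a in enumerate(actions):
        if t > 0 and a == counts.most_common(1)[0][0]:
            repeats += 1
        counts[a] += 1
    return repeats / max(len(actions) - 1, 1)
```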

Knowing what's best — and doing something else

Another major issue was the "knowing-doing gap." In one experiment, the models used the Upper Confidence Bound (UCB) algorithm to identify the best possible action and got it right 87% of the time. But in 58% of those cases, they still chose a different option, usually falling back on an action that had worked before.
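
For reference, the standard UCB rule picks the arm whose mean reward plus an exploration term is highest, where the term shrinks the more often an arm has been tried. A minimal sketch (the exploration constant c is a free parameter and may not match the paper's setup):

```python
import math

def ucb_action(counts, sums, t, c=2.0):
    """Standard UCB rule: best observed mean reward plus an exploration term
    that shrinks as an arm gets tried more often."""
    def score(arm):
        if counts[arm] == 0:
            return float("inf")  # untried arms always get one pull
        mean = sums[arm] / counts[arm]
        return mean + c * math.sqrt(math.log(t) / counts[arm])
    return max(range(len(counts)), key=score)
```

Here counts[a] is how often arm a has been pulled, sums[a] is the total reward it has returned, and t is the total number of pulls so far.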

This kind of disconnect is familiar to anyone who has worked with LLMs: a model can explain its own mistake, then turn around and make it again. To help close the gap, the researchers applied reinforcement learning fine-tuning (RLFT), training the models to generate their own explanations (CoT rationales) and learn from them.
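
As a rough intuition for what RLFT means here, the sketch below shows a heavily simplified REINFORCE-style update on self-generated rationales; it is not the paper's exact objective or rollout scheme. `parse_action` is a hypothetical helper that pulls the chosen action out of the generated text, and `env` could be an environment like the bandit sketch above.

```python
import torch

def rlft_step(model, tokenizer, optimizer, prompt, env):
    """One REINFORCE-style update: sample a chain-of-thought rationale plus an
    action, get the environment reward, and reinforce the sampled tokens."""
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]

    # Sample a rationale + action, e.g. "... therefore I press button 3"
    seq = model.generate(**inputs, max_new_tokens=128, do_sample=True)
    completion = seq[0][prompt_len:]
    text = tokenizer.decode(completion, skip_special_tokens=True)

    action = parse_action(text)   # hypothetical helper: extract the chosen action
    reward = env.step(action)     # environment returns a scalar reward

    # Log-probability of the sampled completion under the current policy
    logits = model(seq).logits[:, prompt_len - 1:-1]
    logprobs = torch.log_softmax(logits, dim=-1)
    completion_logprob = logprobs.gather(-1, completion[None, :, None]).sum()

    # REINFORCE: scale the likelihood of the sampled completion by its reward
    loss = -reward * completion_logprob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```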

After 30,000 training steps, overall performance improved. The smallest model, Gemma 2 2B, explored 12% more actions and made fewer mistakes. In a tic-tac-toe test, its win rate against a random opponent jumped from 15% to 75%. The model also held its own against a stronger opponent based on Monte Carlo Tree Search, the search technique also used in AlphaZero. In that case, it managed to force a draw, but only when it had access to contextual information about which actions were allowed.

Exploration is still a weak spot

Before any optimization, the researchers found that the smallest model explored just 40% of the available options in a ten-action setup. Larger models did better, covering around 65%, but without chain-of-thought (CoT) reasoning, that number dropped to only 25%. When the number of possible actions increased to 20, even the largest models explored just 45%, and exploration usually stalled after about ten steps.

Training helped somewhat, but the models still showed a reluctance to try unfamiliar actions. To address this, the researchers tested several methods to boost exploration. These included adding randomness early on, rewarding new actions with an exploration bonus, and using self-correction strategies.

The simplest method worked best: force the model to try every possible action once at the beginning. This "try-all" approach brought performance close to optimal. Adding a reward bonus for each newly attempted action also proved effective, raising action coverage from 50% to 70%.
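
Both interventions fit naturally into the decision loop. The sketch below assumes a hypothetical `agent` interface with `choose` and `observe` methods and uses illustrative values for the step count and bonus:

```python
def run_episode(agent, env, n_arms, steps=50, bonus=1.0):
    """Decision loop with both interventions: a forced try-all phase and a
    reward bonus for newly attempted actions. Values are illustrative."""
    seen = set()
    for t in range(steps):
        if t < n_arms:
            action = t  # try-all phase: force each action once up front
        else:
            action = agent.choose(env.prompt())  # hypothetical agent interface
        reward = env.step(action)
        if action not in seen:
            reward += bonus  # exploration bonus for first-time actions
            seen.add(action)
        agent.observe(action, reward)  # hypothetical agent interface
```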

More tokens, better decisions

The experiments also highlighted just how crucial chain-of-thought reasoning is. Without it, even follow-up training had minimal effect. Another key variable was "thinking time" — the number of tokens the model was allowed to use to reason through a decision. More tokens led to better results, though at the cost of higher computational demands.
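
In practice, that "thinking time" is simply the generation budget. A sketch of how one might probe the effect, assuming the Hugging Face checkpoint name for Gemma 2 2B and an illustrative prompt:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: the checkpoint name and prompt are assumptions,
# not the paper's exact setup.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it")

prompt = ("You can press one of 10 buttons. Button 3 last gave reward 0.8, "
          "button 7 gave 0.2. Think step by step, then name the button you press.")
inputs = tokenizer(prompt, return_tensors="pt")

for budget in (32, 256, 1024):  # larger budgets leave more room for CoT reasoning
    out = model.generate(**inputs, max_new_tokens=budget, do_sample=True)
    print(budget, tokenizer.decode(out[0], skip_special_tokens=True))
```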

This combination of CoT training and extended token budgets underpins the progress of today's more advanced reasoning models in areas like coding and math.

Summary
  • Researchers at JKU Linz and Google DeepMind have found that language models used for decision-making tend to act too greedily, favor frequently seen choices, and struggle to translate knowledge into action.
  • Training with reinforcement learning, explicit reasoning, and specific rewards helps these models consider more options, make fewer errors, and perform better in tasks like tic-tac-toe.
  • Even with these improvements, models still find it hard to try new strategies; mandatory exploration and larger token budgets for reasoning can boost performance further.