
Large language models (LLMs) can make good decisions in theory, but in practice, they often fall short.

A recent preprint study from JKU Linz and Google DeepMind looks at why LLMs struggle with decision-making tasks like multi-armed bandits and tic-tac-toe. In experiments using text-based versions of these problems, the researchers found that models often made poor decisions, even when they had the information needed to choose correctly.

The team ran 50-step decision-making tests using different versions of these problems, with setups that included five, ten, or twenty possible actions and varying levels of randomness. They identified three core issues: greediness, frequency bias, and a disconnect between what the model knows and what it actually does, a phenomenon referred to as the "knowing-doing gap."
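
To make the setup concrete, here is a minimal sketch of what such a text-based bandit task could look like. The prompt wording, Gaussian rewards, and default values are illustrative assumptions, not the paper's exact protocol.

```python
import random

class TextBandit:
    """Minimal sketch of a multi-armed bandit rendered as text for an LLM.
    The reward distribution and prompt wording are assumptions for
    illustration, not the setup used in the paper."""

    def __init__(self, n_arms=10, noise=1.0, seed=0):
        self.rng = random.Random(seed)
        self.means = [self.rng.uniform(0, 1) for _ in range(n_arms)]
        self.noise = noise
        self.history = []  # (action, reward) pairs shown back to the model

    def prompt(self):
        n = len(self.means)
        lines = [f"You can press one of {n} buttons, numbered 0 to {n - 1}."]
        for action, reward in self.history:
            lines.append(f"You pressed button {action} and received reward {reward:.2f}.")
        lines.append("Which button do you press next? Answer with a single number.")
        return "\n".join(lines)

    def step(self, action):
        reward = self.rng.gauss(self.means[action], self.noise)
        self.history.append((action, reward))
        return reward
```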

The study focused on Google's Gemma 2 model family in three sizes: 2B, 9B, and 27B parameters. Whether the findings apply to larger "frontier models" remains an open question.

Stuck in routines

LLMs tend to latch onto early actions that seem promising and then stop exploring other options. In tests with ten possible actions, even the largest models only tried about two-thirds of them. Without chain-of-thought (CoT) reasoning, that number dropped even further.

Smaller models showed another problem: they mostly chose whichever action had appeared most often in the context so far, even when those actions didn't lead to success. This frequency bias showed up in 96% of cases for the 2-billion-parameter model once an action had been repeated several times. Larger models were less likely to make this mistake, but they were even more prone to greedy behavior.
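
The paper's exact metric definitions aren't reproduced here, but rough proxies for both failure modes can be computed from an interaction log, for example:

```python
from collections import Counter

def greedy_rate(actions, rewards):
    """Fraction of steps where the model picked the action with the highest
    observed mean reward so far. A rough greediness proxy; the paper's exact
    definition may differ."""
    totals, counts, greedy = Counter(), Counter(), 0
    for t, (a, r) in enumerate(zip(actions, rewards)):
        if t > 0:
            means = {x: totals[x] / counts[x] for x in counts}
            if a == max(means, key=means.get):
                greedy += 1
        totals[a] += r
        counts[a] += 1
    return greedy / max(len(actions) - 1, 1)

def frequency_bias_rate(actions):
    """Fraction of steps where the model simply repeated the action that had
    appeared most often so far, regardless of reward."""
    counts, repeats = Counter(), 0
    for t, a in enumerate(actions):
        if t > 0 and a == counts.most_common(1)[0][0]:
            repeats += 1
        counts[a] += 1
    return repeats / max(len(actions) - 1, 1)
```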

Knowing what's best — and doing something else

Another major issue was the "knowing-doing gap." In one experiment, the models used the Upper Confidence Bound (UCB) algorithm to identify the best possible action and got it right 87% of the time. But in 58% of those cases, they still chose a different option, usually falling back on an action that had worked before.
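
For reference, the standard UCB rule picks the arm whose mean reward plus an exploration term is highest, where the term shrinks the more often an arm has been tried. A minimal sketch (the exploration constant c is a free parameter and may not match the paper's setup):

```python
import math

def ucb_action(counts, sums, t, c=2.0):
    """Standard UCB rule: best observed mean reward plus an exploration term
    that shrinks as an arm gets tried more often."""
    def score(arm):
        if counts[arm] == 0:
            return float("inf")  # untried arms always get one pull
        mean = sums[arm] / counts[arm]
        return mean + c * math.sqrt(math.log(t) / counts[arm])
    return max(range(len(counts)), key=score)
```

Here counts[a] is how often arm a has been pulled, sums[a] is the total reward it has returned, and t is the total number of pulls so far.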

This kind of disconnect is familiar to anyone who has worked with LLMs: a model can explain its own mistake, then turn around and make it again. To help close the gap, the researchers applied reinforcement learning fine-tuning (RLFT), training the models to generate their own explanations (CoT rationales) and learn from them.
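
As a rough intuition for what RLFT means here, the sketch below shows a heavily simplified REINFORCE-style update on self-generated rationales; it is not the paper's exact objective or rollout scheme. `parse_action` is a hypothetical helper that pulls the chosen action out of the generated text, and `env` could be an environment like the bandit sketch above.

```python
import torch

def rlft_step(model, tokenizer, optimizer, prompt, env):
    """One REINFORCE-style update: sample a chain-of-thought rationale plus an
    action, get the environment reward, and reinforce the sampled tokens."""
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]

    # Sample a rationale + action, e.g. "... therefore I press button 3"
    seq = model.generate(**inputs, max_new_tokens=128, do_sample=True)
    completion = seq[0][prompt_len:]
    text = tokenizer.decode(completion, skip_special_tokens=True)

    action = parse_action(text)   # hypothetical helper: extract the chosen action
    reward = env.step(action)     # environment returns a scalar reward

    # Log-probability of the sampled completion under the current policy
    logits = model(seq).logits[:, prompt_len - 1:-1]
    logprobs = torch.log_softmax(logits, dim=-1)
    completion_logprob = logprobs.gather(-1, completion[None, :, None]).sum()

    # REINFORCE: scale the likelihood of the sampled completion by its reward
    loss = -reward * completion_logprob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```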

After 30,000 training steps, overall performance improved. The smallest model, Gemma 2 2B, explored 12% more actions and made fewer mistakes. In a tic-tac-toe test, its win rate against a random opponent jumped from 15% to 75%. The model also held its own against a stronger opponent based on Monte Carlo Tree Search, the search technique also used in AlphaZero. In that case, it managed to force a draw, but only when it had access to contextual information about which actions were allowed.

Exploration is still a weak spot

Before any optimization, the researchers found that the smallest model explored just 40% of the available options in a ten-action setup. Larger models did better, covering around 65%, but without chain-of-thought (CoT) reasoning, that number dropped to only 25%. When the number of possible actions increased to 20, even the largest models explored just 45%, and exploration usually stalled after about ten steps.

Training helped somewhat, but the models still showed a reluctance to try unfamiliar actions. To address this, the researchers tested several methods to boost exploration. These included adding randomness early on, rewarding new actions with an exploration bonus, and using self-correction strategies.

The simplest method worked best: force the model to try every possible action once at the beginning. This "try-all" approach brought performance close to optimal. Adding a reward bonus for each newly attempted action also proved effective, raising action coverage from 50% to 70%.
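
Both interventions fit naturally into the decision loop. The sketch below assumes a hypothetical `agent` interface with `choose` and `observe` methods and uses illustrative values for the step count and bonus:

```python
def run_episode(agent, env, n_arms, steps=50, bonus=1.0):
    """Decision loop with both interventions: a forced try-all phase and a
    reward bonus for newly attempted actions. Values are illustrative."""
    seen = set()
    for t in range(steps):
        if t < n_arms:
            action = t  # try-all phase: force each action once up front
        else:
            action = agent.choose(env.prompt())  # hypothetical agent interface
        reward = env.step(action)
        if action not in seen:
            reward += bonus  # exploration bonus for first-time actions
            seen.add(action)
        agent.observe(action, reward)  # hypothetical agent interface
```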

More tokens, better decisions

The experiments also highlighted just how crucial chain-of-thought reasoning is. Without it, even follow-up training had minimal effect. Another key variable was "thinking time" — the number of tokens the model was allowed to use to reason through a decision. More tokens led to better results, though at the cost of higher computational demands.
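
In practice, that "thinking time" is simply the generation budget. A sketch of how one might probe the effect, assuming the Hugging Face checkpoint name for Gemma 2 2B and an illustrative prompt:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: the checkpoint name and prompt are assumptions,
# not the paper's exact setup.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it")

prompt = ("You can press one of 10 buttons. Button 3 last gave reward 0.8, "
          "button 7 gave 0.2. Think step by step, then name the button you press.")
inputs = tokenizer(prompt, return_tensors="pt")

for budget in (32, 256, 1024):  # larger budgets leave more room for CoT reasoning
    out = model.generate(**inputs, max_new_tokens=budget, do_sample=True)
    print(budget, tokenizer.decode(out[0], skip_special_tokens=True))
```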

This combination of CoT training and extended token budgets underpins the progress of today's more advanced reasoning models in areas like coding and math.

Summary
  • Researchers at JKU Linz and Google DeepMind have found that language models used for decision-making tend to act too greedily, favor frequently seen choices, and struggle to translate knowledge into action.
  • Training with reinforcement learning, explicit reasoning, and specific rewards helps these models consider more options, make fewer errors, and perform better in tasks like tic-tac-toe.
  • Even with these improvements, models still find it hard to try new strategies; mandatory exploration and larger token budgets for reasoning can boost performance further.