A new study from Microsoft and Salesforce finds that even state-of-the-art AI language models become dramatically less reliable as conversations get longer and users reveal their requirements step by step. On average, the systems' performance dropped by 39 percent in these scenarios.

To simulate how people actually interact with AI assistants, the researchers created a method called "sharding." Instead of giving the model everything up front, they broke each task into smaller pieces—mirroring the way users typically specify what they want over the course of a conversation.
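To make the setup concrete, here is a minimal sketch of how such a sharded conversation could be simulated. The task text, the shards, and the use of the OpenAI chat API are illustrative assumptions, not the researchers' actual harness.

```python
# Minimal sketch of a "sharded" conversation: the full task is split into
# pieces and revealed one turn at a time instead of all at once.
# The shards, task, and model choice below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The same task a user might state all at once, revealed step by step instead.
shards = [
    "I need a Python function that works with a CSV file.",
    "It should compute the average of the 'price' column.",
    "Please drop empty rows before computing anything.",
    "Round the result to two decimals.",
]

messages = []
for shard in shards:
    messages.append({"role": "user", "content": shard})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})

print(answer)  # the model's final attempt after all shards have been revealed
```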

Figure: Comparison of complete and sharded instruction examples for six task types, with sources and evaluation metrics. Each experiment involved 90 to 120 instructions, broken down into smaller sub-tasks drawn from high-quality datasets. | Image: Laban et al.

When the AI models had to handle these step-by-step instructions, their accuracy dropped from about 90 percent to just 51 percent. This steep decline was seen across all 15 models tested, from smaller open-source models like Llama-3.1-8B to big commercial systems like GPT-4o.

Even the top-tier models in the study—Claude 3.7 Sonnet, Gemini 2.5 Pro, and GPT-4.1—performed 30 to 40 percent worse in multi-round conversations compared to single-turn prompts. The models also became much less consistent, with results varying far more than in one-shot interactions.

The researchers identified four key problems: models often jump to a solution before they have all the necessary details, rely too heavily on their own earlier (and sometimes wrong) answers, neglect information shared in the middle of the conversation, and produce overly long responses that introduce assumptions about details the user never specified.

Figure: LLMs show lower aptitude and much higher unreliability in multi-turn conversations than in single-turn scenarios; they often get "lost" in multi-step, underspecified exchanges, leading to a significant drop in performance. | Image: Laban et al.

Technical tweaks don’t fix the problem

The team tried several technical fixes to improve reliability, such as lowering the model’s temperature setting (which controls randomness) and having an agent repeat user instructions. None of these changes made a meaningful difference in performance.
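As an illustration of how small this tweak is: continuing the sharded-conversation sketch above, lowering the temperature is just one extra parameter on the same API call. The value used here is an assumption, and the study's point is that this does not rescue multi-turn reliability.

```python
# Continuation of the sharded-conversation sketch above.
# temperature=0.0 makes sampling close to deterministic, yet the study
# reports that this does not meaningfully reduce the multi-turn drop.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    temperature=0.0,  # default is 1.0; lower values reduce randomness
)
```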

Changing how much detail was given at each step didn't help either. The only thing that reliably worked was giving the AI all the information right at the beginning.

The performance drop had two components: the models' basic skill (aptitude) fell only slightly, by about 16 percent, while unreliability soared by 112 percent. In single-turn tasks, models with higher aptitude were usually also more reliable. In multi-turn chats, however, all models were similarly unreliable regardless of their baseline skill, and results for the same task could swing by as much as 50 percentage points between the best and worst run.
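To illustrate the decomposition, here is a small sketch that splits repeated runs of one task into aptitude and unreliability. Treating the 90th and 10th percentiles as best and worst case is my reading of the paper's metric and should be taken as an assumption.

```python
import numpy as np

def aptitude_and_unreliability(scores):
    """Split repeated-run scores for one task into aptitude and unreliability.

    Assumption: aptitude = best-case performance (90th percentile of runs),
    unreliability = spread between best and worst case (90th minus 10th percentile).
    """
    scores = np.asarray(scores, dtype=float)
    best_case = np.percentile(scores, 90)
    worst_case = np.percentile(scores, 10)
    return best_case, best_case - worst_case

# Ten simulated runs of the same sharded task, scored from 0 to 100.
runs = [95, 40, 88, 52, 90, 45, 85, 60, 92, 48]
print(aptitude_and_unreliability(runs))  # high aptitude, but a ~50-point spread
```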

What users and developers can do

Based on these results, the researchers suggest two practical strategies: If you get stuck, start a new conversation instead of trying to fix one that's gone off track. And at the end of a session, ask for a summary of all requirements and use that summary as the starting point for a new chat.
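Continuing the illustrative sketch from above, the second tip could look like this in code. The prompt wording is an assumption, not taken from the paper.

```python
# Ask the current (possibly derailed) conversation for a consolidated brief...
summary_request = {
    "role": "user",
    "content": "Summarize all of my requirements so far as one complete, self-contained brief.",
}
summary = client.chat.completions.create(
    model="gpt-4o",
    messages=messages + [summary_request],
).choices[0].message.content

# ...then start a fresh conversation with everything stated up front,
# which is the one setup the study found consistently avoids the drop.
fresh_messages = [{"role": "user", "content": summary}]
fresh_answer = client.chat.completions.create(
    model="gpt-4o",
    messages=fresh_messages,
).choices[0].message.content
print(fresh_answer)
```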

The researchers say that AI developers should put much more emphasis on reliability in multi-turn conversations. Future models should be able to deliver consistently good results even when the instructions are incomplete—without relying on special prompting tricks or constant temperature adjustments. Reliability matters just as much as raw performance, especially for real-world AI assistants, where conversations tend to be step-by-step and user needs can change along the way.

Summary
  • A study by Microsoft and Salesforce found that AI language models perform much less reliably in conversations that require multiple steps and increasingly precise instructions, with their success rate falling from about 90 percent to 51 percent on average.
  • The researchers traced this drop in performance to four main issues: suggesting solutions too early, depending on wrong earlier answers, ignoring information shared midway, and using overly detailed responses that include incorrect assumptions.
  • Technical tweaks like lowering the temperature setting or repeating user instructions made little difference; only sharing all required information at the very start of the conversation prevented large declines in the models' performance.