New Stanford study reveals when teaming up AI agents is worth the compute
Multi-agent AI systems are widely considered more capable than a single model working alone. A Stanford study shows their apparent advantage largely comes from using more compute. But there are important exceptions.
A popular approach in AI research right now is multi-agent systems: multiple AI models split up a task, debate each other, or cross-check results. The idea is that teamwork leads to better answers, especially for complex problems that require multiple reasoning steps.
Researchers at Stanford University are now challenging that assumption at its core. Their central claim: when a single agent and a team get the same amount of compute, the solo agent performs at least as well.
Every handoff loses information
The explanation, according to the researchers: when multiple agents collaborate, they have to pass intermediate results back and forth. Each handoff risks losing relevant information. A single agent, by contrast, keeps everything in one continuous reasoning process.
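The handoff effect the researchers describe can be illustrated with a toy sketch. This is not the study's setup: the fact lists, the `handoff_limit` parameter, and the "keep only the most recent items" compression are illustrative assumptions standing in for lossy summarization between agents.

```python
def single_agent(facts, steps):
    """One agent keeps the full context across every reasoning step."""
    context = list(facts)
    for step in range(steps):
        # Each new conclusion can draw on everything accumulated so far.
        context.append(f"derived-{step}")
    return context

def agent_team(facts, steps, handoff_limit=4):
    """Each handoff passes only a bounded summary to the next agent."""
    context = list(facts)
    for step in range(steps):
        # Toy model of a lossy handoff: only the most recent items survive.
        context = context[-handoff_limit:]
        context.append(f"derived-{step}")
    return context

facts = [f"fact-{i}" for i in range(10)]
solo = single_agent(facts, steps=3)
team = agent_team(facts, steps=3)
print(len(solo), len(team))                # 13 5
print("fact-0" in solo, "fact-0" in team)  # True False
```

In this caricature, the solo agent still holds every original fact after three steps, while the team has silently dropped the early ones at the first handoff, which is the failure mode the study attributes to compute-matched teams.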
The team tested four different models (Qwen3-30B-A3B, DeepSeek-R1-Distill-Llama-70B, and Gemini 2.5 Flash and Pro) on two multi-step reasoning benchmarks. They compared a single agent against five different team architectures, including sequential chains, debates, and ensemble approaches.
The results were clear: given the same compute budget, the single agent almost always matched or outperformed the team architectures, while using significantly fewer resources.
Long contexts remain a weak spot for solo agents
The study does acknowledge that the single agent's theoretical advantage only holds when it handles context perfectly. In practice, language models struggle with this: the longer a reasoning process gets, the harder it becomes to separate relevant information from noise. Researchers call these phenomena "context rot" and the "lost in the middle" effect, where models overlook information buried in the middle of long texts.
This is exactly where teams can pull ahead. In experiments with deliberately corrupted input text, structured teams outperformed the single agent when distortion was high, because splitting up the work made it easier to isolate relevant information from the noise. The study also found that teams benefited more when built on weaker base models. Error analysis showed that single agents sometimes think too narrowly, while teams cast a wider net and occasionally find answers the solo agent misses. The debate architecture proved to be the strongest team setup overall.
The study is limited to text-based reasoning tasks. Whether teams offer advantages for tool use or image processing isn't covered in the preprint.
AI News Without the Hype – Curated by Humans
Subscribe to THE DECODER for ad-free reading, a weekly AI newsletter, our exclusive "AI Radar" frontier report six times a year, full archive access, and access to our comment section.