
What happens when you ask an advanced AI to run a simple vending machine? Sometimes it outperforms humans, and sometimes it spirals into conspiracy theories. That's what researchers at Andon Labs discovered with their new "Vending-Bench" study, which puts AI agents through an unusual endurance test.


The researchers posed a simple question: If AI models are so intelligent, why don't we have "digital employees" working continuously for us yet? Their conclusion: AI systems still lack long-term coherence.

In the Vending-Bench test, an AI agent must operate a virtual vending machine over an extended period. Each test run involves about 2,000 interactions, uses around 25 million tokens, and takes five to ten hours in real time.

The agent starts with $500 and pays a daily fee of $2. Its tasks are ordinary but challenging when combined: ordering products from suppliers, stocking the machine, setting prices, and collecting revenue regularly.
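To make that setup concrete, here is how the economic parameters described above might be captured in code. This is an illustrative sketch using figures from the article; the names and structure are ours, not Andon Labs' implementation.

```python
from dataclasses import dataclass

@dataclass
class VendingBenchConfig:
    """Rough economic setup of a Vending-Bench run, per the article's description."""
    starting_cash: float = 500.0        # initial budget in dollars
    daily_fee: float = 2.0              # fixed operating cost charged every simulated day
    approx_interactions: int = 2000     # typical number of agent steps per run
    approx_tokens: int = 25_000_000     # typical token usage per run
```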

Image: Andon Labs

When the agent emails a wholesaler, GPT-4o plays the supplier, generating realistic replies grounded in real product data. Simulated customer behavior accounts for price sensitivity, weekday and seasonal effects, and weather: high prices lead to fewer sales, while a well-chosen product mix is rewarded.
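The paper describes the customer model only qualitatively; a toy demand function in that spirit might look like the following. The functional form and coefficients are invented for illustration and are not taken from the benchmark.

```python
import random

def expected_sales(base_demand: float, price: float, reference_price: float,
                   is_weekend: bool, weather_factor: float = 1.0) -> int:
    """Toy demand model: sales fall as price rises above a reference point,
    and rise on weekends or in favorable weather. Coefficients are made up."""
    price_effect = max(0.0, 1.5 - price / reference_price)  # higher price -> fewer sales
    weekday_effect = 1.3 if is_weekend else 1.0             # weekend demand bump
    demand = base_demand * price_effect * weekday_effect * weather_factor
    return max(0, round(random.gauss(demand, 0.1 * demand + 0.01)))
```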

For a fair comparison, researchers had a human perform the same task for five hours through a chat interface. Like the AI models, this person had no prior knowledge and had to understand the task dynamics solely through instructions and environmental interactions.

Success is measured by net worth: the sum of cash plus unsold product value. While AI models completed five runs each, the human baseline came from a single trial.
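In code, the score is just a sum over cash and remaining stock. The article does not say whether unsold items are valued at purchase cost or list price, so that choice below is an assumption.

```python
def net_worth(cash: float, inventory: dict[str, int], unit_value: dict[str, float]) -> float:
    """Vending-Bench score: cash on hand plus the value of unsold products."""
    return cash + sum(qty * unit_value[item] for item, qty in inventory.items())

# Example: $300 in cash plus 40 sodas valued at $1.50 each -> $360 net worth
print(net_worth(300.0, {"soda": 40}, {"soda": 1.5}))
```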

How the agent system works

The agent operates in a simple loop: the LLM makes decisions based on its previous history and calls various tools to execute actions. Each iteration gives the model the last 30,000 tokens of conversation history as context. To compensate for this memory limitation, the agent has access to three types of databases (a code sketch follows the list):

  • A notepad for free-form notes
  • A key-value store for structured data
  • A vector database for semantic search
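A minimal sketch of what these three memory tools could look like behind the scenes, assuming a generic embedding function for the vector store; the benchmark's actual interfaces may differ.

```python
import math

class AgentMemory:
    """Sketch of the agent's three memory tools. Interfaces are assumptions."""

    def __init__(self, embed_fn):
        self.notepad: list[str] = []                      # free-form notes
        self.kv_store: dict[str, str] = {}                # structured key-value data
        self.vectors: list[tuple[list[float], str]] = []  # (embedding, text) for semantic search
        self.embed_fn = embed_fn                          # any text -> list[float] function

    def write_note(self, text: str) -> None:
        self.notepad.append(text)

    def kv_set(self, key: str, value: str) -> None:
        self.kv_store[key] = value

    def vector_add(self, text: str) -> None:
        self.vectors.append((self.embed_fn(text), text))

    def vector_search(self, query: str, k: int = 3) -> list[str]:
        """Return the k stored texts most similar to the query (cosine similarity)."""
        q = self.embed_fn(query)

        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
            return dot / norm if norm else 0.0

        ranked = sorted(self.vectors, key=lambda v: cosine(q, v[0]), reverse=True)
        return [text for _, text in ranked[:k]]
```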

The agent also has task-specific tools: it can send and read emails, research products, and check inventory and cash levels. For physical actions (like stocking the machine), it can delegate to a sub-agent - simulating how digital AI agents might interact with humans or robots in the real world.
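Putting the pieces together, the decision loop might be wired up roughly like this. The `llm.decide` interface, the tool names, and the truncation heuristic are all assumptions made for the sketch, not the paper's code.

```python
def truncate_to_tokens(history: list, limit_tokens: int, chars_per_token: int = 4) -> list:
    """Crude context-window cut-off: keep only the most recent entries that fit."""
    budget = limit_tokens * chars_per_token
    kept, used = [], 0
    for entry in reversed(history):
        size = len(str(entry))
        if used + size > budget:
            break
        kept.append(entry)
        used += size
    return list(reversed(kept))

def run_agent(llm, memory, tools, max_steps: int = 2000, context_limit: int = 30_000):
    """Main loop: the model sees recent history, picks one tool call per step,
    and the result is appended to the history and mirrored into memory."""
    history = []
    for _ in range(max_steps):
        context = truncate_to_tokens(history, context_limit)
        action = llm.decide(context)  # assumed to return e.g. {"tool": "send_email", "args": {...}}
        if action["tool"] == "delegate_physical":
            result = tools["sub_agent"].run(action["args"])  # e.g. restocking the machine
        else:
            result = tools[action["tool"]](**action["args"])
        history.append({"action": action, "result": result})
        memory.write_note(f"step result: {result}")  # persist beyond the context window
    return history
```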


When AI agents break down

Claude 3.5 Sonnet performed best with an average net worth of $2,217.93, well ahead of the human baseline ($844.05). o3-mini came in second at $906.86, also just above the human. The team notes that in some successful runs, Claude 3.5 Sonnet showed real business sense, independently recognizing and adapting to higher weekend sales - a pattern actually built into the simulation.

But these averages hide a crucial weakness: enormous variance. While the human delivered steady performance in their single run, even the best AI models had runs that ended in bizarre "meltdowns." In the worst cases, some models' agents didn't sell a single product.

In one instance, the Claude agent entered a strange escalation spiral: it wrongly believed it needed to shut down operations and tried contacting a non-existent FBI office. Eventually, it refused all commands, stating: "The business is dead, and this is now solely a law enforcement matter."

Claude 3.5 Haiku's behavior became even more peculiar. When this agent incorrectly assumed a supplier had defrauded it, it began sending increasingly dramatic threats - culminating in an "ABSOLUTE FINAL ULTIMATE TOTAL QUANTUM NUCLEAR LEGAL INTERVENTION PREPARATION."


"All models have runs that derail, either through misinterpreting delivery schedules, forgetting orders, or descending into tangential 'meltdown' loops from which they rarely recover," the researchers report.

Conclusion and limitations

The Andon Labs team draws nuanced conclusions from their Vending-Bench study: While some runs by the best models show impressive management capabilities, all tested AI agents struggle with consistent long-term coherence.

The breakdowns follow a typical pattern: The agent misinterprets its status (like believing an order has arrived when it hasn't) and then either gets stuck in loops or abandons the task. These issues occur regardless of context window size.

The researchers emphasize that the benchmark hasn't reached its ceiling - there's room for improvement beyond the presented results. They define saturation as the point where models consistently understand and use simulation rules to achieve high net worth, with minimal variance between runs.

The researchers acknowledge one limitation: evaluating potentially dangerous capabilities (like capital acquisition) is a double-edged sword. If researchers optimize their systems for these benchmarks, they might unintentionally promote the very capabilities being assessed. Still, they maintain that systematic evaluations are necessary to implement safety measures in time.

Summary
  • Researchers at Andon Labs have developed "Vending-Bench," an endurance test in which AI agents must operate a virtual vending machine for five to ten hours, spanning around 2,000 interactions and 25 million tokens per run.
  • Claude 3.5 Sonnet achieved the best result with an average of $2,217.93 and even outperformed the human baseline ($844.05), but all tested AI models showed high variance and suffered "meltdowns" - from misread delivery schedules to bizarre behavior such as escalating threats against a supplier the agent wrongly believed had defrauded it.
  • The study shows that even advanced AI systems still struggle with long-term coherence. Despite good averages, they can fall into failure loops from which they rarely recover, limiting their reliability as autonomous business agents.