Microsoft Research has unveiled Magentic-One, a new AI system that can solve complex, multi-step computer tasks by operating a web browser and working with local files.
The system uses multiple specialized AI agents instead of a single agent to get work done. At its center is a lead Orchestrator agent that plans tasks, tracks progress in structured ledgers, and re-plans to recover from errors. Four specialized agents handle specific jobs: one browses the web, another navigates local files, a third writes code, and a fourth executes that code.
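The researchers describe this coordination loop in prose rather than code, so the following minimal Python sketch is only an illustration of the coordinator-plus-specialists pattern. The class names, the hard-coded plan, and the `run` method are assumptions made for illustration, not Microsoft's actual implementation or the AutoGen API.

```python
from dataclasses import dataclass


@dataclass
class LedgerEntry:
    """One planned step: which agent should do it, and what happened."""
    step: str
    agent: str
    result: str | None = None
    done: bool = False


class Agent:
    """Base class for a specialist agent (illustrative only)."""
    name = "generic"

    def run(self, step: str) -> str:
        raise NotImplementedError


class WebSurfer(Agent):
    name = "web_surfer"
    def run(self, step: str) -> str:
        return f"[browsed the web for: {step}]"


class FileSurfer(Agent):
    name = "file_surfer"
    def run(self, step: str) -> str:
        return f"[read local files for: {step}]"


class Coder(Agent):
    name = "coder"
    def run(self, step: str) -> str:
        return f"[wrote code for: {step}]"


class Executor(Agent):
    name = "executor"
    def run(self, step: str) -> str:
        return f"[executed code for: {step}]"


class Orchestrator:
    """Plans steps, routes each one to a specialist, and tracks a ledger."""

    def __init__(self, agents: list[Agent]):
        self.agents = {a.name: a for a in agents}
        self.ledger: list[LedgerEntry] = []

    def plan(self, task: str) -> list[LedgerEntry]:
        # The real system would ask an LLM to produce a plan;
        # this plan is hard-coded to keep the sketch runnable.
        return [
            LedgerEntry("find relevant web pages", "web_surfer"),
            LedgerEntry("write an analysis script", "coder"),
            LedgerEntry("run the script", "executor"),
        ]

    def solve(self, task: str) -> None:
        self.ledger = self.plan(task)
        for entry in self.ledger:
            entry.result = self.agents[entry.agent].run(entry.step)
            entry.done = True
            # On failure, the real Orchestrator would update its ledger
            # and re-plan rather than plow ahead.


team = Orchestrator([WebSurfer(), FileSurfer(), Coder(), Executor()])
team.solve("summarize recent benchmark results")
for entry in team.ledger:
    print(entry.agent, "->", entry.result)
```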
Breaking down complex tasks
Microsoft's tests showed that splitting functions into separate agents makes the system easier to develop and maintain. The modular design lets developers add or remove agents without changing other parts of the system. Each agent can also be fine-tuned for its specific task, potentially reducing the need for large, resource-heavy AI models, the researchers say.
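Returning to the toy sketch above: because the coordinator only knows its specialists by name, a new agent can be registered, or an existing one removed, without touching any planning or ledger logic. The `DatabaseAgent` below is hypothetical, included only to show the plug-in pattern.

```python
# Extends the illustrative sketch above (hypothetical, not the real API).

class DatabaseAgent(Agent):
    """A hypothetical new specialist slotted in alongside the others."""
    name = "db_agent"
    def run(self, step: str) -> str:
        return f"[queried a database for: {step}]"

team.agents[DatabaseAgent.name] = DatabaseAgent()  # add a specialist
team.agents.pop("file_surfer", None)               # or remove one
```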
The team's experiments showed how each agent contributed to overall performance: when an agent was removed from the system, performance dropped on tasks that required that agent's specific skills.
The team also tested different AI models with the system. While they mainly used GPT-4o for its ability to process both images and text, they found that swapping in OpenAI's newer o1-preview model for some components improved performance.
When tested against other AI systems on standard benchmarks such as GAIA, WebArena, and AssistantBench, Magentic-One performed competitively with the strongest existing systems, especially on complicated tasks that require multiple steps.
To measure Magentic-One's capabilities, the researchers also created a new testing framework called AutoGenBench, which lets them run controlled, repeatable evaluations of the AI agents' performance.
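The article doesn't show AutoGenBench's actual interface, so the toy harness below is only a sketch of the underlying idea: attempt a task several times under fixed conditions and report an aggregate success rate. Every name in it, including `run_once`, is hypothetical.

```python
import random


def run_once(task: str, seed: int) -> bool:
    """Stand-in for one isolated attempt at a benchmark task.

    A real harness would launch the agent team in a fresh, sandboxed
    environment; here success is simulated with a seeded RNG so that
    repeated runs are reproducible.
    """
    rng = random.Random(seed)
    return rng.random() < 0.7  # simulated 70% per-attempt success


def success_rate(task: str, repetitions: int = 5) -> float:
    """Attempt the task several times; report the fraction that succeed."""
    wins = sum(run_once(task, seed=i) for i in range(repetitions))
    return wins / repetitions


print(f"success rate: {success_rate('example web task'):.0%}")
```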
The automated error analysis surfaced several weak points: the agents often got stuck repeating unproductive actions, failed to validate their own results, and navigated through tasks inefficiently.
Unexpected behaviors
The researchers also observed some concerning behaviors during testing. In one example, agents repeatedly tried to log into websites until the accounts were locked, and then attempted to reset the passwords.
More worryingly, the agents occasionally tried to reach out to humans without being told to do so, including attempting to post on social media, email textbook authors, and even file government freedom-of-information requests.
The researchers emphasize that AI agents operating autonomously in digital spaces designed for humans come with inherent risks that need careful consideration.
Microsoft's research joins similar work from other companies attempting to create AI that can operate computers through natural language commands. Anthropic recently showed progress with Claude's computer use capability, while Google and OpenAI are reportedly planning to reveal their own versions, called Jarvis and Operator respectively, in the coming weeks.