Researchers have introduced a framework that trains multiple AI agents at the same time, with each agent taking on a specialized role. The aim is to handle complex, multi-step tasks more reliably through clearer division of labor and tighter coordination.
According to researchers at Imperial College London and Ant Group, most current AI systems rely on a single agent that must both plan and act. That setup works for simple tasks, but it breaks down once the process involves long chains of decisions. Errors stack up, and a single agent usually can't excel at both high-level planning and hands-on tool use. Different stages call for different skills, and single-agent systems often struggle to sustain long, coordinated reasoning in dynamic environments.
Their proposed solution is a structured hierarchy. One agent acts as a project manager who oversees the workflow, while specialized sub-agents handle specific tools like web search or data analysis. The research team found that multi-agent systems with a clear leader can solve tasks almost ten percent faster than systems without defined roles.
Vertical hierarchies work especially well, with a main agent delegating tasks and sub-agents reporting back. Anthropic is testing a similar setup in its recently introduced research agent.
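To make the structure concrete, here is a minimal sketch of such a vertical hierarchy: a main agent that plans and delegates, and sub-agents wrapping individual tools. The class names, the hard-coded plan, and the placeholder tool functions are illustrative assumptions, not the researchers' implementation; in a real system both levels would be driven by language models.

```python
# Illustrative sketch of a main agent delegating to tool-specific sub-agents.
# All names and the fixed two-step plan are hypothetical stand-ins.

class SubAgent:
    def __init__(self, name, tool):
        self.name = name
        self.tool = tool  # callable that executes the underlying tool

    def run(self, instruction: str) -> str:
        # A real sub-agent would prompt its own LLM to decide how to use the tool.
        return self.tool(instruction)


class MainAgent:
    """Plans the task and delegates tool work to specialized sub-agents."""

    def __init__(self, sub_agents: dict):
        self.sub_agents = sub_agents

    def solve(self, task: str) -> str:
        # A real main agent would generate this plan dynamically with an LLM.
        plan = [("search", f"find sources for: {task}"),
                ("analysis", f"summarize findings for: {task}")]
        notes = [self.sub_agents[role].run(step) for role, step in plan]
        return " | ".join(notes)


# Placeholder tools standing in for web search and data analysis.
agents = {
    "search": SubAgent("search", lambda q: f"[search results for '{q}']"),
    "analysis": SubAgent("analysis", lambda q: f"[analysis of '{q}']"),
}
print(MainAgent(agents).solve("invasive clownfish populations"))
```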

How M-GRPO enables more coordinated training
Most single-agent systems today are trained with Group Relative Policy Optimization, or GRPO: the agent generates several answers to the same prompt, the answers are scored and compared against each other, and the patterns behind the stronger ones are reinforced.
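The core of GRPO is that each sampled answer is judged relative to the other answers in its group rather than against an absolute baseline. The short sketch below shows that group-relative advantage calculation under a common simplification (mean-centering and scaling by the group's standard deviation); the exact scoring function in practice comes from a reward model or task verifier.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: each sampled answer is scored against the
    mean and spread of the whole group of answers to the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()
    scale = rewards.std() + 1e-6  # avoid division by zero for uniform groups
    return (rewards - baseline) / scale

# Four sampled answers to one prompt, scored by some reward function.
print(grpo_advantages([0.2, 0.9, 0.4, 0.5]))
# Answers above the group average get positive advantages and are reinforced;
# answers below it get negative advantages and are discouraged.
```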
Multi-agent systems complicate this process. Agents operate at different frequencies, handle different tasks, and may run on separate servers. Standard training approaches struggle in these conditions. Many systems force all agents to share the same large language model, limiting specialization even though each agent works with different data and responsibilities.
The researchers identify three main challenges. First, the workload is uneven: the main agent works continuously, while sub-agents only run when needed. That creates unstable training data. Second, team sizes vary. Depending on the task, the main agent might call one sub-agent or several, which complicates training. Third, agents often run on separate servers, making typical training methods hard to apply.
The new Multi-Agent Group Relative Policy Optimization, or M-GRPO, extends GRPO so main and sub-agents can be trained together while keeping their roles distinct.

Each agent is evaluated based on its specific role. The main agent is judged by the quality of the final answer, while sub-agents are evaluated using a mix of their local task performance and their contribution to the overall result. M-GRPO then calculates group-relative advantages by comparing each agent's reward to the average within its own group and adjusting training based on the difference.
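The sketch below illustrates this two-level reward scheme under stated assumptions: the main agent's reward is the final-answer score, each sub-agent's reward blends a local task score with that final score, and advantages are normalized separately within each group. The mixing weight `alpha` and the specific scores are assumptions for illustration, not values from the paper.

```python
import numpy as np

def group_relative(rewards):
    """Normalize rewards against the group mean and spread, as in GRPO."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-6)

# Hypothetical scores for four rollouts of the same task.
final_answer_scores = [1.0, 0.0, 1.0, 0.0]   # judged on the main agent's final answer
sub_local_scores    = [0.8, 0.3, 0.6, 0.2]   # judged on the sub-agent's own subtask
alpha = 0.5                                   # assumed weight for the local component

main_rewards = final_answer_scores
sub_rewards = [alpha * loc + (1 - alpha) * fin
               for loc, fin in zip(sub_local_scores, final_answer_scores)]

main_adv = group_relative(main_rewards)  # drives the main agent's policy update
sub_adv = group_relative(sub_rewards)    # drives the sub-agents' policy updates
print(main_adv, sub_adv)
```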
A trajectory alignment scheme handles the uneven number of sub-agent calls. The system fixes a target number of sub-agent trajectories per task and duplicates or drops trajectories to keep batch sizes consistent. Main and sub-agents can run on different servers and exchange only lightweight statistics through a shared database, keeping cross-server computation minimal.
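A minimal sketch of that alignment step is shown below: tasks that triggered too many sub-agent calls are truncated, and tasks that triggered too few have trajectories duplicated until the target count is reached. The function name and the choice to duplicate by random sampling are assumptions for illustration; the paper's exact alignment rule may differ.

```python
import random

def align_trajectories(sub_trajs, target_count):
    """Pad (by duplicating) or truncate one task's sub-agent trajectories so
    every task contributes the same number to the training batch."""
    if len(sub_trajs) >= target_count:
        return sub_trajs[:target_count]                 # drop the surplus
    extra = random.choices(sub_trajs, k=target_count - len(sub_trajs))
    return sub_trajs + extra                            # duplicate to fill the gap

# One task triggered two sub-agent calls, another five; both are aligned to four.
print(len(align_trajectories(["t1", "t2"], 4)))             # -> 4
print(len(align_trajectories(["a", "b", "c", "d", "e"], 4)))  # -> 4
```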
Why better instructions lead to better results
The researchers trained their M-GRPO system using the Qwen3-30B model on 64 H800 GPUs and tested it on three benchmarks: GAIA for general assistant tasks, XBench-DeepSearch for tool use across domains, and WebWalkerQA for web navigation.
Across all benchmarks, M-GRPO outperformed both single GRPO agents and multi-agent setups with untrained sub-agents. It produced more stable behavior and needed less training data to reach strong performance.

Real-world examples show how it helps. In a Rubik's Cube logic task, the trained system chose the correct reasoning tool for mathematical steps, while the untrained system tried to use a browser. In a research task on invasive fish species, the trained main agent issued much more precise instructions: instead of broadly searching for "invasive species Ocellaris Clownfish," it searched specifically for species that "became invasive after being released by pet owners."
Code and datasets are available on GitHub.