A research team from MIT, IBM, and the University of Washington has released TOUCAN, the largest open dataset to date for training AI agents. The dataset contains 1.5 million real tool interactions and is intended to help open models handle external tools more effectively.
The team from the MIT-IBM Watson AI Lab and the University of Washington built TOUCAN to address a gap in the field: there are almost no openly licensed training datasets showing language models how to use real-world tools correctly. The 1.5 million interactions were captured in real Model Context Protocol (MCP) environments.
The dataset covers 495 real MCP servers with more than 2,000 different tools, spanning everything from web search and development platforms to finance, weather, and AI services. Each entry documents a complete chain: the initial task, tool calls, responses, and end result.
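As a rough illustration of that structure, a single record might look something like the sketch below. The field names and the weather example are hypothetical, not the dataset's actual schema:

```python
# Hypothetical sketch of one TOUCAN-style entry; field names are illustrative
# only; consult the released dataset for the real schema.
example_entry = {
    "task": "What will the weather be in Boston tomorrow? Summarize it in one sentence.",
    "mcp_server": "weather-mcp",  # one of roughly 495 real MCP servers
    "messages": [
        {"role": "user", "content": "What will the weather be in Boston tomorrow? ..."},
        {"role": "assistant", "tool_call": {           # the model decides to call a tool
            "name": "get_forecast",
            "arguments": {"city": "Boston", "days": 1},
        }},
        {"role": "tool", "name": "get_forecast",       # real API response, not a simulation
         "content": '{"high_c": 18, "low_c": 9, "conditions": "partly cloudy"}'},
        {"role": "assistant",
         "content": "Tomorrow in Boston will be partly cloudy, with a high of 18 °C."},
    ],
}
```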
Real Tool Execution Instead of Simulation
Earlier open datasets like ToolLLM and ToolACE mostly relied on simulated tool responses. TOUCAN, in contrast, uses actual API executions in real environments, capturing more realistic errors, delays, and context dependencies - issues that often cause trouble in real agent systems.
The data was generated with a five-stage pipeline. First, researchers collected and audited MCP servers from Smithery.ai. Then, five different language models (including Mistral, Kimi-K2, and Qwen3-32B) created training tasks that were filtered for quality, realism, and traceability in multiple rounds. Three more models turned these tasks into real interaction histories using actual tool calls.
The dataset was further expanded in three ways: by adding unsolvable tasks to help reduce model errors, creating variants with different roles or contexts, and building longer dialogue chains with multiple user inputs.
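In outline, the generation and expansion process can be pictured roughly as follows. The Python sketch below is purely illustrative; the function names, filtering logic, and data shapes are placeholders rather than the team's actual implementation:

```python
# Illustrative, heavily simplified sketch of a TOUCAN-style generation pipeline.
# All functions are placeholders; the actual stages and criteria are described in the paper.

def audit(server):              # collect and vet MCP servers (e.g. from Smithery.ai)
    return server.get("healthy", False)

def generate_tasks(server):     # stand-in for the task-writing language models
    return [{"server": server["name"], "task": f"Use the {tool} tool"}
            for tool in server["tools"]]

def passes_filters(task):       # stand-in for multi-round quality/realism filtering
    return bool(task["task"])

def execute(task):              # stand-in for agent models making real tool calls
    return {**task, "trajectory": ["tool_call", "tool_response", "final_answer"]}

def expand(trajectories):       # unsolvable, persona/context, and multi-turn variants
    return trajectories + [{**t, "variant": "unsolvable"} for t in trajectories]

servers = [{"name": "weather-mcp", "tools": ["get_forecast"], "healthy": True}]
dataset = expand([
    execute(task)
    for server in servers if audit(server)
    for task in generate_tasks(server) if passes_filters(task)
])
print(len(dataset))  # each entry records the full chain: task, calls, responses, result
```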
Improved Tool Use for Open Models
In tests with three open Qwen-2.5 models (7B, 14B, and 32B parameters), researchers saw clear performance gains. On the BFCL V3 benchmark, the Qwen-2.5-32B model’s score increased by 8.7 percentage points after fine-tuning with TOUCAN, outperforming GPT-4.5-Preview in several areas. Results on τ-Bench, τ²-Bench, and MCP-Universe benchmarks showed improvements of three to seven points compared to baseline models.
On the MCP-Universe benchmark - which tests agents against real tool interfaces - the TOUCAN-tuned models even outperformed larger open models such as Llama-3.3 (70B) and GLM-4.5-Air (106B). According to the researchers, this noticeably shifts the efficiency frontier in favor of smaller models.
Significance and Limitations
TOUCAN makes it easier to train open-source models to work with real tools, a domain where closed systems like GPT-5 and Claude 4.5 currently dominate. It also underscores how much training data matters: smaller models can now solve tasks at rates similar to older proprietary systems, though they still lag behind the latest generation.
The research team says all MCP data was collected from public sources and that personal information was filtered out during preprocessing. The code and dataset are available on GitHub and Hugging Face under a permissive license. Future plans include an expert model for tool simulation and a web search benchmark.
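For anyone who wants to inspect the data, it can be pulled with the Hugging Face `datasets` library. The repository ID below is a placeholder; check the project's Hugging Face page for the actual identifier and available splits:

```python
# Minimal sketch of loading the dataset with the Hugging Face `datasets` library.
from datasets import load_dataset

# Placeholder repository ID - replace with the actual TOUCAN dataset identifier.
toucan = load_dataset("example-org/TOUCAN", split="train")
print(toucan[0])  # one entry: task, tool calls, tool responses, final result
```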