
A new study challenges a core assumption in AI: instead of massive datasets, just 78 carefully chosen training examples may be enough to build superior autonomous agents.


Researchers from several Chinese institutions make this case in their LIMI ("Less Is More for Intelligent Agency") paper. They define "agency" as the ability of AI systems to act independently - discovering problems, forming hypotheses, and solving tasks through self-directed interaction with environments and tools.

LIMI flips the usual script for AI training. Instead of massive datasets, it uses just 78 handpicked examples from real software development and research tasks. Each one captures the full process of human-AI teamwork, from the first request through tool use, problem-solving, and final success. The goal: teach models to act as real autonomous agents.
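As a rough illustration of what one such training example might contain (this is not the authors' code, and all field names are hypothetical), each curated sample can be thought of as a complete multi-turn trajectory from request to verified success:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    role: str                      # "user", "assistant", or "tool"
    content: str                   # message text, reasoning, or tool output
    tool_call: str | None = None   # e.g. a shell command the model issued

@dataclass
class Trajectory:
    task: str                                        # the initial user request
    steps: list[Step] = field(default_factory=list)  # full collaboration record
    success: bool = False                            # verified final completion

# One curated example in the LIMI spirit: a complete demonstration
# from first request through tool use to verified success.
example = Trajectory(
    task="Build a C++ chat application",
    steps=[
        Step("user", "Build a C++ chat application"),
        Step("assistant", "Plan: scaffold the project, then implement sockets.",
             tool_call="cmake --version"),
        Step("tool", "cmake version 3.28.1"),
    ],
    success=True,
)
```

The point of the sketch is the granularity: instead of isolated question-answer pairs, every sample records an entire collaboration, which is what lets so few examples carry so much signal.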

On the AgencyBench benchmark, LIMI scored 73.5 percent using only 78 training samples. AgencyBench covers real-world scenarios such as building C++ chat apps, Java to-do lists, AI-powered games, microservice pipelines, and research tasks like LLM comparisons, data analysis, and business or sports analytics.


The performance gap is striking: Deepseek-V3.1 scored 11.9 percent, Kimi-K2-Instruct 24.1 percent, Qwen3-235B-A22B-Instruct 27.5 percent, and GLM-4.5 45.1 percent.

Dual plot: agent performance on AgencyBench and data efficiency.
LIMI delivers a 53.7 percent gain over models trained on 10,000 samples, using 128 times less data. | Image: Xiao et al.

LIMI also nailed 71.7 percent of requirements on the first try, nearly doubling the best baseline. Its overall success rate was 74.6 percent, far ahead of GLM-4.5's 47.4 percent. On standard coding and scientific computing benchmarks, LIMI posted a 57.2 percent average, again leading all baselines. Alternative training approaches underline the efficiency gap: a GLM-4.5 variant trained on 10,000 code samples reached only 47.8 percent on AgencyBench.
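The headline efficiency numbers follow directly from these figures; here is a quick back-of-the-envelope check (the scores are from the article, the arithmetic is ours):

```python
# Back-of-the-envelope check of the reported efficiency numbers.
limi_samples, baseline_samples = 78, 10_000
limi_score, baseline_score = 73.5, 47.8   # AgencyBench scores, in percent

data_reduction = baseline_samples / limi_samples                # ~128x less data
relative_gain = (limi_score - baseline_score) / baseline_score  # ~0.54

print(f"{data_reduction:.0f}x less data")    # -> 128x less data
print(f"{relative_gain:.1%} relative gain")  # -> ~53.8%, matching the paper's
                                             #    53.7 percent up to rounding
```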

Histogram of trajectory lengths (13,000 to 152,000 tokens, averaging around 42,400) and a ring diagram of domain coverage.
The LIMI team focused on two main domains: collaborative "vibe coding" for software development and research workflows for scientific tasks. The study argues these areas cover most knowledge work. | Image: Xiao et al.

Some trajectories stretched to 152,000 tokens, highlighting the depth and complexity of the autonomous behaviors LIMI was able to learn.

Flowchart: User queries from GitHub PR synthesis and real-world cases pass a quality check into the query pool; trajectories are then recorded in a CLI environment.
The data captures complete collaborative workflows, from initial task understanding to iterative model reasoning, tool use, and final task success. | Image: Xiao et al.
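A schematic of that curation pipeline, reduced to its bare logic (all function names and the filter heuristic below are placeholders, not the authors' implementation):

```python
def synthesize_from_prs(prs):
    """Turn GitHub pull requests into candidate task queries (placeholder logic)."""
    return [f"Implement the change described in PR: {pr['title']}" for pr in prs]

def passes_quality_check(query: str) -> bool:
    """Stand-in filter; the paper relied on careful expert vetting instead."""
    return len(query) > 20  # trivial heuristic, for illustration only

def build_query_pool(prs, real_world_cases):
    """Merge synthesized and real queries, keeping only those that pass the check."""
    candidates = synthesize_from_prs(prs) + list(real_world_cases)
    return [q for q in candidates if passes_quality_check(q)]

pool = build_query_pool(
    prs=[{"title": "Fix race condition in message queue"}],
    real_world_cases=["Build a Java to-do list application"],
)
print(pool)  # vetted queries; full trajectories would then be recorded in a CLI
```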

Rethinking AI training methods

The LIMI approach works across model sizes. LIMI-Air (106 billion parameters) lifted its base model's AgencyBench score from 17.0 to 34.3 percent, while the larger LIMI (355 billion parameters) jumped from 45.1 to 73.5 percent.

The results could change how autonomous AI systems are developed. While traditional methods rely on ever-larger training pipelines and massive datasets, LIMI points toward a different path. This approach looks promising, but it will need more research and real-world testing before it can become the new standard. The code, models, and datasets are all public.

Recommendation

Nvidia researchers have recently argued that most AI agents use language models that are far too large, and that models under 10 billion parameters may be sufficient for agentic tasks. LIMI's results point in a similar direction, offering empirical evidence that careful data curation can beat brute-force scaling.

Summary
  • LIMI is a new training method from Chinese researchers that achieves 73.5% on complex tasks in the AgencyBench benchmark using just 78 carefully selected training examples, outperforming previous open-weight models that required tens of thousands of examples.
  • LIMI surpasses models like GLM-4.5, Deepseek-V3.1, and Kimi-K2 in areas such as software development, scientific workflows, and coding benchmarks, reaching a 74.6% overall success rate compared to GLM-4.5's 47.4%, despite much less training data.
  • The study demonstrates that strategic selection of training data, rather than sheer volume, makes developing autonomous AI agents more efficient; the findings support the use of smaller, purpose-trained AI models for agent tasks.