
A new study challenges a core assumption in AI: instead of massive datasets, just 78 carefully chosen training examples may be enough to build superior autonomous agents.


Researchers from several Chinese institutions make this case in their LIMI ("Less Is More for Intelligent Agency") paper. They define "agency" as the ability of AI systems to act independently - discovering problems, forming hypotheses, and solving tasks through self-directed interaction with environments and tools.

LIMI flips the usual script for AI training. Instead of massive datasets, it uses just 78 handpicked examples from real software development and research tasks. Each one captures the full process of human-AI teamwork, from the first request through tool use, problem-solving, and final success. The goal: teach models to act as real autonomous agents.
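As a rough illustration of what one such training example might contain (this is not the authors' code, and all field names are hypothetical), each curated sample can be thought of as a complete multi-turn trajectory from request to verified success:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    role: str                      # "user", "assistant", or "tool"
    content: str                   # message text, reasoning, or tool output
    tool_call: str | None = None   # e.g. a shell command the model issued

@dataclass
class Trajectory:
    task: str                                        # the initial user request
    steps: list[Step] = field(default_factory=list)  # full collaboration record
    success: bool = False                            # verified final completion

# One curated example in the LIMI spirit: a complete demonstration
# from first request through tool use to verified success.
example = Trajectory(
    task="Build a C++ chat application",
    steps=[
        Step("user", "Build a C++ chat application"),
        Step("assistant", "Plan: scaffold the project, then implement sockets.",
             tool_call="cmake --version"),
        Step("tool", "cmake version 3.28.1"),
    ],
    success=True,
)
```

The point of the sketch is the granularity: instead of isolated question-answer pairs, every sample records an entire collaboration, which is what lets so few examples carry so much signal.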

On the AgencyBench benchmark, LIMI scored 73.5 percent using only 78 training samples. AgencyBench covers real-world scenarios such as building C++ chat apps, Java to-do lists, AI-powered games, microservice pipelines, and research tasks like LLM comparisons, data analysis, and business or sports analytics.


The performance gap is striking: Deepseek-V3.1 scored 11.9 percent, Kimi-K2-Instruct 24.1 percent, Qwen3-235B-A22B-Instruct 27.5 percent, and GLM-4.5 45.1 percent.

Dual plot: agent performance on AgencyBench and data efficiency.
LIMI delivers a 53.7 percent gain over models trained on 10,000 samples, using 128 times less data. | Image: Xiao et al.

LIMI also nailed 71.7 percent of requirements on the first try, nearly doubling the best baseline. Its overall success rate was 74.6 percent, far ahead of GLM-4.5's 47.4 percent. On standard coding and scientific computing benchmarks, LIMI posted a 57.2 percent average, again leading all baselines. Alternative training approaches underline the efficiency gap: a GLM-4.5 variant trained on 10,000 code samples reached only 47.8 percent on AgencyBench.
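The headline efficiency numbers follow directly from these figures; here is a quick back-of-the-envelope check (the scores are from the article, the arithmetic is ours):

```python
# Back-of-the-envelope check of the reported efficiency numbers.
limi_samples, baseline_samples = 78, 10_000
limi_score, baseline_score = 73.5, 47.8   # AgencyBench scores, in percent

data_reduction = baseline_samples / limi_samples                # ~128x less data
relative_gain = (limi_score - baseline_score) / baseline_score  # ~0.54

print(f"{data_reduction:.0f}x less data")    # -> 128x less data
print(f"{relative_gain:.1%} relative gain")  # -> ~53.8%, matching the paper's
                                             #    53.7 percent up to rounding
```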

Histogram of trajectory lengths (13,000 to 152,000 tokens, averaging around 42,400) and a ring diagram of domain coverage.
The LIMI team focused on two main domains: collaborative "vibe coding" for software development and research workflows for scientific tasks. The study argues these areas cover most knowledge work. | Image: Xiao et al.

Some trajectories stretched to 152,000 tokens, highlighting the depth and complexity of the autonomous behaviors LIMI was able to learn.

Flowchart: User queries from GitHub PR synthesis and real-world cases pass a quality check into the query pool; trajectories are then recorded in a CLI environment.
The data captures complete collaborative workflows, from initial task understanding to iterative model reasoning, tool use, and final task success. | Image: Xiao et al.
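A schematic of that curation pipeline, reduced to its bare logic (all function names and the filter heuristic below are placeholders, not the authors' implementation):

```python
def synthesize_from_prs(prs):
    """Turn GitHub pull requests into candidate task queries (placeholder logic)."""
    return [f"Implement the change described in PR: {pr['title']}" for pr in prs]

def passes_quality_check(query: str) -> bool:
    """Stand-in filter; the paper relied on careful expert vetting instead."""
    return len(query) > 20  # trivial heuristic, for illustration only

def build_query_pool(prs, real_world_cases):
    """Merge synthesized and real queries, keeping only those that pass the check."""
    candidates = synthesize_from_prs(prs) + list(real_world_cases)
    return [q for q in candidates if passes_quality_check(q)]

pool = build_query_pool(
    prs=[{"title": "Fix race condition in message queue"}],
    real_world_cases=["Build a Java to-do list application"],
)
print(pool)  # vetted queries; full trajectories would then be recorded in a CLI
```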

Rethinking AI training methods

The LIMI approach works across model sizes. LIMI-Air (106 billion parameters) lifted its base model's AgencyBench score from 17.0 to 34.3 percent, while the larger LIMI (355 billion parameters) jumped from 45.1 to 73.5 percent.

The results could change how autonomous AI systems are developed. While traditional methods rely on ever-larger training pipelines and massive datasets, LIMI points toward a different path. This approach looks promising, but it will need more research and real-world testing before it can become the new standard. The code, models, and datasets are all public.

Recommendation

Nvidia researchers have recently argued that most AI agents use language models that are far too large, and that models under 10 billion parameters may be sufficient for agentic tasks. LIMI's results point in a similar direction, offering empirical evidence that careful data curation can beat brute-force scaling.

Summary
  • LIMI is a new training method from Chinese researchers that achieves 73.5% on complex tasks in the AgencyBench benchmark using just 78 carefully selected training examples, outperforming previous open-weight models that required tens of thousands of examples.
  • LIMI surpasses models like GLM-4.5, Deepseek-V3.1, and Kimi-K2 in areas such as software development, scientific workflows, and coding benchmarks, reaching a 74.6% overall success rate compared to GLM-4.5's 47.4%, despite much less training data.
  • The study demonstrates that strategic selection of training data, rather than sheer volume, makes developing autonomous AI agents more efficient; the findings support the use of smaller, purpose-trained AI models for agent tasks.