
LIMI Study Shows Fewer Training Examples Can Boost AI Agent Performance

By Jonathan Kemper · 2 min read

Autonomous Agents · AI Training · Machine Learning

A new study suggests that only 78 carefully selected training examples may be sufficient to create superior autonomous AI agents, challenging the need for massive datasets.

A groundbreaking study from Chinese researchers challenges the conventional wisdom in AI development, proposing that just 78 meticulously chosen training examples may be enough to build highly effective autonomous agents. The research, detailed in the LIMI ("Less Is More for Intelligent Agency") paper, redefines AI "agency" as the ability to act independently—identifying problems, forming hypotheses, and solving tasks through self-directed interaction with environments and tools.
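
To make that definition concrete, the sketch below shows the minimal perceive-act loop this kind of agency implies: the agent chooses an action, runs a tool against its environment, observes the result, and repeats until it can answer. It is illustrative only; the function names and the scripted stub policy are invented for this example and are not code from the LIMI release.

    # Minimal agent loop: decide, act via a tool, observe, repeat.
    # All names here are hypothetical placeholders, not LIMI's actual code.
    def run_agent(choose_action, tools, task, max_steps=10):
        history = [("user", task)]
        for _ in range(max_steps):
            action = choose_action(history)            # decide: tool call or final answer
            if action["type"] == "answer":
                return action["content"]
            result = tools[action["tool"]](*action["args"])  # act on the environment
            history.append(("tool", result))           # observe, then loop

    # Stub policy standing in for a trained model, scripted for two steps.
    def scripted_policy(history):
        if len(history) == 1:
            return {"type": "tool", "tool": "search", "args": ("LIMI paper",)}
        return {"type": "answer", "content": "78 curated examples sufficed."}

    tools = {"search": lambda query: f"results for {query!r}"}
    print(run_agent(scripted_policy, tools, "Summarize the LIMI result"))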

Key Findings

  • Performance: On the AgencyBench benchmark, LIMI scored 73.5% with only 78 training samples, far outpacing DeepSeek-V3.1 (11.9%), Kimi-K2-Instruct (24.1%), and GLM-4.5 (45.1%).
  • Efficiency: LIMI posted a 53.7% gain over models trained on 10,000 samples while using roughly 1/128th of the data (see the quick check below).
  • Success Rate: LIMI met 71.7% of requirements on the first attempt, nearly double the best baseline, for an overall success rate of 74.6%.
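
The efficiency claims are easy to verify with back-of-envelope arithmetic on the two sample counts quoted above (78 versus 10,000):

    # Quick check of the data-efficiency ratios quoted in the findings.
    limi_samples = 78
    baseline_samples = 10_000

    ratio = baseline_samples / limi_samples           # how much less data LIMI used
    share = limi_samples / baseline_samples * 100     # LIMI's data as % of the baseline

    print(f"roughly {ratio:.0f}x less data")          # -> roughly 128x less data
    print(f"{share:.1f}% of the samples")             # -> 0.8% of the samples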

Figure: Bar charts comparing agent performance (LIMI at 73.5% on AgencyBench) and data efficiency (a 53.7% gain from 0.8% of the samples).

Methodology

The LIMI team focused on two primary domains:

  1. Collaborative "Vibe Coding": For software development tasks like building C++ chat apps and Java to-do lists.
  2. Research Workflows: For scientific tasks such as LLM comparisons, data analysis, and sports analytics.

Each training example captured the full process of human-AI teamwork, from initial requests through tool use and problem-solving. Some trajectories extended to 152,000 tokens, showcasing the depth of autonomous behaviors learned.
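
The article does not reproduce the dataset schema, but a trajectory of this kind could plausibly be stored as a structured record like the hypothetical sketch below; every field name is invented for illustration and may differ from LIMI's actual format.

    # Hypothetical shape of one training trajectory: a complete human-AI
    # collaboration from the initial request through tool use to resolution.
    trajectory = {
        "domain": "vibe_coding",  # or "research_workflow"
        "request": "Build a C++ chat app with a simple TCP backend.",
        "turns": [
            {"role": "user", "content": "Build a C++ chat app ..."},
            {"role": "assistant", "content": "Plan: scaffold project, sockets, message loop."},
            {"role": "tool_call", "tool": "shell", "args": ["cmake", "-B", "build"]},
            {"role": "tool_result", "content": "-- Configuring done ..."},
            # ... many more turns; the longest trajectories reach ~152,000 tokens
        ],
        "outcome": "all_requirements_met",
    }
    print(len(trajectory["turns"]), "turns recorded")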

Figure: Histogram of trajectory lengths (13k–152k tokens, average 42.4k) alongside a ring diagram of domain coverage across vibe coding and research workflows.

Implications

The study suggests a paradigm shift in AI training:

  • Smaller Models Can Excel: The 106B-parameter LIMI-Air rose from its base model's 17.0% to 34.3%, while the 355B-parameter LIMI jumped from 45.1% to 73.5%.
  • Quality Over Quantity: Carefully curated data outperforms massive datasets, aligning with recent findings from Nvidia that smaller models may suffice for agentic tasks.

The code, models, and datasets are publicly available, inviting further research and real-world testing.

