
Nvidia researchers say the AI industry is too focused on oversized large language models (LLMs) for agent systems, a strategy they argue is both economically and environmentally unsustainable.


In a recent paper, they suggest most agents could run just as well on smaller language models (SLMs) and urge companies to rethink their approach.

The market for LLM APIs that power agent systems was valued at $5.6 billion in 2024, but cloud infrastructure spending for these systems hit $57 billion, a 10-to-1 gap. "This operational model is deeply ingrained in the industry — so deeply ingrained, in fact, that it forms the foundation of substantial capital bets," they write.

SLMs, which they define as models under 10 billion parameters, are, in their view, "principally sufficiently powerful," "inherently more operationally suitable," and "necessarily more economical" for most agent workloads.

Figure: Two ways to control AI agents. On the left, the language model manages both user interaction and tool orchestration; on the right, a dedicated controller separates orchestration from the user interface, allowing for more structured workflows. | Image: Nvidia

The researchers argue that smaller models can often match or beat much larger ones. They cite Phi-2 from Microsoft, which they say rivals 30-billion-parameter LLMs in reasoning and code while running 15 times faster. Nvidia’s Nemotron-H models, with up to 9 billion parameters, reportedly deliver similar accuracy to 30-billion-parameter LLMs using far less compute. They also claim Deepseek-R1-Distill-Qwen-7B and DeepMind’s RETRO match or outperform much larger proprietary models on key tasks.

The economics lean small

Nvidia's researchers say the math favors SLMs. Running a 7‑billion‑parameter model costs 10 to 30 times less than operating a 70‑ to 175‑billion‑parameter LLM, factoring in latency, energy use, and compute requirements. Fine‑tuning can be done in a few GPU hours instead of taking weeks, making small models much faster to adapt. Many can also run locally on consumer hardware, which cuts latency and gives users more control over their data.
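The claimed cost gap can be illustrated with simple arithmetic. The prices below are hypothetical placeholders chosen only to show how the paper's 10-to-30x ratio plays out; they are not figures from Nvidia's paper.

```python
# Illustrative back-of-envelope math with hypothetical prices, given in
# cents per million tokens (NOT figures from the paper): if serving a
# ~7B SLM costs 10 cents per million tokens and a 70B-175B LLM costs
# 100 to 300 cents, the 10-30x cost ratio follows directly.
slm_cents_per_m_tokens = 10            # assumed price for a ~7B model
llm_cents_per_m_tokens = (100, 300)    # assumed range for a 70B-175B model

ratios = tuple(llm / slm_cents_per_m_tokens for llm in llm_cents_per_m_tokens)
print(ratios)  # (10.0, 30.0)

# Monthly bill, in dollars, for an agent issuing 500M tokens of queries:
monthly_tokens_m = 500
slm_bill = monthly_tokens_m * slm_cents_per_m_tokens / 100
llm_bill_low = monthly_tokens_m * llm_cents_per_m_tokens[0] / 100
print(slm_bill, llm_bill_low)  # 50.0 500.0
```

Even at the low end of the assumed LLM price range, the same workload costs an order of magnitude more to serve, which is the core of the researchers' economic argument.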

The team also claims SLMs use their parameters more efficiently, while large models often activate only a small fraction for a given input—an inefficiency they see as built‑in. They argue that AI agents rarely need the full range of capabilities an LLM provides. "An AI agent is essentially a heavily instructed and externally choreographed gateway to a language model," they write.

Most agent tasks are repetitive, narrowly scoped, and not conversational, which makes specialized SLMs fine‑tuned for those formats a better fit. Their recommendation is to build heterogeneous agent systems that rely on SLMs by default, reserving larger models for situations that truly require complex reasoning.
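A heterogeneous system of this kind can be sketched as a simple router that defaults to a small model and escalates only when needed. The model names, the `Task` type, and the `call_model` stub below are all hypothetical placeholders, not APIs from the paper.

```python
# Hypothetical sketch of a heterogeneous agent router: route each task to a
# small model by default, and escalate to a large model only when the task
# is flagged as requiring open-ended reasoning.
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    needs_complex_reasoning: bool = False  # set by an upstream classifier

def call_model(model: str, prompt: str) -> str:
    # Placeholder for a real inference call (local SLM or hosted LLM).
    return f"[{model}] {prompt}"

def route(task: Task, slm: str = "nemotron-h-9b", llm: str = "large-llm") -> str:
    model = llm if task.needs_complex_reasoning else slm
    return call_model(model, task.prompt)

print(route(Task("extract the invoice date")))          # stays on the SLM
print(route(Task("plan a multi-step refactor", True)))  # escalated to the LLM
```

The design choice mirrors the paper's recommendation: the cheap path is the default, and the expensive model is an explicit exception rather than the baseline.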

Why SLMs aren't taking over

According to Nvidia's team, the biggest barriers are the industry's heavy investment in centralized LLM infrastructure, its focus on broad benchmark scores, and the lack of public awareness about how capable small models have become.

Recommendation

They lay out a six‑step plan for making the shift: collect data, filter and curate it, cluster tasks, pick the right SLM, fine‑tune it for specific needs, and keep improving over time. In case studies, they found that 40 to 70 percent of LLM queries in open‑source agents like MetaGPT, Open Operator, and Cradle could be handled just as well by SLMs.
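The six steps can be laid out as a skeletal pipeline. Every function body below is a hypothetical placeholder (real systems would use embeddings and clustering, actual fine-tuning, and so on); only the ordering of the stages mirrors the researchers' plan.

```python
# Skeletal version of the six-step migration loop: collect, curate,
# cluster, select, fine-tune, iterate. All bodies are placeholders.
def collect_data(agent_logs):
    return list(agent_logs)                      # 1. collect agent queries

def curate(queries):
    return [q for q in queries if q.strip()]     # 2. filter and curate

def cluster_tasks(queries):
    # 3. group recurring task types (a real system would embed + cluster)
    clusters = {}
    for q in queries:
        clusters.setdefault(q.split()[0].lower(), []).append(q)
    return clusters

def select_slm(cluster_name):
    return f"slm-for-{cluster_name}"             # 4. pick a candidate SLM

def fine_tune(model, examples):
    return f"{model}-ft({len(examples)} examples)"  # 5. specialize it

def migrate(agent_logs):
    clusters = cluster_tasks(curate(collect_data(agent_logs)))
    # 6. iterate: in production, retrain as new logs accumulate
    return {name: fine_tune(select_slm(name), ex)
            for name, ex in clusters.items()}

print(migrate(["summarize ticket 42", "summarize ticket 43", "extract dates"]))
```

The point of the loop is that agent traffic itself becomes the training data: recurring query clusters reveal which tasks are narrow enough to hand off to a specialized small model.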

The researchers observe that for many, the shift to SLMs represents "not only a technical refinement but also a Humean moral ought" in light of rising costs and the environmental toll of large‑scale infrastructure. Mistral recently supported that view when it released detailed data on the energy consumption of its largest models.

It might seem odd for Nvidia, one of the biggest beneficiaries of the LLM boom, to make this argument. But pushing smaller, cheaper models could grow the overall AI market and help embed the technology more deeply across businesses and consumer devices. Nvidia is seeking feedback from the community and plans to publish selected responses online.

Summary
  • Researchers at Nvidia caution that the current industry emphasis on large language models for AI agents is both costly and harmful to the environment and instead suggest a stronger focus on smaller models with fewer than 10 billion parameters.
  • Models like Microsoft's Phi, Nvidia's Nemotron-H family, and Deepseek-R1-Distill-Qwen-7B are cited as proof that smaller models can match or surpass the performance of much larger systems in many areas while using less energy and money.
  • The Nvidia team urges a shift toward agent systems based on smaller models, arguing this approach is not only more economical but also a moral imperative due to increasing infrastructure expenses and environmental impact.
Jonathan writes for THE DECODER about how AI tools can improve both work and creative projects.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.