Researchers at Hugging Face have demonstrated significant performance improvements in open-source language models by intelligently scaling compute during inference, drawing inspiration from OpenAI's o1 model. Their approach combines various search strategies with reward models.
While scaling compute during pre-training has been crucial for developing large language models (LLMs) in recent years, the required resources are becoming increasingly expensive, leading researchers to explore alternative approaches. According to Hugging Face researchers, scaling compute during inference offers a promising solution: dynamic inference strategies let models spend more time processing complex tasks.
Although "test-time compute scaling" isn't new and has been a key factor in the success of AI systems like AlphaZero, OpenAI's o1 was the first to clearly demonstrate that language model performance can be significantly improved by allowing more time to "think" about difficult tasks. However, there are several possible approaches to implementation, and which one OpenAI uses remains unknown.
From basic to complex search strategies
The researchers examined three main search-based approaches. The "Best-of-N" method generates multiple candidate solutions and uses a reward model to select the best one. Beam search systematically explores the solution space, guided by a Process Reward Model (PRM). The newly developed "Diverse Verifier Tree Search" (DVTS) additionally optimizes for diversity among the solutions found.
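To make the simplest of these strategies concrete, here is a minimal, hypothetical sketch of Best-of-N using the transformers library. The model name and the score_with_reward_model helper are illustrative assumptions, not the exact setup from the Hugging Face experiments.

```python
# Best-of-N sketch: sample N candidate solutions, keep the one the reward
# model scores highest. Model name and scoring helper are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

GENERATOR = "meta-llama/Llama-3.2-1B-Instruct"  # assumed small generator
tokenizer = AutoTokenizer.from_pretrained(GENERATOR)
model = AutoModelForCausalLM.from_pretrained(GENERATOR, torch_dtype=torch.bfloat16)

def score_with_reward_model(problem: str, solution: str) -> float:
    """Placeholder for a reward model/verifier that returns a scalar score."""
    raise NotImplementedError  # plug in your own verifier here

def best_of_n(problem: str, n: int = 8, max_new_tokens: int = 512) -> str:
    inputs = tokenizer(problem, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,           # sampling yields diverse candidates
        temperature=0.8,
        num_return_sequences=n,   # N independent completions
        max_new_tokens=max_new_tokens,
    )
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Return the candidate the reward model likes best.
    return max(candidates, key=lambda c: score_with_reward_model(problem, c))
```

Beam search and DVTS differ in that they apply this kind of scoring to partial solutions during generation rather than only to finished candidates.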
The practical test results are impressive: A Llama model with just one billion parameters matched the performance of a model eight times larger. In mathematical tasks, it achieved an accuracy of nearly 55 percent—which Hugging Face says approaches the average performance of computer science PhD students.
A 3-billion-parameter model even outperformed the 22-times-larger 70-billion-parameter Llama 3.1, thanks to the team's compute-optimal scaling strategy, which selects the best search method for each compute budget.
In both cases, the team compared the results of the smaller models using the inference methods against those of the larger models without these methods.
Verifiers play a key role
Verifiers or reward models play a central role in all these approaches. They evaluate the quality of generated solutions and guide the search toward promising candidates. However, according to the team, benchmarks like ProcessBench show that current verifiers still have weaknesses, particularly regarding robustness and generalizability.
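As a rough illustration of how a verifier guides the search, the hypothetical sketch below scores a partial solution step by step with a process reward model and aggregates the step scores. The prm_score_step callable and the choice of aggregation (product of per-step scores) are assumptions for illustration, not the specific verifier used in the experiments.

```python
from math import prod
from typing import Callable, List

def score_partial_solution(
    problem: str,
    steps: List[str],
    prm_score_step: Callable[[str, List[str]], float],
) -> float:
    """Aggregate per-step scores from a process reward model (PRM).

    prm_score_step(problem, steps_so_far) is assumed to return the PRM's
    probability that the latest step is correct. Multiplying the step
    scores is one common aggregation; min() or last-step-only are others.
    """
    step_scores = [
        prm_score_step(problem, steps[: i + 1]) for i in range(len(steps))
    ]
    return prod(step_scores)

# In beam search or DVTS, the partial solutions with the highest aggregated
# scores are kept and expanded, steering generation toward promising branches.
```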
Improving verifiers is therefore an important direction for future research, but the ultimate goal is a model that can autonomously verify its own outputs, a capability the team suggests OpenAI's o1 already has.
More information and some tools used are available on Hugging Face.