
A new approach shows that carefully selected training data and flexible test-time compute control can help AI models tackle complex reasoning tasks more efficiently.


From a pool of nearly 60,000 question-answer pairs, the researchers selected just 1,000 high-quality examples that met three key criteria: the questions had to be difficult, come from diverse fields, and meet high standards for clarity and formatting. Each example also included reasoning steps generated with Gemini 2.0 Flash Thinking.
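Conceptually, the selection is a three-stage filter followed by sampling across domains. The sketch below illustrates the idea in Python; the quality_ok() and is_hard() helpers and the "domain" field are hypothetical stand-ins, since the paper's actual pipeline relies on model-based signals such as formatting checks and the solve rates of weaker models.

```python
import random

# Illustrative filters; the authors' real pipeline uses model-based
# signals (formatting checks, whether weaker models solve a question,
# subject classification), not these simple stand-ins.
def quality_ok(ex: dict) -> bool:
    """Reject malformed samples: empty fields, broken formatting, etc."""
    return bool(ex["question"].strip()) and bool(ex["answer"].strip())

def is_hard(ex: dict) -> bool:
    """Keep questions that a weaker baseline model failed to solve."""
    return not ex["baseline_solved"]

def select_examples(pool: list[dict], target: int = 1000) -> list[dict]:
    # Stages 1 and 2: quality and difficulty filters shrink the ~60k pool.
    candidates = [ex for ex in pool if quality_ok(ex) and is_hard(ex)]
    # Stage 3: sample across domains so no single field dominates.
    by_domain: dict[str, list[dict]] = {}
    for ex in candidates:
        by_domain.setdefault(ex["domain"], []).append(ex)
    selected: list[dict] = []
    domains = list(by_domain)
    while domains and len(selected) < target:
        domain = random.choice(domains)
        bucket = by_domain[domain]
        selected.append(bucket.pop(random.randrange(len(bucket))))
        if not bucket:  # domain exhausted, stop drawing from it
            domains.remove(domain)
    return selected
```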

Scatter plot: MATH500 accuracy of various AI models versus the number of training examples, with the s1 model leading in efficiency.
Despite training on only a fraction of the examples used by other models, s1-32B performs very well on a math benchmark. | Image: Muennighoff et al.

Using this compact but refined dataset, researchers from Stanford University and the Allen Institute for AI trained a medium-sized language model called s1-32B, based on Qwen2.5 with 32 billion parameters.

How 'budget forcing' improves AI reasoning

From these sample solutions, the model learned which reasoning steps and explanations lead to correct answers. Thanks to the focused data selection, training took just 26 minutes on 16 Nvidia H100 GPUs, roughly seven GPU-hours in total. While exact figures aren't available for comparable models like OpenAI o1 or DeepSeek-R1, they likely require thousands of GPU-hours.
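In training terms, this is ordinary supervised fine-tuning on reasoning traces. A minimal sketch of how one such example might be serialized into a training string, with delimiters that are assumptions rather than the paper's exact template:

```python
def format_example(question: str, thinking: str, answer: str) -> str:
    """Lay out one QA pair plus its reasoning trace as a single training
    string (illustrative format, not the paper's actual template)."""
    return (
        f"Question: {question}\n"
        f"<think>{thinking}</think>\n"
        f"Answer: {answer}"
    )
```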


The team also developed "budget forcing," a method for controlling how long the model thinks. If the model exceeds a set budget of thinking tokens, the thinking phase is cut off and it must provide an answer. When more thinking is wanted, the word "Wait" is appended instead, prompting the model to review its previous answer and check its reasoning for errors.

Example of budget forcing: an AI model counting the occurrences of 'r' in a word.
Budget forcing as an effective intervention: inserting "Wait" prolongs the model's thought process, leading to a self-correction from 2 to 3 'r's. | Image: Muennighoff et al.
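The intervention can be sketched as a small loop around the decoding call. Below is a minimal, runnable toy, assuming a generic decode() stand-in for the model and "</think>" as the end-of-thinking marker; neither is the authors' exact implementation, which operates on real model tokens.

```python
MAX_THINKING_TOKENS = 200  # upper budget: end thinking once exceeded
NUM_WAITS = 1              # lower budget: extend thinking this many times

def decode(context: str, stop: str) -> str:
    """Toy stand-in for an LLM decoder (hypothetical). It miscounts at
    first and corrects itself once a 'Wait' appears in its context."""
    if "Wait" in context:
        return " Checking letter by letter: r-a-s-p-b-e-r-r-y has 3 'r's." + stop
    return " 'raspberry'... I count 2 'r's." + stop

def budget_forced(question: str) -> str:
    trace = question + "\n<think>"
    waits = 0
    while True:
        chunk = decode(trace, stop="</think>")
        over_budget = len(trace.split()) >= MAX_THINKING_TOKENS
        if waits < NUM_WAITS and not over_budget:
            # Suppress the end-of-thinking marker and append "Wait" so
            # the model keeps reasoning and can revise its earlier answer.
            trace += chunk.replace("</think>", "") + "\nWait,"
            waits += 1
        else:
            # Budget spent (or enough extensions): keep the marker so the
            # thinking phase closes and the model must commit to an answer.
            return trace + chunk + "\nFinal answer:"

print(budget_forced("How many 'r's are in 'raspberry'?"))
```

Raising NUM_WAITS corresponds to granting a larger thinking budget; in the paper's experiments, more "Wait" insertions generally meant more tokens spent and higher accuracy.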

Budget forcing lets users adjust how thoroughly the model reasons. Tests showed that a larger budget, enforced through more frequent "Wait" insertions, produced better results. The trained model even outperformed OpenAI's more data-intensive o1-preview and o1-mini on math benchmarks.

Performance comparison table of various AI models with metrics for AIME 2024, MATH-500, and GPQA Diamond, split into API-only and open-source variants.
Compared to other closed and open language models, s1-32B shows its strengths particularly in mathematics. | Image: Muennighoff et al.

Further tests revealed that only the combination of all three data selection criteria (difficulty, diversity, and quality) delivered optimal performance. Selecting by a single criterion, or at random, produced results up to 30 percent worse.

Interestingly, even the complete dataset, 59 times larger, didn't improve on the carefully chosen 1,000 examples. Budget forcing proved more important, allowing precise control of test-time compute and revealing a clear link between the tokens invested and performance.

Scatter plot: correlation between average thinking time in tokens and accuracy on competition math problems, with an upward trend.
Extending the thinking time by inserting "Wait" leads to a significant improvement in mathematical problem-solving ability. | Image: Muennighoff et al.

The study shows that a small but well-chosen training dataset can prepare language models for complex reasoning tasks. Combined with flexible test-time compute, models can work more thoroughly when needed without increasing their size.


While s1-32B and budget forcing show promise, the benchmark results only reflect performance in a narrow set of skills. The researchers have shared their code and training data on GitHub to encourage further development.

Many research teams have tried to match leading AI models at complex reasoning by using ever larger datasets. OpenAI recently added its latest reasoning model, o3-mini, to ChatGPT. However, Chinese company DeepSeek has shown that competitive models can come from using resources efficiently and implementing good ideas; budget forcing might be one of them.

Summary
  • Researchers trained an AI model called s1-32B on a carefully selected dataset of 1,000 question-answer pairs, focusing on difficulty, diversity, and clarity.
  • They developed "budget forcing," a method to control the model's thinking process by capping the number of thinking tokens and, when more deliberation is needed, extending the thinking time with a simple "Wait" prompt.
  • The study shows that selective data and flexible compute management can enable efficient complex reasoning in AI models, with s1-32B outperforming larger models on math benchmarks.