
A new study from Johns Hopkins University shows how giving AI systems more time to "think" improves their ability to determine when they can and cannot accurately answer questions.


The research team examined how additional computing time affects AI decision-making and developed a new evaluation framework that addresses current weaknesses in AI assessment methods. Traditional evaluation approaches assume AI models should always provide an answer regardless of their confidence level - a practice the team says doesn't reflect real-world scenarios where incorrect answers could have serious consequences.

The team tested two language models - DeepSeek R1-32B and s1-32B - using 30 math problems from the AIME24 dataset. They varied the available computing time (specifically the number of reasoning tokens) and observed how the models behaved at different confidence thresholds.
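The paper does not publish its evaluation pipeline, but the basic sweep is straightforward to picture. The sketch below is a minimal illustration, assuming a hypothetical solve(problem, budget) helper that returns an answer together with a confidence score; the budget and threshold grids are made up for the example and are not the study's actual settings.

```python
from typing import Callable, Dict, List, Sequence, Tuple

def sweep_budgets_and_thresholds(
    solve: Callable[[str, int], Tuple[str, float]],   # (problem, token budget) -> (answer, confidence)
    problems: List[Tuple[str, str]],                   # (problem text, ground-truth answer)
    budgets: Sequence[int] = (1024, 2048, 4096, 8192),
    thresholds: Sequence[float] = (0.0, 0.5, 0.8, 0.95),
) -> Dict[Tuple[int, float], Dict[str, float]]:
    """For each reasoning-token budget and confidence threshold, record how often
    the model attempts an answer and how accurate those attempted answers are."""
    results = {}
    for budget in budgets:
        # One generation pass per budget; confidence thresholds are applied afterwards.
        outputs = [(solve(question, budget), truth) for question, truth in problems]
        for t in thresholds:
            # The model only "answers" when its confidence clears the threshold.
            attempted = [(answer, truth) for (answer, conf), truth in outputs if conf >= t]
            correct = sum(answer == truth for answer, truth in attempted)
            results[(budget, t)] = {
                "answer_rate": len(attempted) / len(problems),
                "accuracy_when_answering": correct / len(attempted) if attempted else 0.0,
            }
    return results
```

A threshold of 0.0 corresponds to the traditional setup where the model always answers; higher thresholds reward models that abstain when unsure.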

The results showed that more "thinking time" not only improved accuracy but also helped the systems better recognize when they shouldn't attempt an answer. With more time to process, the models developed a clearer sense of which questions they could answer confidently and which ones they couldn't.


Different risk scenarios reveal hidden strengths

The study examined three risk scenarios: "Exam Odds" with no penalties for wrong answers, "Jeopardy Odds" with equal weighting of rewards and penalties, and "High-Stakes Odds" with severe penalties for errors in critical decision contexts.
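Read as scoring rules, the three settings differ only in how much a wrong answer costs relative to an abstention. The following is a minimal sketch of that idea; the "High-Stakes" penalty value is chosen arbitrarily for illustration and is not taken from the study.

```python
def score(outcomes, wrong_penalty: float) -> float:
    """Average score over outcomes, each 'correct', 'wrong', or 'abstain'.
    Correct answers earn +1, abstentions earn 0, wrong answers cost `wrong_penalty`."""
    points = {"correct": 1.0, "abstain": 0.0, "wrong": -wrong_penalty}
    return sum(points[o] for o in outcomes) / len(outcomes)

outcomes = ["correct", "wrong", "abstain", "correct", "wrong"]

print(score(outcomes, wrong_penalty=0.0))   # "Exam Odds": wrong answers are free
print(score(outcomes, wrong_penalty=1.0))   # "Jeopardy Odds": a wrong answer cancels a correct one
print(score(outcomes, wrong_penalty=10.0))  # "High-Stakes Odds": errors dominate the score
```

Under "Exam Odds" the best strategy is to always answer; once wrong answers carry a cost, a model that knows when to abstain comes out ahead.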

An interesting distinction emerged between the tested models: while both performed similarly under standard conditions, DeepSeek R1-32B performed significantly better than s1-32B under stricter confidence requirements - a difference only revealed through the new testing framework.

The researchers note that their confidence measurement method, which relies solely on token probabilities, might not capture every aspect of model uncertainty. They also acknowledge that by focusing on mathematical problems in English, they may have missed important variations in other domains and languages.
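One common way to turn token probabilities into such a confidence score - used here purely as an illustrative assumption, not necessarily the authors' exact formula - is the geometric mean of the probabilities assigned to the answer tokens:

```python
import math
from typing import Sequence

def answer_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities over the final answer span,
    computed from per-token log-probabilities returned by the model."""
    if not token_logprobs:
        return 0.0
    mean_logprob = sum(token_logprobs) / len(token_logprobs)  # average log-probability
    return math.exp(mean_logprob)                             # back to probability scale

# Example: three answer tokens generated with probabilities 0.9, 0.8, and 0.95
print(answer_confidence([math.log(0.9), math.log(0.8), math.log(0.95)]))  # ~0.88
```

Because this signal only reflects how probable the generated tokens were, it can miss other sources of uncertainty, which is exactly the limitation the researchers flag.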

The team recommends that future work on test-time scaling be evaluated under both "Exam Odds" and "Jeopardy Odds" conditions. This more comprehensive evaluation approach would help developers better understand how their systems perform across different risk contexts.

Summary
  • Researchers from Johns Hopkins University explored the impact of increased computing time on AI decision-making and created a new evaluation framework that addresses limitations in previous AI assessment methods.
  • Providing AI systems with more "thinking time" improves the accuracy of their responses but also enhances their ability to determine when they should refrain from answering. As computing time increased, the models gained a better understanding of their own limitations.
  • The study highlighted differences between the tested models: Although both performed comparably well under standard conditions, DeepSeek R1-32B significantly outperformed s1-32B when more rigorous confidence requirements were applied - a distinction that only became evident with the new testing framework.