AI systems develop a sense of their own limitations with more time to "think"

Image: Midjourney prompted by THE DECODER

Key Points

  • Researchers from Johns Hopkins University explored the impact of increased computing time on AI decision-making and created a new evaluation framework that addresses limitations in previous AI assessment methods.
  • Providing AI systems with more "thinking time" improves the accuracy of their responses but also enhances their ability to determine when they should refrain from answering. As computing time increased, the models gained a better understanding of their own limitations.
  • The study highlighted differences between the tested models: Although both performed comparably well under standard conditions, DeepSeek R1-32B significantly outperformed s1-32B when more rigorous confidence requirements were applied - a distinction that only became evident with the new testing framework.

A new study from Johns Hopkins University shows how giving AI systems more time to "think" improves their ability to determine when they can and cannot accurately answer questions.

The research team examined how additional computing time affects AI decision-making and developed a new evaluation framework that addresses current weaknesses in AI assessment methods. Traditional evaluation approaches assume AI models should always provide an answer regardless of their confidence level - a practice the team says doesn't reflect real-world scenarios where incorrect answers could have serious consequences.

The team tested two language models - DeepSeek R1-32B and s1-32B - using 30 math problems from the AIME24 dataset. They varied the available computing time (specifically the number of reasoning tokens) and observed how the models behaved at different confidence thresholds.
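In practice, "behaving at a confidence threshold" means the model only commits to an answer when its confidence clears that threshold and abstains otherwise. The Python sketch below illustrates the idea; the `answer_or_abstain` helper, the dummy solver, and the specific budget and threshold values are illustrative assumptions rather than the study's actual code.

```python
from typing import Callable, Optional, Tuple

def answer_or_abstain(
    problem: str,
    token_budget: int,
    threshold: float,
    solve: Callable[[str, int], Tuple[str, float]],
) -> Tuple[Optional[str], float]:
    """Query a model with a fixed reasoning-token budget; keep its answer
    only if the confidence clears the threshold, otherwise abstain."""
    answer, confidence = solve(problem, token_budget)
    if confidence >= threshold:
        return answer, confidence   # model commits to an answer
    return None, confidence         # model abstains ("I don't know")

# Stand-in solver for illustration only; a real run would call
# DeepSeek R1-32B or s1-32B with the given reasoning-token budget.
def dummy_solve(problem: str, token_budget: int) -> Tuple[str, float]:
    confidence = min(0.95, 0.4 + token_budget / 10_000)  # toy rule: more tokens -> higher confidence
    return "42", confidence

for budget in (1_000, 4_000, 8_000):
    ans, conf = answer_or_abstain("AIME24 problem", budget, threshold=0.8, solve=dummy_solve)
    print(budget, ans, round(conf, 2))
```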

The results showed that more "thinking time" improved accuracy but also helped the systems better recognize when they shouldn't attempt an answer. With more time to process, the models developed a clearer sense of which questions they could answer confidently and which ones they couldn't.

Different risk scenarios reveal hidden strengths

The study examined three risk scenarios: "Exam Odds" with no penalties for wrong answers, "Jeopardy Odds" with equal weighting of rewards and penalties, and "High-Stakes Odds" with severe penalties for errors in critical decision contexts.
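To make the difference between the settings concrete, the sketch below scores a single question under each of the three risk schemes. The numeric penalty for the high-stakes case is an illustrative assumption; the article does not state the exact weights the researchers used.

```python
def score(correct: bool, abstained: bool, odds: str) -> float:
    """Score one question under the three risk settings described above.
    Penalty values are illustrative, not the study's exact weights."""
    if abstained:
        return 0.0                  # abstaining never costs points
    if correct:
        return 1.0                  # correct answers are always rewarded
    if odds == "exam":
        return 0.0                  # Exam Odds: no penalty for wrong answers
    if odds == "jeopardy":
        return -1.0                 # Jeopardy Odds: penalty equals the reward
    if odds == "high_stakes":
        return -10.0                # High-Stakes Odds: errors heavily penalized (assumed value)
    raise ValueError(f"unknown odds setting: {odds}")

# Under Exam Odds a model should always answer; under Jeopardy or
# High-Stakes Odds, abstaining on low-confidence questions can score better.
print(score(correct=False, abstained=False, odds="jeopardy"))    # -1.0
print(score(correct=False, abstained=True,  odds="high_stakes")) #  0.0
```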

An interesting distinction emerged between the tested models: While both performed similarly under standard conditions, DeepSeek R1-32B performed significantly better under stricter confidence requirements - a difference only revealed through the new testing framework.

The researchers note that their confidence measurement method, which relies solely on token probabilities, might not capture every aspect of model uncertainty. They also acknowledge that by focusing on mathematical problems in English, they may have missed important variations in other domains and languages.
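The article does not spell out how confidence is derived from token probabilities. One common approach, shown below purely as an assumption, is to use the length-normalized probability of the generated answer tokens.

```python
import math
from typing import List

def sequence_confidence(token_logprobs: List[float]) -> float:
    """Collapse per-token log-probabilities of the answer span into one
    confidence score: the geometric mean of the token probabilities.
    This is a common heuristic, not necessarily the study's exact measure."""
    if not token_logprobs:
        return 0.0
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)

# Example: log-probabilities of the tokens in a short final answer.
print(round(sequence_confidence([-0.05, -0.20, -0.10]), 2))  # 0.89
```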

The team recommends that future work on scaling test time should be evaluated under both "Exam Odds" and "Jeopardy Odds" conditions. This more comprehensive evaluation approach would help developers better understand how their systems perform across different risk contexts.

Source: arXiv