
A new study from Johns Hopkins University shows how giving AI systems more time to "think" improves their ability to determine when they can and cannot accurately answer questions.


The research team examined how additional computing time affects AI decision-making and developed a new evaluation framework that addresses current weaknesses in AI assessment methods. Traditional evaluation approaches assume AI models should always provide an answer regardless of their confidence level - a practice the team says doesn't reflect real-world scenarios where incorrect answers could have serious consequences.

The team tested two language models - DeepSeek R1-32B and s1-32B - using 30 math problems from the AIME24 dataset. They varied the available computing time (specifically the number of reasoning tokens) and observed how the models behaved at different confidence thresholds.
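The paper does not publish its evaluation pipeline, but the basic sweep is straightforward to picture. The sketch below is a minimal illustration, assuming a hypothetical solve(problem, budget) helper that returns an answer together with a confidence score; the budget and threshold grids are made up for the example and are not the study's actual settings.

```python
from typing import Callable, Dict, List, Sequence, Tuple

def sweep_budgets_and_thresholds(
    solve: Callable[[str, int], Tuple[str, float]],   # (problem, token budget) -> (answer, confidence)
    problems: List[Tuple[str, str]],                   # (problem text, ground-truth answer)
    budgets: Sequence[int] = (1024, 2048, 4096, 8192),
    thresholds: Sequence[float] = (0.0, 0.5, 0.8, 0.95),
) -> Dict[Tuple[int, float], Dict[str, float]]:
    """For each reasoning-token budget and confidence threshold, record how often
    the model attempts an answer and how accurate those attempted answers are."""
    results = {}
    for budget in budgets:
        # One generation pass per budget; confidence thresholds are applied afterwards.
        outputs = [(solve(question, budget), truth) for question, truth in problems]
        for t in thresholds:
            # The model only "answers" when its confidence clears the threshold.
            attempted = [(answer, truth) for (answer, conf), truth in outputs if conf >= t]
            correct = sum(answer == truth for answer, truth in attempted)
            results[(budget, t)] = {
                "answer_rate": len(attempted) / len(problems),
                "accuracy_when_answering": correct / len(attempted) if attempted else 0.0,
            }
    return results
```

A threshold of 0.0 corresponds to the traditional setup where the model always answers; higher thresholds reward models that abstain when unsure.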

The results showed that more "thinking time" not only improved accuracy but also helped the systems better recognize when they shouldn't attempt an answer. With more time to process, the models developed a clearer sense of which questions they could answer confidently and which ones they couldn't.


Different risk scenarios reveal hidden strengths

The study examined three risk scenarios: "Exam Odds" with no penalties for wrong answers, "Jeopardy Odds" with equal weighting of rewards and penalties, and "High-Stakes Odds" with severe penalties for errors in critical decision contexts.
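Read as scoring rules, the three settings differ only in how much a wrong answer costs relative to an abstention. The following is a minimal sketch of that idea; the "High-Stakes" penalty value is chosen arbitrarily for illustration and is not taken from the study.

```python
def score(outcomes, wrong_penalty: float) -> float:
    """Average score over outcomes, each 'correct', 'wrong', or 'abstain'.
    Correct answers earn +1, abstentions earn 0, wrong answers cost `wrong_penalty`."""
    points = {"correct": 1.0, "abstain": 0.0, "wrong": -wrong_penalty}
    return sum(points[o] for o in outcomes) / len(outcomes)

outcomes = ["correct", "wrong", "abstain", "correct", "wrong"]

print(score(outcomes, wrong_penalty=0.0))   # "Exam Odds": wrong answers are free
print(score(outcomes, wrong_penalty=1.0))   # "Jeopardy Odds": a wrong answer cancels a correct one
print(score(outcomes, wrong_penalty=10.0))  # "High-Stakes Odds": errors dominate the score
```

Under "Exam Odds" the best strategy is to always answer; once wrong answers carry a cost, a model that knows when to abstain comes out ahead.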

An interesting distinction emerged between the tested models: while both performed similarly under standard conditions, DeepSeek R1-32B performed significantly better than s1-32B under stricter confidence requirements - a difference only revealed through the new testing framework.

The researchers note that their confidence measurement method, which relies solely on token probabilities, might not capture every aspect of model uncertainty. They also acknowledge that by focusing on mathematical problems in English, they may have missed important variations in other domains and languages.
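One common way to turn token probabilities into such a confidence score - used here purely as an illustrative assumption, not necessarily the authors' exact formula - is the geometric mean of the probabilities assigned to the answer tokens:

```python
import math
from typing import Sequence

def answer_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities over the final answer span,
    computed from per-token log-probabilities returned by the model."""
    if not token_logprobs:
        return 0.0
    mean_logprob = sum(token_logprobs) / len(token_logprobs)  # average log-probability
    return math.exp(mean_logprob)                             # back to probability scale

# Example: three answer tokens generated with probabilities 0.9, 0.8, and 0.95
print(answer_confidence([math.log(0.9), math.log(0.8), math.log(0.95)]))  # ~0.88
```

Because this signal only reflects how probable the generated tokens were, it can miss other sources of uncertainty, which is exactly the limitation the researchers flag.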

The team recommends that future work on test-time scaling be evaluated under both "Exam Odds" and "Jeopardy Odds" conditions. This more comprehensive evaluation approach would help developers better understand how their systems perform across different risk contexts.

Summary
  • Researchers from Johns Hopkins University explored the impact of increased computing time on AI decision-making and created a new evaluation framework that addresses limitations in previous AI assessment methods.
  • Providing AI systems with more "thinking time" improves the accuracy of their responses but also enhances their ability to determine when they should refrain from answering. As computing time increased, the models gained a better understanding of their own limitations.
  • The study highlighted differences between the tested models: Although both performed comparably well under standard conditions, DeepSeek R1-32B significantly outperformed s1-32B when more rigorous confidence requirements were applied - a distinction that only became evident with the new testing framework.