Chinese researchers have discovered why AI models often struggle with complex reasoning tasks: they tend to abandon promising solutions too quickly, which wastes computing power and lowers accuracy.

Researchers from Tencent AI Lab, Soochow University, and Shanghai Jiao Tong University show that reasoning models like OpenAI's o1 frequently jump between different problem-solving approaches, often starting fresh with phrases like "Alternatively…" This behavior becomes more noticeable as tasks get harder, with models using more computing power when they arrive at wrong answers.
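To make this pattern measurable, a model's response first has to be segmented into individual "thoughts." Here is a minimal sketch of how such a segmentation could work, splitting on marker phrases like "Alternatively"; the MARKERS list and the regex approach are illustrative assumptions, not the paper's exact procedure:

```python
import re

# Hedged sketch: count "thoughts" by splitting a response at marker
# phrases that signal a strategy change. MARKERS and the regex-based
# segmentation are illustrative assumptions, not the paper's exact
# procedure.
MARKERS = r"\b(Alternatively|Wait|However|Let me try another)\b"

def split_thoughts(response: str) -> list[str]:
    """Split a chain-of-thought into segments at strategy-change markers."""
    parts = re.split(MARKERS, response)
    # re.split keeps the captured markers; re-attach each one to the
    # segment it introduces.
    thoughts, current = [], parts[0]
    for marker, segment in zip(parts[1::2], parts[2::2]):
        thoughts.append(current.strip())
        current = marker + segment
    thoughts.append(current.strip())
    return [t for t in thoughts if t]

text = "Let x = 3. Then 2x = 6. Alternatively, try x = 4. Wait, check the constraint."
print(len(split_thoughts(text)))  # -> 3 "thoughts"
```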

Figure: The QwQ-32B-Preview model tries 25 different solution approaches during a single task; frequent strategy changes lead to inefficient use of resources. | Image: Wang et al.

The team found that 70 percent of incorrect answers contained at least one valid line of reasoning that wasn't fully explored. When models gave wrong answers, they used 225 percent more computing tokens and changed strategies 418 percent more often compared to correct solutions.

Figure (two bar charts): Tokens generated and number of "thoughts" (solution approaches) for different models. On average, o1-like LLMs use 225 percent more tokens for incorrect answers than for correct ones, driven by 418 percent more frequent thought changes. | Image: Wang et al.

To quantify this problem, the researchers created an "underthinking" metric (UT score) that measures how efficiently models use their computing tokens when they get answers wrong. Specifically, it captures how many tokens actually contribute to finding the right solution before the model switches to a different approach.
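Read that way, the score is roughly the share of tokens in a wrong answer that no longer advance a correct line of reasoning. Here is a minimal sketch under that assumption; the function name underthinking_score and its inputs are hypothetical reconstructions, not the authors' implementation:

```python
# Hedged sketch: an underthinking (UT) score over incorrect responses.
# useful_tokens = tokens up to the point where a valid line of reasoning
# is abandoned; total_tokens = full response length. Names and formula
# are reconstructions for illustration, not the paper's exact definition.

def underthinking_score(responses: list[tuple[int, int]]) -> float:
    """responses: (useful_tokens, total_tokens) pairs for incorrect answers.
    Returns the mean fraction of wasted tokens; higher = more underthinking."""
    wasted = [1 - useful / total for useful, total in responses]
    return sum(wasted) / len(wasted)

# Toy example: two wrong answers where most tokens were spent after a
# valid approach was abandoned.
print(round(underthinking_score([(120, 900), (300, 2400)]), 2))  # -> 0.87
```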

The team tested this using three challenging sets of problems: math competition questions, college physics problems, and chemistry tasks. They wanted to see how models like QwQ-32B-Preview and Deepseek-R1-671B handle complex reasoning. The results showed that o1-style models often waste tokens by jumping between different approaches too quickly. Surprisingly, models that get more answers right don't necessarily use their tokens more efficiently.

Table: Underthinking scores (UT) and accuracy of various models on MATH500, GPQA, and AIME24. The UT score measures the frequency of strategy changes during the reasoning process. | Image: Wang et al.

Making models stick to their ideas

To address underthinking, the research team developed what they call a "thought switching penalty" (TIP). It works by adjusting the probability scores for certain tokens - the building blocks models use to form responses.

When the model considers using words that signal a strategy change, like "Alternatively", TIP punishes these choices by reducing their likelihood. This pushes the model to explore its current line of reasoning more thoroughly before jumping to a different approach.
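In decoding terms, that amounts to subtracting a constant from the logits of switch-signaling tokens for a limited window. Here is a minimal sketch of the idea; the token list, penalty strength alpha, and window beta are hypothetical stand-ins for the paper's parameters:

```python
# Hedged sketch of a decoding-time thought-switching penalty (TIP).
# Assumption: subtract a constant alpha from the logits of strategy-change
# tokens, but only while the current thought is younger than beta decoding
# steps. SWITCH_TOKENS, alpha, and beta are hypothetical stand-ins for the
# paper's parameters.
SWITCH_TOKENS = {"Alternatively", "Wait", "However"}

def apply_tip(logits: dict[str, float], steps_in_thought: int,
              alpha: float = 3.0, beta: int = 600) -> dict[str, float]:
    """Penalize strategy-change tokens early in a thought."""
    if steps_in_thought >= beta:  # thought is mature; switching is allowed
        return logits
    return {tok: score - alpha if tok in SWITCH_TOKENS else score
            for tok, score in logits.items()}

# Toy example: without TIP, "Alternatively" would be the top token.
logits = {"Therefore": 1.2, "Alternatively": 1.5, "so": 0.7}
penalized = apply_tip(logits, steps_in_thought=42)
print(max(penalized, key=penalized.get))  # -> "Therefore"
```

Because the penalty only lowers scores rather than banning tokens outright, the model can still switch strategies when an alternative is genuinely stronger.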

When using TIP, the QwQ-32B-Preview model solved more MATH500-Hard problems correctly - improving from 82.8 to 84.3 percent accuracy - and showed more consistent reasoning. The team saw similar improvements when they tried it on other tough problem sets like GPQA Diamond and AIME2024.

Table: Results of decoding with the thought-switching penalty (TIP). TIP increases accuracy and lowers the UT score, indicating a modest improvement in the efficiency and stability of the reasoning process. | Image: Wang et al.

These results point to something interesting: getting AI to reason well isn't just about having more computing power. Models also need to learn when to stick with a promising idea. Looking ahead, the research team wants to find ways for models to manage their own problem-solving approach better - knowing when to keep going with an idea and when it's actually time to try something new.

Summary
  • Researchers from China have discovered that reasoning models like OpenAI's o1 often struggle with complex tasks due to "underthinking", where they abandon promising solutions too quickly, leading to inefficient resource use and lower accuracy.
  • The study found that when models gave incorrect answers, they used 225% more computing tokens and changed strategies 418% more often compared to when they provided correct solutions, with 70% of wrong answers containing at least one valid line of reasoning that wasn't fully explored.
  • To address this issue, the team developed a "thought switching penalty" (TIP) method that discourages models from abandoning their current approach too quickly, which improved accuracy across various math and science problems without requiring extensive model modifications.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.