
Meta and UC San Diego have introduced DeepConf (Deep Think with Confidence), a new inference method designed to make mathematical reasoning in language models faster and more accurate.


So-called reasoning language models usually break down tough problems by generating multiple solution paths, then picking the answer that comes up most often. But every path gets equal weight, even when some are clearly wrong. That means a weak but frequent solution can win out, while each extra path adds computational cost without necessarily improving the answer.
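That baseline is usually called self-consistency or majority voting. A minimal sketch in Python (function and variable names are illustrative, not from the paper):

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent final answer across sampled reasoning paths.

    Every path counts equally, so several weak paths converging on the
    same wrong answer can outvote a few correct ones.
    """
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Three weak paths agree on a wrong answer and outvote two correct ones.
print(majority_vote(["42", "17", "17", "17", "42"]))  # -> 17
```

This is exactly the failure mode DeepConf targets: the vote has no notion of how sure the model was along each path.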

How models show their uncertainty

DeepConf tackles this by measuring how sure the model is about each prediction. When the model puts most of its probability on a single next word, it's signaling confidence in that path. If it's unsure, the probability gets spread out over many options. The more focused the probabilities, the higher the model's confidence. The research team found that these high-confidence paths are much more likely to be correct.
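One way to turn a next-token distribution into such a score is the average negative log-probability of the k most likely tokens, which matches the paper's definition in spirit; the sketch below is illustrative, and the exact formula and window sizes are in the paper:

```python
import math

def token_confidence(probs, k=5):
    """Score one prediction step by the mean negative log-probability of
    the k most likely next tokens.

    When the distribution is peaked, the runner-up tokens carry near-zero
    probability, so their log-probs are strongly negative and the score
    is high. A flat, uncertain distribution yields a lower score.
    """
    top_k = sorted(probs, reverse=True)[:k]
    return -sum(math.log(p) for p in top_k) / len(top_k)

peaked = [0.97, 0.01, 0.01, 0.005, 0.005]  # model is sure of one token
flat = [0.2, 0.2, 0.2, 0.2, 0.2]           # model is unsure
```

Here `token_confidence(peaked)` comes out higher than `token_confidence(flat)`, mirroring the observation that focused probability mass signals a confident path.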

Left: Metrics either check the whole response or zoom in on specific sections, like the conclusion. Right: DeepConf's two-stage process first filters out low-confidence paths, then picks the final answer by weighted voting among the strongest candidates. | Image: Fu et al.
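The filter-then-vote idea can be sketched as follows; the data layout, cutoff, and confidence values are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

def filter_and_weighted_vote(traces, keep_percent=90):
    """Offline two-stage sketch: drop the least confident traces, then
    let the survivors vote on the final answer, each vote weighted by
    the trace's confidence.

    `traces` is a list of (answer, confidence) pairs.
    """
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    keep = max(1, round(len(ranked) * keep_percent / 100))
    survivors = ranked[:keep]

    scores = defaultdict(float)
    for answer, conf in survivors:
        scores[answer] += conf
    return max(scores, key=scores.get)

# Plain majority voting would pick the frequent-but-weak "17";
# confidence weighting lets the two strong "42" traces win.
traces = [("17", 0.4), ("17", 0.5), ("42", 2.8), ("42", 2.6), ("17", 0.3)]
print(filter_and_weighted_vote(traces, keep_percent=60))  # -> 42
```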

Most older methods just averaged confidence across the entire reasoning chain. DeepConf takes it further by analyzing individual sections, making it easier to spot and remove weak links or error-prone segments.
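Section-level scoring can be approximated with a sliding window over per-token confidence scores; the window length here is tiny for readability, while the paper uses much longer windows:

```python
def lowest_group_confidence(token_confs, window=3):
    """Slide a fixed window over per-token confidence scores and return
    the minimum window average.

    Averaging over the whole chain can hide a brief error-prone stretch;
    taking the worst window makes a single weak segment drag the path's
    score down, so it can be filtered out.
    """
    if len(token_confs) < window:
        return sum(token_confs) / len(token_confs)
    return min(
        sum(token_confs[i:i + window]) / window
        for i in range(len(token_confs) - window + 1)
    )

# A mostly confident chain with one shaky segment scores low overall.
print(lowest_group_confidence([2.1, 2.0, 0.3, 0.2, 1.9], window=2))  # -> 0.25
```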


Two operating modes

DeepConf comes in two modes. In offline mode, it generates all reasoning paths up front, then filters out or down-weights the low-confidence ones before choosing a final answer. In online mode, DeepConf monitors confidence while each solution path is being generated and stops a path early once its confidence drops below a threshold. That threshold is calibrated on 16 warm-up paths: the aggressive variant sets it so only roughly the top 10 percent most confident paths would survive, while the conservative variant keeps the top 90 percent.

When the model says things like "Wait, let me double check," the calculated confidence drops. If it dips below the threshold (s), DeepConf cuts off that solution path instead of letting it finish. | Image: Fu et al.

The researchers tested DeepConf on five open-source models, from DeepSeek-R1-8B up to gpt-oss-120B, using math competition benchmarks such as AIME24/25, HMMT25, and BRUMO25, along with scientific reasoning tasks.

On AIME 2025, DeepConf reached 99.9 percent accuracy in offline mode with gpt-oss-120B. In the leaner online mode, it still hit 97.9 percent accuracy and slashed token usage by 84.7 percent compared to regular majority voting.

These charts show accuracy versus computational cost. DeepConf (green) hits top accuracy while using far fewer tokens than majority voting (brown), consistently outperforming the baseline. | Image: Fu et al.

Each experiment was run 64 times to make the results statistically robust. In math tasks, the aggressive setting cut token usage by as much as 84.7 percent, while the conservative mode saved up to 59 percent, typically without sacrificing accuracy. These reductions are measured across all tokens generated in every run, so the savings are largest when many weak solution paths get stopped early.

DeepConf doesn't require extra model training and can be dropped into systems like vLLM with just a few lines of code.


Limitations and outlook

There are some limits. If a model is very confident in a wrong answer, DeepConf might not filter it out, especially in aggressive mode. The researchers recommend the conservative version for more stable results, even if it's a bit less efficient. The code is available on GitHub.

Reasoning models have become the go-to for getting reliable answers from AI. OpenAI, for example, routes harder questions to a special "thinking" mode in GPT-5, though this switch doesn't always work as intended.

Some studies now question whether investing in "thinking" models is worth it, especially given rising energy costs. Approaches like DeepConf, which match or beat standard accuracy with far less computation, could play a key role in the future of language models.

Summary
  • Meta and UC San Diego have introduced DeepConf, a new approach that relies on language models' internal uncertainty signals to improve the efficiency and precision of mathematical reasoning.
  • DeepConf boosts accuracy to as high as 99.9 percent while cutting the number of tokens used by up to 85 percent, by filtering out weaker reasoning paths early based on the model's confidence.
  • The method achieved consistent savings across five open source models and several benchmarks without requiring extra training, though it struggles when models are highly confident in incorrect answers.
Jonathan writes for THE DECODER about how AI tools can improve both work and creative projects.