
Meta and UC San Diego have introduced DeepConf (Deep Think with Confidence), a new inference method designed to make mathematical reasoning in language models faster and more accurate.


So-called reasoning language models usually break down tough problems by generating multiple solution paths, then picking the answer that comes up most often. But every path gets equal weight, even when some are clearly wrong. That means a weak but frequent solution can win out, while each extra path adds computational cost without necessarily improving the answer.
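That baseline is usually called self-consistency or majority voting. A minimal sketch in Python (function and variable names are illustrative, not from the paper):

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent final answer across sampled reasoning paths.

    Every path counts equally, so several weak paths converging on the
    same wrong answer can outvote a few correct ones.
    """
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Three weak paths agree on a wrong answer and outvote two correct ones.
print(majority_vote(["42", "17", "17", "17", "42"]))  # -> 17
```

This is exactly the failure mode DeepConf targets: the vote has no notion of how sure the model was along each path.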

How models show their uncertainty

DeepConf tackles this by measuring how sure the model is about each prediction. When the model puts most of its probability on a single next word, it's signaling confidence in that path. If it's unsure, the probability gets spread out over many options. The more focused the probabilities, the higher the model's confidence. The research team found that these high-confidence paths are much more likely to be correct.
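One way to turn a next-token distribution into such a score is the average negative log-probability of the k most likely tokens, which matches the paper's definition in spirit; the sketch below is illustrative, and the exact formula and window sizes are in the paper:

```python
import math

def token_confidence(probs, k=5):
    """Score one prediction step by the mean negative log-probability of
    the k most likely next tokens.

    When the distribution is peaked, the runner-up tokens carry near-zero
    probability, so their log-probs are strongly negative and the score
    is high. A flat, uncertain distribution yields a lower score.
    """
    top_k = sorted(probs, reverse=True)[:k]
    return -sum(math.log(p) for p in top_k) / len(top_k)

peaked = [0.97, 0.01, 0.01, 0.005, 0.005]  # model is sure of one token
flat = [0.2, 0.2, 0.2, 0.2, 0.2]           # model is unsure
```

Here `token_confidence(peaked)` comes out higher than `token_confidence(flat)`, mirroring the observation that focused probability mass signals a confident path.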

Left: Metrics either check the whole response or zoom in on specific sections, like the conclusion. Right: DeepConf's two-stage process first filters out low-confidence paths, then picks the final answer by weighted voting among the strongest candidates. | Image: Fu et al.
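The filter-then-vote idea can be sketched as follows; the data layout, cutoff, and confidence values are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

def filter_and_weighted_vote(traces, keep_percent=90):
    """Offline two-stage sketch: drop the least confident traces, then
    let the survivors vote on the final answer, each vote weighted by
    the trace's confidence.

    `traces` is a list of (answer, confidence) pairs.
    """
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    keep = max(1, round(len(ranked) * keep_percent / 100))
    survivors = ranked[:keep]

    scores = defaultdict(float)
    for answer, conf in survivors:
        scores[answer] += conf
    return max(scores, key=scores.get)

# Plain majority voting would pick the frequent-but-weak "17";
# confidence weighting lets the two strong "42" traces win.
traces = [("17", 0.4), ("17", 0.5), ("42", 2.8), ("42", 2.6), ("17", 0.3)]
print(filter_and_weighted_vote(traces, keep_percent=60))  # -> 42
```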

Most older methods just averaged confidence across the entire reasoning chain. DeepConf takes it further by analyzing individual sections, making it easier to spot and remove weak links or error-prone segments.
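Section-level scoring can be approximated with a sliding window over per-token confidence scores; the window length here is tiny for readability, while the paper uses much longer windows:

```python
def lowest_group_confidence(token_confs, window=3):
    """Slide a fixed window over per-token confidence scores and return
    the minimum window average.

    Averaging over the whole chain can hide a brief error-prone stretch;
    taking the worst window makes a single weak segment drag the path's
    score down, so it can be filtered out.
    """
    if len(token_confs) < window:
        return sum(token_confs) / len(token_confs)
    return min(
        sum(token_confs[i:i + window]) / window
        for i in range(len(token_confs) - window + 1)
    )

# A mostly confident chain with one shaky segment scores low overall.
print(lowest_group_confidence([2.1, 2.0, 0.3, 0.2, 1.9], window=2))  # -> 0.25
```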


Two operating modes

DeepConf comes in two modes. In offline mode, it generates all reasoning paths up front, then filters out or down-weights the low-confidence ones before choosing a final answer. In online mode, DeepConf monitors confidence while each solution path is being generated and stops a path early once its confidence drops below a threshold. That threshold is calibrated on 16 warm-up paths: the aggressive variant sets it so only roughly the top 10 percent most confident paths would survive, while the conservative variant keeps the top 90 percent.

When the model says things like "Wait, let me double check," the calculated confidence drops. If it dips below the threshold (s), DeepConf cuts off that solution path instead of letting it finish. | Image: Fu et al.

The researchers tested DeepConf on five open-source models, from DeepSeek-R1-8B up to gpt-oss-120B, using math competition benchmarks such as AIME24/25, HMMT25, and BRUMO25, along with scientific reasoning tasks.

On AIME 2025, DeepConf reached 99.9 percent accuracy in offline mode with gpt-oss-120B. In the leaner online mode, it still hit 97.9 percent accuracy and slashed token usage by 84.7 percent compared to regular majority voting.

These charts show accuracy versus computational cost. DeepConf (green) hits top accuracy while using far fewer tokens than majority voting (brown), consistently outperforming the baseline. | Image: Fu et al.

Each experiment was run 64 times to make the results statistically robust. In math tasks, the aggressive setting cut token usage by as much as 84.7 percent, while the conservative mode saved up to 59 percent, typically without sacrificing accuracy. These reductions are measured across all tokens generated in every run, so the savings are largest when many weak solution paths get stopped early.

DeepConf doesn't require extra model training and can be dropped into systems like vLLM with just a few lines of code.


Limitations and outlook

There are some limits. If a model is very confident in a wrong answer, DeepConf might not filter it out, especially in aggressive mode. The researchers recommend the conservative version for more stable results, even if it's a bit less efficient. The code is available on GitHub.

Reasoning models have become the go-to for getting reliable answers from AI. OpenAI, for example, routes harder questions to a special "thinking" mode in GPT-5, though this switch doesn't always work as intended.

Some studies now question whether investing in "thinking" models is worth it, especially given rising energy costs. Approaches like DeepConf, which match or beat standard accuracy with far less computation, could play a key role in the future of language models.

Summary
  • Meta and UC San Diego have introduced DeepConf, a new approach that relies on language models' internal uncertainty signals to improve the efficiency and precision of mathematical reasoning.
  • DeepConf boosts accuracy to as high as 99.9 percent while cutting the number of tokens used by up to 85 percent, by filtering out weaker reasoning paths early based on the model's confidence.
  • The method achieved consistent savings across five open source models and several benchmarks without requiring extra training, though it struggles when models are highly confident in incorrect answers.
Jonathan writes for THE DECODER about how AI tools can improve both work and creative projects.