Ad
Skip to content

Deepseek's hybrid reasoning model V3.1-Terminus delivers higher scores on tool-based agent tasks

Image description
Sora prompted by THE DECODER

Key Points

  • Deepseek has released V3.1-Terminus, an updated hybrid AI model that delivers more consistent results across languages and achieves notable gains on tool usage benchmarks like BrowseComp and Terminal-bench.
  • The model maintains its two operating modes—one optimized for complex tasks involving tools and another for straightforward conversations—and can process up to 128,000 tokens in a single context.
  • With a price of $1.68 per million output tokens, V3.1-Terminus is significantly less expensive than similar offerings from OpenAI and Anthropic, and its open-source weights are available on Hugging Face.

Deepseek has rolled out V3.1-Terminus, an improved version of its hybrid AI model Deepseek-V3.1.

V3.1-Terminus now does a better job distinguishing between Chinese and English, and eliminates errors like random special characters. Deepseek has also tweaked its built-in agents, including code and search agents, for more reliable results, the company says.

Benchmark results show the biggest gains in tasks that require tool use. On the BrowseComp benchmark, V3.1-Terminus jumps from 30.0 to 38.5 points. On Terminal-bench, it goes from 31.3 to 36.7.

Deepseek's chart also indicates a tradeoff: performance improves on the English-language BrowseComp, while BrowseComp-ZH on the Chinese web slips slightly. For pure reasoning tasks without tool use, the improvements are more modest.

Ad
DEC_D_Incontent-1

Tabular comparison of DeepSeek V3.1 vs. V3.1 Terminus in reasoning and tool benchmarks; Terminus significantly increases tool scores.
V3.1-Terminus posts larger gains on agent tasks that use external tools. Scores climb on BrowseComp, while BrowseComp-ZH dips a bit, hinting at a tradeoff between English- and Chinese-web performance. BrowseComp measures multi-step live web searches. | Image: Deepseek

The model is available through app, web, and API. Open-source weights can be found on Hugging Face under an MIT license.

Two thinking modes and aggressive pricing

V3.1-Terminus builds on Deepseek-V3.1, first released in August, which introduced two separate modes: a "thinking" mode (Deepseek-reasoner) for complex, tool-based tasks, and a "non-thinking" mode (Deepseek-chat) for straightforward conversations. Both modes support a context window of up to 128,000 tokens.

The model was trained on an additional 840 billion tokens, using a new tokenizer and updated prompt templates. Deepseek-V3.1 has already posted strong results against hybrid models from OpenAI and Anthropic, and outperformed Deepseek's own pure reasoning model R1.

Deepseek has kept its aggressive pricing from the initial release: output tokens still cost $1.68 per million, well below GPT-5 ($10.00) and Claude Opus 4.1 (up to $75.00). The API charges $0.07 per million tokens for cache hits and $0.56 for cache misses.

Ad
DEC_D_Incontent-2

Like other Chinese AI models, Deepseek's latest release is subject to state censorship, making it a propaganda tool for the Chinese government on political topics. The Trump administration has proposed similar restrictions for US-based models. According to a recent Deepseek code review, these interventions can directly impact model performance.

AI News Without the Hype – Curated by Humans

As a THE DECODER subscriber, you get ad-free reading, our weekly AI newsletter, the exclusive "AI Radar" Frontier Report 6× per year, access to comments, and our complete archive.