Deepseek's hybrid reasoning model V3.1-Terminus delivers higher scores on tool-based agent tasks

Deepseek has rolled out V3.1-Terminus, an improved version of its hybrid AI model Deepseek-V3.1.

V3.1-Terminus now does a better job distinguishing between Chinese and English, and eliminates errors like random special characters. Deepseek has also tweaked its built-in agents, including code and search agents, for more reliable results, the company says.

Benchmark results show the biggest gains in tasks that require tool use. On the BrowseComp benchmark, V3.1-Terminus jumps from 30.0 to 38.5 points. On Terminal-bench, it goes from 31.3 to 36.7.

Deepseek's chart also indicates a tradeoff: performance improves on the English-language BrowseComp, while BrowseComp-ZH on the Chinese web slips slightly. For pure reasoning tasks without tool use, the improvements are more modest.

Tabular comparison of DeepSeek V3.1 vs. V3.1 Terminus in reasoning and tool benchmarks; Terminus significantly increases tool scores. — V3.1-Terminus posts larger gains on agent tasks that use external tools. Scores climb on BrowseComp, while BrowseComp-ZH dips a bit, hinting at a tradeoff between English- and Chinese-web performance. BrowseComp measures multi-step live web searches. | Image: Deepseek

The model is available through app, web, and API. Open-source weights can be found on Hugging Face under an MIT license.

Two thinking modes and aggressive pricing

V3.1-Terminus builds on Deepseek-V3.1, first released in August, which introduced two separate modes: a "thinking" mode (Deepseek-reasoner) for complex, tool-based tasks, and a "non-thinking" mode (Deepseek-chat) for straightforward conversations. Both modes support a context window of up to 128,000 tokens.

The model was trained on an additional 840 billion tokens, using a new tokenizer and updated prompt templates. Deepseek-V3.1 has already posted strong results against hybrid models from OpenAI and Anthropic, and outperformed Deepseek's own pure reasoning model R1.

Deepseek has kept its aggressive pricing from the initial release: output tokens still cost $1.68 per million, well below GPT-5 ($10.00) and Claude Opus 4.1 (up to $75.00). The API charges $0.07 per million tokens for cache hits and $0.56 for cache misses.

Like other Chinese AI models, Deepseek's latest release is subject to state censorship, making it a propaganda tool for the Chinese government on political topics. The Trump administration has proposed similar restrictions for US-based models. According to a recent Deepseek code review, these interventions can directly impact model performance.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Recommendation

AI in practice

Deepseek's hybrid reasoning model V3.1-Terminus delivers higher scores on tool-based agent tasks

Two thinking modes and aggressive pricing

Nvidia positions GR00T N1 to dominate robotics ecosystem

Deepseek's OCR system compresses image-based text so AI can handle much longer documents

Deepseek slashes API prices by up to 75 percent with its latest V3.2 model

Deepseek says training its R1 model cost just $294,000

Gemini 3 Pro tops new AI reliability benchmark, but hallucination rates remain high

Researchers push "Context Engineering 2.0" as the road to lifelong AI memory

German court deepens the split on AI and copyright with its latest ruling

Deepseek's hybrid reasoning model V3.1-Terminus delivers higher scores on tool-based agent tasks

Two thinking modes and aggressive pricing

Share

Bank details