Ad
Skip to content

As agentic AI pushes rivals to raise prices and cap usage, Deepseek V4 is a good-enough model for almost nothing

Image description
Nano Banana Pro prompted by THE DECODER

Key Points

  • Chinese AI lab Deepseek has released V4-Pro and V4-Flash as open-weight models with up to 1.6 trillion parameters and a one-million-token context window.
  • A new architecture dramatically cuts the compute required for long contexts, letting Deepseek price both models well below competitors like OpenAI, Google, and Anthropic.
  • The models were trained on up to 33 trillion tokens and refined through distillation from in-house specialist models. They're built specifically for agentic tasks and run on both Nvidia GPUs and Huawei's Ascend chips.

Chinese AI lab Deepseek has released V4-Pro and V4-Flash, two new models with up to 1.6 trillion parameters and a one-million-token context window. Pricing sits well below OpenAI, Google, and Anthropic. The accompanying technical paper also reveals details about training data, distillation, and hardware.

Deepseek has published preview versions of V4-Pro and V4-Flash as open weights under the MIT license. V4-Pro has 1.6 trillion total parameters with 49 billion active, while V4-Flash comes in at 284 billion total with 13 billion active. Both are mixture-of-experts models with a one-million-token context window. Both are available on Hugging Face.

V4-Pro is now the largest open-weights model available, surpassing Kimi K2.6 (1.1 trillion) and GLM-5.1 (754 billion) by a wide margin. It's also Deepseek's first new architecture since V3. Every model released in between - V3.1, V3.2, R1, and R1 0528 - was still built on the original V3 design with 685 billion parameters.

Long contexts now require far less compute

The key innovation is a new hybrid attention architecture that combines token compression with Deepseek's sparse attention. According to the technical report, V4-Pro needs just 27 percent of the FLOPs and 10 percent of the KV cache compared to V3.2 when processing a one-million-token context. V4-Flash pushes those numbers even lower - down to 10 percent of the FLOPs and 7 percent of the KV cache.

Ad
DEC_D_Incontent-1

On Artificial Analysis's GDPval-AA benchmark, V4-Pro leads all open-weights models with 1,554 Elo points, ahead of GLM-5.1 (1,535) and Kimi K2.6 (1,484). That's a jump of roughly 355 Elo points over V3.2. Deepseek acknowledges in the paper, though, that V4-Pro "falls slightly behind GPT-5.4 and Gemini-3.1-Pro" and trails frontier models by about three to six months. Full testing by Artificial Analysis is still underway, but some of Deepseek's own benchmark results show the gap. OpenAI and Anthropic have since released new models with GPT-5.5 and Opus 4.7.

These efficiency gains explain the aggressive pricing. V4-Flash costs just $0.14 per million input tokens and $0.28 per million output tokens according to Deepseek's pricing page, making it cheaper than OpenAI's GPT-5.4 Nano. V4-Pro comes in at $1.74 and $3.48, significantly undercutting Gemini 3.1 Pro, GPT-5.5, and Claude Sonnet 4.6.

Model Input ($/M) Output ($/M)
Deepseek V4 Flash 0,14 0,28
Deepseek V4 Pro 1,74 3,48
GPT-5.4 2,50 15
GPT-5.5 5 30
Claude Sonnet 4.6 3 15
Claude Opus 4.6 5 25
Claude Opus 4.7 5 25

Training relies on massive data and in-house distillation

The team is relatively vague about the pre-training corpus: V4-Flash saw 32 trillion tokens, V4-Pro 33 trillion. The focus was on more multilingual data, carefully curated scientific papers and technical reports, and agentic data during mid-training. Web data was filtered against "batched auto-generated and templated content."

Ad
DEC_D_Incontent-2

The paper doesn't name specific datasets or license sources. The frequently raised suspicion that Deepseek distills directly from GPT or Claude finds no confirmation in the report, unsurprisingly.

Distillation does play a central role in post-training, though. Deepseek has completely replaced its previous mixed reinforcement learning phase with on-policy distillation. According to the paper, the lab first trains more than ten specialized in-house models for math, code, agents, and instruction following using supervised fine-tuning and GRPO. A single student model then learns from all of these in-house teachers.

Models optimized for agentic tasks, validated on Huawei hardware

Deepseek built V4 specifically for agentic workflows. The company says the models are integrated with tools like Claude Code, OpenClaw, and OpenCode, and are already being used internally for agentic coding. The API supports both OpenAI- and Anthropic-compatible interfaces.

The paper is more specific about hardware: the expert parallelism scheme has been validated on "Nvidia GPUs and Huawei Ascend NPUs." The open-source mega-kernel MegaMoE is CUDA-based, and Deepseek replaced Nvidia's cuBLAS library with its own DeepGEMM.

Separately, Huawei has announced that its Ascend Supernode, built on Ascend 950 AI chips, fully supports the V4 models.

AI News Without the Hype – Curated by Humans

Subscribe to THE DECODER for ad-free reading, a weekly AI newsletter, our exclusive "AI Radar" frontier report six times a year, full archive access, and access to our comment section.