
Chinese AI startup DeepSeek recently released DeepSeek-V2, a large Mixture-of-Experts (MoE) language model that aims to achieve an optimal balance between high performance, lower training cost, and efficient inference.

The open-source model has 236 billion parameters and supports a context length of 128,000 tokens. Compared to its predecessor, DeepSeek 67B, DeepSeek-V2 cuts training costs by 42.5 percent, reduces the key-value cache by 93.3 percent, and raises the maximum generation throughput to 5.76 times that of the older model.

DeepSeek-V2 achieves high benchmark performance while activating significantly fewer parameters during inference than many of its competitors. | Image: DeepSeek

It is based on the proven Transformer architecture, but with significant innovations in the attention module and feed-forward network. The model uses two key techniques: Multi-Head Latent Attention (MLA) and the DeepSeekMoE architecture.

MLA jointly compresses keys and values into a compact latent representation, which shrinks the key-value cache and speeds up inference. The DeepSeekMoE architecture splits the feed-forward network into many specialized experts and reduces redundancy between them, so that each token is processed by a small set of experts rather than by one large dense network.
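To make the key-value compression described above more concrete, here is a minimal sketch of the general idea, not DeepSeek's actual implementation. All sizes, matrix names, and helper functions are assumptions chosen for illustration, and details of the real model such as positional encoding are left out. The point is only that one small latent vector is cached per token and that keys and values are rebuilt from it when attention is computed.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of random weights (stand-in for trained weights)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes for illustration only; these are NOT DeepSeek-V2's real dimensions.
D_MODEL = 16   # hidden size of one token
N_HEADS = 4    # number of attention heads
D_HEAD = 4     # size of each head
D_LATENT = 4   # size of the shared compressed key-value latent

rng = random.Random(0)
W_down = random_matrix(D_LATENT, D_MODEL, rng)            # hidden state -> latent
W_up_k = random_matrix(N_HEADS * D_HEAD, D_LATENT, rng)   # latent -> keys
W_up_v = random_matrix(N_HEADS * D_HEAD, D_LATENT, rng)   # latent -> values


def cache_entry(hidden):
    """What is stored per token: a single latent shared by keys and values."""
    return matvec(W_down, hidden)


def expand(latent):
    """Rebuild the keys and values of all heads from the cached latent."""
    return matvec(W_up_k, latent), matvec(W_up_v, latent)


hidden = [rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)]
latent = cache_entry(hidden)
keys, values = expand(latent)

standard_floats = len(keys) + len(values)   # a plain cache stores both in full
latent_floats = len(latent)                 # the compressed cache stores only this
saving = 1 - latent_floats / standard_floats

print(f"floats cached per token: {standard_floats} -> {latent_floats}")
print(f"cache reduction in this toy setup: {saving:.1%}")
```

With these toy sizes the cache per token shrinks from 32 numbers to 4. The 93.3 percent figure reported by DeepSeek is measured against DeepSeek 67B and cannot be derived from this sketch.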


The researchers trained DeepSeek-V2 on a massive corpus of 8.1 trillion tokens, with a focus on high-quality Chinese data. As a result, the model is particularly strong on Chinese benchmarks.

The company has also developed two chat variants using Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL): DeepSeek-V2 Chat (SFT) and DeepSeek-V2 Chat (RL).

The benchmark results published by DeepSeek-AI show that DeepSeek-V2 Chat achieves top performance among open-source models while activating only 21 billion of its 236 billion parameters per token. According to these results, it is currently the best-performing open-source MoE language model, and it beats Mixtral 8x22B and LLaMA 3 70B in some benchmarks.
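The gap between total and activated parameters comes from expert routing: for each token, a router selects a few experts and the rest stay idle. The following is a minimal sketch of top-k routing in general, not DeepSeek's routing code. The expert functions, router scores, and the renormalization of the selected weights are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn raw router scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)   # renormalize over the chosen experts (one common variant)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, experts, router_scores, k):
    """Run only the selected experts and mix their outputs by router weight."""
    chosen = route_top_k(router_scores, k)
    output = sum(weight * experts[i](token) for i, weight in chosen)
    return output, [i for i, _ in chosen]


# Eight hypothetical experts; each one simply scales its input.
experts = [lambda x, scale=s: scale * x for s in range(1, 9)]
# Hypothetical router scores for a single token.
scores = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]

weights = route_top_k(scores, k=2)
output, used = moe_layer(1.0, experts, scores, k=2)
active_fraction = len(used) / len(experts)

# Figures from the article: 21 billion activated out of 236 billion parameters.
reported_fraction = 21 / 236

print(f"experts used: {used} ({active_fraction:.0%} of the toy layer)")
print(f"reported activated share for DeepSeek-V2: {reported_fraction:.1%}")
```

In this toy layer two of eight experts run for the token. For DeepSeek-V2, the reported figures work out to roughly 9 percent of all parameters being active per token.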

Compared to other Mixture-of-Experts models, the chat version of DeepSeek-V2 outperforms bigger models in English language tasks. | Image: DeepSeek

In particular, the RL version of DeepSeek-V2 Chat excels at understanding Chinese, performing on par with GPT-4. However, the reasoning ability of the model still falls short of larger models such as Ernie 4.0 and the GPT-4 models.

In Chinese, DeepSeek-V2 can keep up with OpenAI models in chat mode. | Image: DeepSeek

Open Source and affordable API

Despite being a relatively young startup, founded in 2023, DeepSeek offers a set of large language models for various use cases.


The models are open source and available for free local use. The company also provides a programming interface for easy integration into products without the need for extensive computing power.

DeepSeek's chat API is cheap, charging only $0.14 per million input tokens and $0.28 per million output tokens. By comparison, the next cheapest commercial language model, Claude 3 Haiku, costs $0.25 per million input tokens and $1.25 per million output tokens.
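To see what these prices mean in practice, the short calculation below compares both price lists for the same workload. The prices are the ones quoted above. The token counts and the function name are hypothetical and only serve as an example.

```python
# Prices in US dollars per million tokens: (input, output), as quoted in the article.
PRICES_USD_PER_MILLION = {
    "DeepSeek-V2 chat API": (0.14, 0.28),
    "Claude 3 Haiku": (0.25, 1.25),
}


def request_cost(model, input_tokens, output_tokens):
    """Return the cost in US dollars for the given token counts."""
    input_price, output_price = PRICES_USD_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 10 million input tokens and 2 million output tokens.
INPUT_TOKENS = 10_000_000
OUTPUT_TOKENS = 2_000_000

costs = {
    model: request_cost(model, INPUT_TOKENS, OUTPUT_TOKENS)
    for model in PRICES_USD_PER_MILLION
}

for model, cost in costs.items():
    print(f"{model}: ${cost:.2f}")
```

For this example workload, the bill comes to about $1.96 with DeepSeek's API and $5.00 with Claude 3 Haiku. The gap widens as the share of output tokens grows, because output pricing differs the most.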

Since last year, DeepSeek has quickly released various models with different parameter sizes for specializations such as coding or mathematics. | Image: Screenshot by THE DECODER

Like other large language models, DeepSeek-V2 has some limitations, such as a lack of recent information in its training data and the presence of hallucinations. But the company plans to continue investing in large open-source models to move closer to its goal of artificial general intelligence (AGI).

For future iterations, DeepSeek aims to further improve cost efficiency and performance using the MoE architecture. While DeepSeek-V2 excels in Chinese language tasks and specific domains such as math and coding, the startup is likely to focus on higher-quality data in English and other languages for its next release, DeepSeek-V3, to better compete with models such as Claude 3 and GPT-4 on a global scale.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

DeepSeek-V2 is available on Hugging Face and can be tried out for free after registering via email or Google account at

  • DeepSeek-V2 is an open-source language model with 236 billion parameters, based on a revised Mixture-of-Experts architecture that seeks an optimal balance between performance, training cost, and inference efficiency.
  • The model extends the Transformer architecture with new techniques in the attention module and feed-forward network. It was trained on a massive corpus of 8.1 trillion tokens, with a focus on high-quality Chinese data.
  • In benchmarks, DeepSeek-V2 outperforms other open-source models and even reaches the performance of GPT-4 and Ernie 4.0 in some tests, especially in Chinese. However, it lags behind the top models in areas such as reasoning.
Jonathan works as a technology journalist who focuses primarily on how easily AI can already be used today and how it can support daily life.