Deepseek OCR 2 cuts visual tokens by 80% and outperforms Gemini 3 Pro on document parsing

Chinese AI company Deepseek has unveiled a new vision encoder that rearranges image information based on meaning rather than processing it in a rigid top-to-bottom, left-to-right pattern.

Traditional vision-language models chop images into small sections and process them in a fixed order, starting from the top left and moving to the bottom right. According to Deepseek's researchers, this approach doesn't match how humans actually see. Our eyes follow flexible patterns based on content. When tracing a spiral, for example, we don't jump across the image line by line. Instead, we follow the shape.

Deepseek OCR 2 tries to mimic that approach. The new DeepEncoder V2 first processes visual tokens based on their content, rearranging them according to context before a language model interprets what's there. The thinking behind it: two processing steps working in sequence could enable genuine understanding of two-dimensional image content.

The comparison shows how DeepEncoder evolved into DeepEncoder V2, replacing the CLIP module with an LLM-based architecture. | Image: Deepseek

Language model replaces traditional vision encoder

At its core, DeepEncoder V2 swaps out the typical CLIP component for a compact language model architecture based on Alibaba's Qwen2 0.5B. The researchers introduced what they call causal flow tokens. These are learnable query tokens that are attached to the visual tokens; each query can access all image information along with the queries that precede it.

According to the paper, this creates a two-stage process. First, the encoder reorganizes visual information based on content. Then the downstream LLM decoder reasons over the already sorted sequence. Only the rearranged causal flow tokens get passed to the decoder, not the original visual tokens.
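The attention pattern described above can be sketched as a boolean mask. This is a minimal illustration, not Deepseek's implementation: the function names `build_causal_flow_mask` and `decoder_inputs`, the token counts, the sequence layout (visual tokens first, queries after), and the assumption that visual tokens attend to one another in both directions but never to queries are all hypothetical choices made for the example.

```python
def build_causal_flow_mask(num_visual, num_queries):
    """Return mask[i][j] == True if token i may attend to token j.

    Assumed layout for this sketch: visual tokens first, followed by the
    learnable causal flow queries.
    """
    total = num_visual + num_queries
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_visual:
                # Every token, visual or query, can see all image tokens.
                mask[i][j] = True
            elif i >= num_visual and j <= i:
                # A query sees itself and earlier queries, never later ones.
                mask[i][j] = True
    return mask


def decoder_inputs(encoder_outputs, num_visual):
    """Keep only the causal flow token outputs for the downstream decoder."""
    return encoder_outputs[num_visual:]


# Toy sizes chosen for readability, not taken from the paper.
mask = build_causal_flow_mask(num_visual=4, num_queries=3)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

In a real encoder, a mask like this would be handed to the attention layers, and only the trailing query positions would be forwarded, which is what `decoder_inputs` mimics on a plain list.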

Fewer tokens, better results

Deepseek OCR 2 uses between 256 and 1,120 visual tokens depending on the image. Comparable models often need 6,000 to 7,000 tokens or more. On OmniDocBench v1.5, a document processing benchmark covering 1,355 pages across nine categories, the model scored 91.09 percent overall, according to the researchers.
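As a quick sanity check on the token figures, the following sketch computes how much smaller the largest reported budget of 1,120 tokens is than the 6,000 and 7,000 token baselines quoted above. The helper name `token_reduction` is made up for this example, and the baselines are the article's round numbers rather than measurements of specific models.

```python
def token_reduction(baseline_tokens, model_tokens):
    """Fraction of visual tokens saved relative to a baseline budget."""
    return 1 - model_tokens / baseline_tokens


MAX_TOKENS = 1120  # upper end of the range reported for Deepseek OCR 2
for baseline in (6000, 7000):  # round baseline budgets quoted in the article
    saved = token_reduction(baseline, MAX_TOKENS)
    print(f"{baseline} -> {MAX_TOKENS}: {saved:.1%} fewer tokens")
```

Even at its largest budget, the model comes in a little over 80 percent below either baseline, which is consistent with the headline figure.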

That's a 3.73 percentage point improvement over its predecessor, Deepseek OCR. The gains are especially clear in recognizing correct reading order. For document parsing, Deepseek OCR 2 also outperformed Gemini 3 Pro with a comparable token budget.

On OmniDocBench v1.5, Deepseek OCR 2 achieved the highest score among end-to-end models at 91.09 percent. | Image: Deepseek

The repetition rate, which measures how often the model falls into redundant text loops, has also improved in practical use. When the model serves as an OCR backend for Deepseek's language models, this rate dropped from 6.25 to 4.17 percent. For batch processing PDFs into training data, it fell from 3.69 to 2.88 percent.
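To put the repetition figures in perspective, this short sketch converts the reported before and after rates into absolute and relative drops. The helper `relative_drop` and the scenario labels are illustrative names, not terms from the paper; the four rates are the ones quoted above.

```python
def relative_drop(before, after):
    """Relative decrease from `before` to `after`, as a fraction of `before`."""
    return (before - after) / before


# (before, after) repetition rates in percent, as reported in the article.
rates = {
    "OCR backend for language models": (6.25, 4.17),
    "batch PDF processing": (3.69, 2.88),
}

for scenario, (before, after) in rates.items():
    absolute = before - after
    relative = relative_drop(before, after)
    print(f"{scenario}: {absolute:.2f} points absolute, {relative:.1%} relative")
```

The backend scenario improves by about a third in relative terms, while the batch PDF scenario improves by a bit more than a fifth.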

The model does have weak spots, though. It performs worse than its predecessor on newspapers, for example. The researchers point to two factors: the lower token limit could cause problems for text-heavy newspaper pages, and training data included only 250,000 newspaper pages, which wasn't enough material in that category.

A step toward unified multimodal processing

The researchers see DeepEncoder V2 as progress toward standardized multimodal processing. In the future, the encoder architecture could evolve to handle text, speech, and images with the same basic framework, adapting only the query tokens based on modality. According to the paper, this approach could eventually lead to genuine understanding of two-dimensional content.

Code and model weights are publicly available on GitHub and Hugging Face.

Deepseek released the first generation of Deepseek OCR in October 2025. That system processes text documents as images and cuts memory requirements by a factor of ten. As a result, language models can retain significantly more context, which is useful for long chat histories or extensive documents. According to Deepseek, the system can handle up to 33 million pages per day and works especially well for generating large training datasets.
