Chinese researchers have built a multimodal AI model that can analyze images, run code, and search the web. Rather than relying solely on knowledge acquired during training, DeepEyesV2 boosts performance by using external tools intelligently, allowing it to outperform larger models in many cases.

During early experiments, Xiaohongshu's research team ran into a core issue. Reinforcement learning alone wasn't enough to produce stable tool use in multimodal tasks. The models initially tried to write Python code for image analysis but often generated faulty snippets. As training continued, they began skipping tools entirely.

Why multimodal models need a new training approach

These challenges led the team to develop a two-stage training pipeline. A cold-start phase teaches the model how to connect image understanding with tool use, followed by reinforcement learning to refine those behaviors.

The examples highlight three major error types in DeepEyesV2: execution mistakes caused by incorrect crop regions, wrong tool selection, and parsing errors when interpreting search or crop results. | Image: Xiaohongshu

To build high-quality demonstrations, the team used leading models such as Gemini 2.5 Pro, GPT-4o, and Claude Sonnet 4 to generate tool-use trajectories. They kept only those with correct answers and clean code. The reinforcement learning reward system stayed intentionally simple, with rewards tied to answer accuracy and output format.
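
The paper describes that reward only at a high level. As a rough sketch, assuming a hypothetical convention in which the model wraps its final answer in <answer> tags, an accuracy-plus-format reward could look something like this (the 1.0 and 0.1 weights are illustrative, not values from the paper):

    import re

    def answer_is_correct(response: str, reference: str) -> bool:
        # Toy correctness check: exact match on the text inside <answer> tags.
        match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
        return match is not None and match.group(1).strip().lower() == reference.strip().lower()

    def format_is_valid(response: str) -> bool:
        # Toy format check: the response must contain exactly one <answer> block.
        return len(re.findall(r"<answer>.*?</answer>", response, re.DOTALL)) == 1

    def compute_reward(response: str, reference: str) -> float:
        # Accuracy dominates; a small bonus rewards well-formed output.
        accuracy = 1.0 if answer_is_correct(response, reference) else 0.0
        formatting = 0.1 if format_is_valid(response) else 0.0
        return accuracy + formatting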

DeepEyesV2 uses three tool categories for multimodal tasks. Code execution handles image processing and numerical analysis. Image search retrieves visually similar content. Text search adds context that isn't visible in the image.
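
The published materials don't spell out the exact tool interface, but the iterative loop can be pictured roughly as in the sketch below, where all three tools are stand-in stubs and the model is any callable that returns either a tool call or a final answer:

    def run_python(code: str) -> str:
        # Stub for the code-execution tool (image processing, numerical analysis).
        return f"[executed: {code}]"

    def image_search(image_ref: str) -> str:
        # Stub for the image-search tool (retrieves visually similar content).
        return f"[similar images to {image_ref}]"

    def text_search(query: str) -> str:
        # Stub for the text-search tool (adds context not visible in the image).
        return f"[web results for {query}]"

    TOOLS = {"python": run_python, "image_search": image_search, "text_search": text_search}

    def agent_loop(model, prompt: str, max_steps: int = 5) -> str:
        # Alternate between model output and tool results until a final answer appears.
        history = prompt
        for _ in range(max_steps):
            step = model(history)  # assumed to return a dict with "type", "tool", "argument", "content"
            if step["type"] == "answer":
                return step["content"]
            result = TOOLS[step["tool"]](step["argument"])
            history += f"\n<tool_result>{result}</tool_result>"
        return "no final answer within the step budget"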

The DeepEyesV2 pipeline integrates image operations, Python execution, and image/text search in a loop that adapts to each query. | Image: Xiaohongshu et al.

A new benchmark puts image-tool coordination to the test

To evaluate this approach, the researchers created RealX-Bench, a benchmark designed to test how well models coordinate visual understanding, web search, and reasoning. One example from the study shows how involved these tasks can be. When asked to identify the type of flower in a picture, the model first crops the relevant region to capture details. It then launches a visual web search using the cropped image to find similar flowers, and finally blends those results to determine the species.
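
Written out as data, that kind of trajectory might look roughly like the following; the field names, crop coordinates, and query strings are illustrative rather than DeepEyesV2's actual output schema:

    # Illustrative multi-step trajectory for the flower example; everything
    # here is a readable approximation, not the model's real output format.
    flower_trajectory = [
        {"step": 1, "tool": "python",
         "call": "image.crop((420, 310, 780, 640))",
         "result": "cropped close-up of the flower"},
        {"step": 2, "tool": "image_search",
         "call": "search_by_image(cropped_region)",
         "result": "visually similar photos labeled as orchids"},
        {"step": 3, "tool": "text_search",
         "call": "search('orchid species identification')",
         "result": "articles describing the moth orchid (Phalaenopsis)"},
        {"step": 4, "tool": "answer",
         "call": None,
         "result": "the flower is a moth orchid"},
    ]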

Alongside accuracy results for recognition, reasoning, search, and integration, a case study shows how DeepEyesV2 identifies a moth orchid by combining cropping, image search, and text search, outperforming simpler one-step approaches. | Image: Xiaohongshu et al.

The results reveal a wide gap between AI models and human performance. Even the strongest proprietary model reached only 46 percent accuracy, while humans scored 70 percent.

Tasks that require all three skills are especially challenging. According to the study, Gemini's accuracy dropped from 46 percent overall to just 27.8 percent when recognition, reasoning, and search all had to work together. The drop underscores how current models can handle individual skills but struggle to integrate them.

In this example, DeepEyesV2 first crops the relevant character (a capybara) from the image, then uses an image search to confirm its identity before answering. | Image: Xiaohongshu et al.

DeepEyesV2 reached 28.3 percent overall accuracy. That puts it ahead of its base model Qwen2.5-VL-7B at 22.3 percent, though still behind the 32-billion and 72-billion-parameter versions. But DeepEyesV2 outperformed other open-source models on tasks that require coordination across all three capabilities.

The analysis also found that search tools play a major role in boosting accuracy, with text search providing the biggest gains. This suggests that many models still struggle to meaningfully incorporate information from visual search alone.

How tool use helps smaller models compete

DeepEyesV2 shows its largest gains in specialized benchmarks. In mathematical reasoning tasks, it scored 52.7 percent on MathVerse, a 7.1-point improvement over its base model.

The model also performs well on search-driven tasks. It reached 63.7 percent on MMSearch, outperforming the dedicated MMSearch-R1 model at 53.8 percent. And in everyday image understanding tasks, the 7-billion-parameter DeepEyesV2 even surpassed Qwen2.5-VL-32B, despite having less than a quarter as many parameters.

DeepEyesV2 delivers strong results across multiple benchmarks, showing how tool-enabled multimodal reasoning can offset smaller model size. | Image: Xiaohongshu et al.

These results suggest that well-structured tool use can offset the limitations of smaller models. Instead of relying on extra parameters, DeepEyesV2 improves performance by pulling in external resources more effectively.

How DeepEyesV2 adapts its strategy to different tasks

The team's analysis shows clear patterns in how the model chooses tools. For visual perception tasks, it often crops the image to isolate the relevant region. For diagram-based math problems, it blends image analysis with numerical computation. For visually grounded knowledge questions, it launches targeted web searches based on the image.

After reinforcement learning, the model became noticeably more adaptive. It used tools less often overall, suggesting it had learned to call them only when needed. But the high variance in tool use across tasks shows that it continues to tailor its strategy to each problem type.

Xiaohongshu has been increasing its footprint in the global AI landscape. Its first open-source language model, dots.llm1, delivered competitive results and outperformed models from Alibaba and Deepseek in efficiency. Its character recognition model, dots.ocr, showed similar capabilities.

The earlier DeepEyes release in May already combined reasoning with multimodal understanding. DeepEyesV2 builds on that foundation, aiming to bring these capabilities together in more agent-like environments. Xiaohongshu, known internationally as RedNote, runs a major Chinese social media and e-commerce platform similar to TikTok.

DeepEyesV2 is available on Hugging Face and GitHub under the Apache License 2.0 and can be used commercially.

Summary
  • Chinese researchers at Xiaohongshu have introduced DeepEyesV2, a multimodal AI model that can analyze images, run code, and search the web.
  • The model outperforms other open-source models, especially when handling tasks that require combining several abilities.
  • DeepEyesV2 is freely accessible on Hugging Face and GitHub, reflecting Xiaohongshu's efforts to become a leading force in multimodal AI agents.