Alibaba's AI research team Qwen has unveiled QVQ-72B-Preview, a new open-source model that can analyze images and draw conclusions from them. While it's still in the experimental phase, early tests show it's particularly good at visual reasoning tasks.

The model solves problems by thinking through them step by step, similar to what we know from other so-called reasoning models like OpenAI's o1 or Google's Flash Thinking. When users input an image and instructions, the system analyzes the information, takes time to reflect when necessary, and delivers answers with confidence scores for each prediction.

Under the hood, QVQ-72B-Preview is built on Qwen's existing vision-language model, Qwen2-VL-72B, with added capabilities for thinking and reasoning. Qwen says it's the first open-source model of its kind. While it seems similar to their recently released QwQ reasoning model, the team hasn't explained if or how the two models are related.
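Since QVQ-72B-Preview builds on Qwen2-VL-72B, it can presumably be queried through the same Hugging Face transformers classes as its base model. The sketch below is only illustrative and rests on that assumption: the repository id, image file, and prompt are placeholders, and the project page documents the officially supported usage.

```python
# Hedged sketch: querying QVQ-72B-Preview via Hugging Face transformers,
# assuming it loads through the same classes as its Qwen2-VL-72B base model.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/QVQ-72B-Preview"  # assumed repository id; check the project page
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One image plus a text instruction, mirroring the flow described above.
image = Image.open("math_problem.png")  # illustrative local file
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Solve the problem in the image step by step."},
        ],
    }
]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], padding=True, return_tensors="pt").to(model.device)

# Reasoning models emit long chains of thought, so leave room for many new tokens.
output_ids = model.generate(**inputs, max_new_tokens=2048)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

In practice, a 72B model of this size needs multiple GPUs or a quantized variant; the hosted demo on Hugging Face is the easiest way to try it without local hardware.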

To put the model through its paces, Qwen used four different benchmarks: MMMU tests college-level visual understanding, MathVista checks how well it can reason through mathematical graphs, MathVision challenges it with math competition problems, and OlympiadBench tests Olympic-level math and physics problems in both Chinese and English.


In these tests, QVQ performed better than its predecessor Qwen2-VL-72B-Instruct across the board, reaching similar levels of accuracy as closed-source models like OpenAI's o1 and Claude 3.5 Sonnet.

Comparison table of five models across the four benchmarks (MMMU, MathVista, MathVision, OlympiadBench): QVQ performs similarly to OpenAI's o1 and Claude 3.5 Sonnet, but is open source. | Image: Qwen

Qwen admits that the model still has some limitations. It can switch between languages unexpectedly or get stuck in circular reasoning loops, problems that even OpenAI's o1 hasn't solved yet. During complex visual reasoning tasks, the model sometimes loses track of what it's looking at, which can lead to hallucinations. The team also says the model needs stronger safeguards before it's ready for widespread use.

Example of QVQ's visual problem-solving capabilities. | Video: Qwen

Qwen describes QVQ as their "last gift" of the year and sees it as one step toward a bigger goal: creating what they call an "omniscient and intelligent model" on the path to artificial general intelligence (AGI). Like OpenAI with GPT-4o, the team plans to build a unified "omni" model that can tackle more complex scientific challenges.

"Imagine an AI that can look at a complex physics problem, and methodically reason its way to a solution with the confidence of a master physicist," the team explains.

QVQ's open-source code and model weights are available via the project page, with a free demo on Hugging Face.

Summary
  • Alibaba's Qwen research team has released QVQ-72B-Preview, an open-source vision-language model capable of analyzing visual information and drawing complex conclusions from images and instructions.
  • The model processes information in a step-by-step manner: it reads an image and an instruction, analyzes the content, draws conclusions, and outputs a prediction along with a confidence value. In benchmarks, the model shows strong performance in math and physics compared to its peers.
  • Despite its capabilities, Qwen acknowledges some limitations, such as mixing languages, getting caught in logical loops, and losing focus on the image. The team aims to address these issues and integrate additional modalities to create an "omni" model in the future.