Open-source Qwen3-VL outperforms Gemini 2.5 Pro on major vision benchmarks, Alibaba reports

Sep 25, 2025

Alibaba has released Qwen3-VL, an open-source vision-language model that works with both images and text. The top version, Qwen3-VL-235B-A22B, is available in two variants: "Instruct," which Alibaba reports outperforms Google's Gemini 2.5 Pro on major vision benchmarks, and "Thinking," which scores highly on multimodal reasoning tasks. Detailed benchmark results are available in Alibaba's technical blog.

Qwen3-VL can interact with graphical interfaces, generate code from screenshots, analyze videos up to two hours long, and recognize text in 32 languages, even when image quality is low. The model supports 2D and 3D spatial understanding and is designed to handle math and science tasks.

Video: Qwen3-VL demo shows agentic image processing.

Qwen3-VL is available on Hugging Face, ModelScope, and Alibaba Cloud. Public chat access is available at chat.qwen.ai.

