Alibaba has released Qwen3-VL, an open-source vision-language model that works with both images and text. The top version, Qwen3-VL-235B-A22B, is available in two variants: "Instruct," which Alibaba reports outperforms Google's Gemini 2.5 Pro on major vision benchmarks, and "Thinking," which scores highly on multimodal reasoning tasks. Detailed benchmark results are available in Alibaba's technical blog.

Qwen3-VL can interact with graphical interfaces, generate code from screenshots, analyze videos up to two hours long, and recognize text in 32 languages, even when image quality is low. The model supports 2D and 3D spatial understanding and is designed to handle math and science tasks.

Video: Qwen3-VL demo shows agentic image processing.

Qwen3-VL is available on Hugging Face, ModelScope, and Alibaba Cloud. Public chat access is available at chat.qwen.ai.

Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.