Alibaba's Qwen2-VL is designed as a visual agent that can analyze over 20 minutes of video

Sep 1, 2024

Alibaba Cloud

Alibaba Cloud has released a new version of its vision language model, Qwen2-VL. This model can understand images in various resolutions and formats, as well as videos longer than 20 minutes. The smaller versions with 2 and 7 billion parameters are available as open source.

Alibaba's AI research team Qwen reports that the model achieves leading results in several visual comprehension benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.

Benchmark results of Qwen 2 VL. | Bild: Alibaba Cloud

Qwen2-VL as a visual agent

A key feature of Qwen2-VL is its ability to analyze videos over 20 minutes long, answer questions, conduct dialogues, or generate content based on them.

The model is designed to work as a visual agent that can integrate with devices like phones or robots. Using visual information and text instructions, it can draw complex conclusions, make decisions, and carry out automated actions, the team says.

Example of Qwen 2 VL's video understanding. | Bild: Alibaba Cloud

In addition to English and Chinese, Qwen2-VL can understand text in images in many languages, including most European languages, Japanese, Korean, Arabic, and Vietnamese.

Current limitations include lack of audio support, weaknesses in object counting and spatial reasoning in 3D environments, and knowledge limited to June 2023.

Three model sizes, two open source

Qwen2-VL comes in three sizes: 2, 7, and 72 billion parameters. The 2 and 7 billion parameter versions are open source, available on GitHub and Hugging Face under the Apache 2.0 license.

Alibaba is offering an API for the 72 billion parameter version through its DashScope platform, requiring registration and an API key.

To use the models with Hugging Face Transformers, Qwen recommends installing from the source code. They also provide a toolkit called "qwen-vl-utils" to help process various visual input formats.

Qwen models, developed by Alibaba's cloud computing unit, have seen significant improvements in areas like programming, mathematics, logic, and multilingual understanding with the latest Qwen2 release in early June. These models were trained on data in 27 languages beyond English and Chinese. The team recently released Qwen2 math, an AI model optimized for solving math problems.

AI News Without the Hype – Curated by Humans

As a THE DECODER subscriber, you get ad-free reading, our weekly AI newsletter, the exclusive "AI Radar" Frontier Report 6× per year, access to comments, and our complete archive.

AI news without the hype
Curated by humans.

Over 20 percent launch discount.
Read without distractions – no Google ads.
Access to comments and community discussions.
Weekly AI newsletter.
6 times a year: “AI Radar” – deep dives on key AI topics.
Up to 25 % off on KI Pro online events.
Access to our full ten-year archive.
Get the latest AI news from The Decoder.

Subscribe to The Decoder

Alibaba's Qwen2-VL is designed as a visual agent that can analyze over 20 minutes of video

Qwen2-VL as a visual agent

Three model sizes, two open source

AI News Without the Hype – Curated by Humans

AI news without the hypeCurated by humans.

AI news without the hype
Curated by humans.