Content
summary Summary

Alibaba Cloud has released a new version of its vision language model, Qwen2-VL. This model can understand images in various resolutions and formats, as well as videos longer than 20 minutes. The smaller versions with 2 and 7 billion parameters are available as open source.

Ad

Alibaba's AI research team Qwen reports that the model achieves leading results in several visual comprehension benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.

Benchmark results of Qwen 2 VL. | Bild: Alibaba Cloud

Qwen2-VL as a visual agent

A key feature of Qwen2-VL is its ability to analyze videos over 20 minutes long, answer questions, conduct dialogues, or generate content based on them.

The model is designed to work as a visual agent that can integrate with devices like phones or robots. Using visual information and text instructions, it can draw complex conclusions, make decisions, and carry out automated actions, the team says.

Ad
Ad
Example of Qwen 2 VL's video understanding. | Bild: Alibaba Cloud

In addition to English and Chinese, Qwen2-VL can understand text in images in many languages, including most European languages, Japanese, Korean, Arabic, and Vietnamese.

Current limitations include lack of audio support, weaknesses in object counting and spatial reasoning in 3D environments, and knowledge limited to June 2023.

Three model sizes, two open source

Qwen2-VL comes in three sizes: 2, 7, and 72 billion parameters. The 2 and 7 billion parameter versions are open source, available on GitHub and Hugging Face under the Apache 2.0 license.

Alibaba is offering an API for the 72 billion parameter version through its DashScope platform, requiring registration and an API key.

To use the models with Hugging Face Transformers, Qwen recommends installing from the source code. They also provide a toolkit called "qwen-vl-utils" to help process various visual input formats.

Recommendation

Qwen models, developed by Alibaba's cloud computing unit, have seen significant improvements in areas like programming, mathematics, logic, and multilingual understanding with the latest Qwen2 release in early June. These models were trained on data in 27 languages beyond English and Chinese. The team recently released Qwen2 math, an AI model optimized for solving math problems.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • Alibaba's Qwen2-VL achieves top results in visual comprehension tasks and can analyze videos over 20 minutes long.
  • It's designed as a visual agent for device integration, offering complex reasoning and automated actions based on visual and text inputs.
  • The model is available in three sizes, with smaller versions open-sourced and the largest accessible via API.
Sources
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.