Alibaba Cloud has released a new version of its vision language model, Qwen2-VL. This model can understand images in various resolutions and formats, as well as videos longer than 20 minutes. The smaller versions with 2 and 7 billion parameters are available as open source.
Alibaba's AI research team Qwen reports that the model achieves leading results in several visual comprehension benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.
Qwen2-VL as a visual agent
A key feature of Qwen2-VL is its ability to analyze videos over 20 minutes long, answer questions, conduct dialogues, or generate content based on them.
The model is designed to work as a visual agent that can integrate with devices like phones or robots. Using visual information and text instructions, it can draw complex conclusions, make decisions, and carry out automated actions, the team says.
In addition to English and Chinese, Qwen2-VL can understand text in images in many languages, including most European languages, Japanese, Korean, Arabic, and Vietnamese.
Current limitations include lack of audio support, weaknesses in object counting and spatial reasoning in 3D environments, and knowledge limited to June 2023.
Three model sizes, two open source
Qwen2-VL comes in three sizes: 2, 7, and 72 billion parameters. The 2 and 7 billion parameter versions are open source, available on GitHub and Hugging Face under the Apache 2.0 license.
Alibaba is offering an API for the 72 billion parameter version through its DashScope platform, requiring registration and an API key.
To use the models with Hugging Face Transformers, Qwen recommends installing from the source code. They also provide a toolkit called "qwen-vl-utils" to help process various visual input formats.
Qwen models, developed by Alibaba's cloud computing unit, have seen significant improvements in areas like programming, mathematics, logic, and multilingual understanding with the latest Qwen2 release in early June. These models were trained on data in 27 languages beyond English and Chinese. The team recently released Qwen2 math, an AI model optimized for solving math problems.