Alibaba has added a multimodal visual language model to its Qwen2.5 series, marking another step in the Chinese tech company's effort to compete in the commercial AI space.
Building on the open-source Qwen2-VL model from fall 2024, the new version promises better handling of various data types, including text, images and hour-long videos. The team says they've made particular progress with diagrams, icons, graphics and layouts. The model comes in three sizes: 3, 7 and 72 billion parameters.
These improvements also make the model useful as a visual assistant, according to the company. In several demos, Alibaba shows Qwen2.5 analyzing screen content and providing instructions for tasks like booking flights, checking weather forecasts, and navigating complex interfaces like GIMP.
While Qwen2.5 isn't a specialized agent model like OpenAI's recently introduced CUA, it excels at analyzing interfaces, identifying relevant buttons and planning workflows. This capability could make it valuable as the foundation for an open, Operator-like system.
Qwen2.5 beats GPT-4o and Claude 3.5 Sonnet in benchmarks
The model can identify specific objects and their components, for instance, determining whether a motorcyclist is wearing a helmet. When processing documents like invoices and forms, it can output the information in structured formats like JSON for easy reuse.
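To illustrate why structured output matters for reuse, here is a minimal sketch of consuming such a response downstream. The JSON field names and values are hypothetical, not Alibaba's actual output schema; the point is that once a vision-language model returns valid JSON, standard tooling can process it directly.

```python
import json

# Hypothetical response from a VLM prompted to extract invoice fields
# as JSON (field names invented for illustration).
model_response = """
{
  "invoice_number": "INV-2024-0042",
  "date": "2024-11-03",
  "line_items": [
    {"description": "Laptop", "quantity": 1, "unit_price": 1499.00}
  ]
}
"""

# Parse the model's text output into a Python structure.
invoice = json.loads(model_response)

# Fields are now programmatically accessible, e.g. for validation
# or import into an accounting system.
print(invoice["invoice_number"])  # → INV-2024-0042
total = sum(item["quantity"] * item["unit_price"]
            for item in invoice["line_items"])
print(total)  # → 1499.0
```

In practice the model's raw reply may wrap the JSON in markdown fences or extra prose, so production pipelines typically strip or validate the text before parsing.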
According to Alibaba, the largest version, Qwen2.5-VL-72B, performs on par with OpenAI's GPT-4o, Claude 3.5 Sonnet and Gemini 2.0 Flash across various benchmarks, sometimes outperforming them in areas like document comprehension and visual assistance without special training.
The smaller versions, Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-3B, also show improvements over GPT-4o-Mini and the previous Qwen2-VL in many tasks.
Qwen plans omnimodal model like GPT-4o
Looking ahead, the Qwen team plans to enhance the models' problem-solving and reasoning capabilities while adding support for more input types. Their ultimate goal is to create an AI model that can handle any kind of input or task, including audio. A detailed paper on the model's architecture and training process is in development.
The Qwen2.5-VL models are available as open source through GitHub, Hugging Face and ModelScope, as well as through Qwen Chat, though some commercial use restrictions apply. Due to Chinese regulations, these models, like those from DeepSeek, avoid discussing certain topics deemed sensitive by authorities.
The company recently expanded the Qwen2.5 series to include a model supporting context windows of up to one million tokens.