
Alibaba has introduced Qwen3-Omni, a native multimodal AI model designed to process text, images, audio, and video in real time.


According to Alibaba, Qwen3-Omni ranks at the top on 32 out of 36 audio and video benchmarks, outperforming established models like Gemini 2.5 Flash and GPT-4o in tasks such as speech comprehension and voice generation. In specialized areas, its performance matches models built for a single input type.

Qwen3-Omni delivers stronger reasoning for complex tasks, supports multilingual dialog, offers fast response times with just 234 ms latency, and can transcribe up to 30 minutes of audio. | Image: Alibaba

While Alibaba hasn't released a technical report, blog posts and benchmark results offer some details. The 30-billion-parameter model uses a mixture-of-experts architecture, activating three billion parameters per inference. Qwen3-Omni processes audio input in 211 milliseconds and combined audio and video in 507 milliseconds.
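To make the mixture-of-experts idea concrete: instead of running the full network for every token, a small router picks a handful of expert sub-networks, so only a fraction of the weights is active at any time. The sketch below is a generic top-k MoE forward pass in Python; the sizes, routing scheme, and names are illustrative assumptions, not Qwen3-Omni's actual configuration.

```python
import numpy as np

# Generic top-k mixture-of-experts forward pass (illustrative only;
# sizes and routing details are NOT Qwen3-Omni's real configuration).
rng = np.random.default_rng(0)

D_MODEL = 64    # hidden size (toy value)
N_EXPERTS = 8   # total experts in the layer
TOP_K = 2       # experts activated per token

# Each expert is reduced to a single weight matrix for the sketch.
experts = [rng.standard_normal((D_MODEL, D_MODEL)) * 0.02 for _ in range(N_EXPERTS)]
router = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router                            # (tokens, N_EXPERTS)
    top = np.argsort(logits, axis=-1)[:, -TOP_K:]  # indices of chosen experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, top[t]]
        weights = np.exp(chosen - chosen.max())
        weights /= weights.sum()                   # softmax over chosen experts only
        for w, e in zip(weights, top[t]):
            out[t] += w * (x[t] @ experts[e])      # only TOP_K experts run per token
    return out

tokens = rng.standard_normal((4, D_MODEL))
print(moe_forward(tokens).shape)  # (4, 64)
```

Because only TOP_K of the N_EXPERTS weight matrices run per token, compute scales with the roughly three billion active parameters rather than the full 30 billion.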

Given its relatively compact architecture, it's impressive that Qwen3-Omni can keep up with leading commercial models in Alibaba's chosen benchmarks. That said, it's still an open question whether it can consistently match the performance of models like GPT-4o or Gemini 2.5 Flash in everyday use, since smaller models often struggle outside of controlled tests.

On most audio, text, and image processing benchmarks, Qwen3-Omni-30B-A3B outperforms other models. | Image: Alibaba

Two-part architecture for real-time processing

Qwen3-Omni uses a two-part system: the "Thinker" analyzes input and generates text, while the "Talker" turns this output directly into speech. Both components work in parallel to minimize lag.

Qwen3-Omni's split architecture separates text processing ("Thinker") and real-time speech output ("Talker"). | Image: Alibaba

For real-time output, Qwen3-Omni generates audio step by step instead of creating whole audio files at once. Each processing step is converted immediately into audible speech, allowing for seamless streaming. The audio encoder was trained on 20 million hours of audio. Both main components use multiple specialized subsystems running in parallel, enabling high throughput and fast response.
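A rough way to picture this split is a producer-consumer pipeline: one thread streams text tokens into a queue while a second thread turns each one into audio as soon as it arrives. The following is a minimal structural sketch under that assumption, not Alibaba's implementation; all names are made up.

```python
import queue
import threading
import time

# Minimal producer/consumer sketch of a Thinker/Talker-style split
# (structural illustration only, not Alibaba's actual code).
text_chunks: "queue.Queue[str | None]" = queue.Queue()

def thinker(prompt: str) -> None:
    """Stand-in for the text stage: emits output token by token."""
    for token in f"Answering: {prompt}".split():
        text_chunks.put(token)   # stream each token as soon as it exists
        time.sleep(0.05)         # pretend decoding latency
    text_chunks.put(None)        # sentinel: generation finished

def talker() -> None:
    """Stand-in for the speech stage: consumes tokens in parallel."""
    while (token := text_chunks.get()) is not None:
        # A real Talker would emit audio codec frames here; we just log.
        print(f"[audio out] {token}")

t1 = threading.Thread(target=thinker, args=("What is on this menu?",))
t2 = threading.Thread(target=talker)
t1.start(); t2.start()
t1.join(); t2.join()
```

Since the consumer starts speaking before the producer has finished writing, the first audible output can arrive while later tokens are still being generated, which is what keeps perceived latency in the low hundreds of milliseconds.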

Broad language support

The model processes text in 119 languages, understands spoken language in 19, and can respond in 10. It can analyze and summarize up to 30 minutes of audio.

Alibaba says the model is trained to perform equally well across all its supported input types. The company claims there are no trade-offs in any one area, even when handling multiple modalities at once.

Users can tweak the model's behavior through special instructions, such as altering response style or personality. Qwen3-Omni can also connect to external tools and services for more complex tasks.
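In practice, this kind of customization is typically done through a system prompt, and tool access through function-calling schemas. The snippet below sketches what that could look like against an OpenAI-compatible endpoint; the base URL, model ID, and tool definition are illustrative assumptions, not confirmed details of Alibaba's API.

```python
from openai import OpenAI  # pip install openai

# Hypothetical example: the endpoint URL and model ID are assumptions,
# not confirmed details of Alibaba's API platform.
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed
)

response = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-instruct",  # illustrative model ID
    messages=[
        # A system instruction steering response style and personality.
        {"role": "system", "content": "You are a terse assistant. Answer in one sentence."},
        {"role": "user", "content": "What's the weather in Hangzhou?"},
    ],
    tools=[{  # a function-calling schema the model may choose to invoke
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
)
print(response.choices[0].message)
```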


Dedicated model for audio descriptions

Alibaba is also releasing Qwen3-Omni-30B-A3B-Captioner, a separate model built for detailed analysis of audio content like music. The aim is to generate accurate, low-error descriptions and address a gap in the open-source ecosystem.

Qwen3-Omni can generate detailed genre and mood descriptions for audio. | Image: Alibaba

Alibaba says it plans to improve multi-speaker recognition, add text recognition for video, and boost learning from audio-video combinations. The company is also working on expanding autonomous agent capabilities.

Qwen3-Omni is available through Qwen Chat and as a demo on Hugging Face. Developers can plug the model into their own apps using Alibaba's API platform.
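As a hypothetical example of such an integration, the sketch below sends an audio file for transcription and summarization through an OpenAI-compatible client. The endpoint, model ID, and audio message format are assumptions for illustration and should be checked against Alibaba's API documentation.

```python
import base64
from openai import OpenAI  # pip install openai

# Hypothetical sketch of multimodal input over an OpenAI-compatible API.
# Endpoint, model ID, and audio-part support are assumed, not confirmed.
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed
)

with open("meeting.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("ascii")

response = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-instruct",  # illustrative model ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe and summarize this recording."},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)
print(response.choices[0].message.content)
```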

There are also two open-source versions: Qwen3-Omni-30B-A3B-Instruct for instruction following and Qwen3-Omni-30B-A3B-Thinking for complex reasoning.


Alibaba's YouTube demo shows Qwen3-Omni translating a restaurant menu in real time using a wearable. The release follows the launch of the Quark AI Glasses and rising popularity of Alibaba's Quark chatbot in Chinese app stores. With an English-language ad, Alibaba is clearly looking beyond China and targeting users in Western markets.

Summary
  • Alibaba has introduced Qwen3-Omni, a multimodal AI model capable of processing text, images, audio, and video in real time, with benchmarks showing it outperforms established competitors like Gemini 2.5 Flash and GPT-4o on most audio and video tasks.
  • The model uses a two-part system—"Thinker" for analysis and text generation, and "Talker" for converting output to speech—enabling fast response times, support for 119 written languages, and the ability to transcribe and summarize up to 30 minutes of audio.
  • Alibaba has released specialized open-source versions for instruction following and complex reasoning, as well as a dedicated captioning model for detailed audio analysis, while also targeting international markets through demos and English-language promotions.