Alibaba has released Wan2.2, the latest version of its open-source video generation model. The smallest version can generate 720P videos on a single RTX 4090 GPU.
The company says Wan2.2 brings significant improvements in generation quality and model capabilities compared to Wan2.1. The model is available under the Apache 2.0 license and comes in three main versions: T2V-A14B for text-to-video, I2V-A14B for image-to-video, and TI2V-5B for combined text-and-image-to-video generation.
The A14B models generate 5-second videos at 720P and 16fps. For the TI2V-5B model, Alibaba defines 720P as a non-standard 1280×704 or 704×1280 pixels.
MoE architecture boosts efficiency
The biggest change in Wan2.2 is the introduction of a Mixture-of-Experts (MoE) architecture in its video diffusion models. The A14B models use a two-expert design with 27 billion parameters in total, of which only 14 billion are active at each inference step.
The first expert handles the early denoising stages, where noise is high and the video's overall layout is established. The second expert takes over in the later stages to refine details.
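In practice, this amounts to routing each denoising step to one of two full-size experts based on the noise level. Here is a minimal sketch of that routing logic in Python; the class, the threshold value, and the expert interfaces are illustrative assumptions, not Alibaba's published code:

```python
import torch
import torch.nn as nn

class TwoExpertDenoiser(nn.Module):
    """Illustrative two-expert MoE routing for a video diffusion model.

    Assumptions: each expert is a full 14B denoising network, and routing
    is a simple timestep threshold (the value below is hypothetical).
    Only one expert runs per step, which is how 27B total parameters
    cost only 14B active parameters at any given inference step.
    """

    def __init__(self, high_noise_expert: nn.Module, low_noise_expert: nn.Module,
                 boundary_timestep: int = 875):
        super().__init__()
        self.high_noise_expert = high_noise_expert  # establishes overall layout
        self.low_noise_expert = low_noise_expert    # refines fine detail
        self.boundary_timestep = boundary_timestep  # hypothetical switch point

    def forward(self, latents: torch.Tensor, timestep: int,
                cond: torch.Tensor) -> torch.Tensor:
        # Diffusion timesteps count down from ~1000 (pure noise) to 0 (clean),
        # so large timesteps mean high noise: route those to the layout expert.
        if timestep >= self.boundary_timestep:
            return self.high_noise_expert(latents, timestep, cond)
        return self.low_noise_expert(latents, timestep, cond)
```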
Alibaba says it has also significantly expanded the training dataset for Wan2.2, using 65.6 percent more images and 83.2 percent more videos than Wan2.1.
Compact 5B model for consumer hardware
Alongside the 27B MoE models, Alibaba has developed a more compact 5B model called TI2V-5B. This version can generate a 5-second 720P video in under 9 minutes on a single consumer GPU like the RTX 4090, which Alibaba says makes it one of the fastest models at this quality level on consumer hardware.
TI2V-5B supports both text-to-video and image-to-video generation in a unified framework, producing 720P videos at 24fps. For the larger A14B models, Alibaba recommends at least 80GB of VRAM for single-GPU inference.
Integration and availability
The models are available through Hugging Face and ModelScope. Wan2.2 is already integrated with ComfyUI and Diffusers.
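For the Diffusers route, usage should follow the pattern of earlier Wan checkpoints. The sketch below is a hedged example, not official Wan2.2 documentation: the repository ID is an assumption based on the Wan family's naming convention, and the generation parameters are guesses derived from the specs above.

```python
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

# Assumed repo ID, following the naming of earlier Wan Diffusers checkpoints.
model_id = "Wan-AI/Wan2.2-TI2V-5B-Diffusers"

pipe = WanPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# TI2V-5B's 720P is 1280x704; 121 frames at 24fps is roughly a 5-second clip.
# Resolution, frame count, and guidance scale here are illustrative values.
video = pipe(
    prompt="A red panda climbing a snowy pine tree, cinematic lighting",
    height=704,
    width=1280,
    num_frames=121,
    guidance_scale=5.0,
).frames[0]

export_to_video(video, "wan22_sample.mp4", fps=24)
```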
A Hugging Face Space is available for direct use of the TI2V-5B model.