
Alibaba has launched Wan2.5-Preview, a new video model capable of generating short clips with synchronized audio.


The system combines text, images, video, and audio in a single architecture, putting it in the same category as Google's Veo 3. Details about how Wan2.5-Preview works are sparse. Alibaba mentions that reinforcement learning with human feedback was used and calls the model "a solid step [...] on our journey towards a 'World Model'". There's no technical report or transparency about training data.

Wan2.5-Preview generates 10-second, 1080p videos with audio tracks that can include multiple voices, background music, and sound effects. In a demo video posted on X, Alibaba strings together several clips to show off its audio generation. At first glance, the audio and visuals seem to match, but a closer look reveals that drumming and music often fall out of sync, and the model struggles to maintain consistent faces.

Video: Alibaba


The system takes text, images, or audio as input. Users can, for example, upload a photo and use a text prompt to make a video with matching music. Alibaba advertises "cinematic aesthetics" and a "cinematographic control system."

Wan2.5-Preview also offers image generation and editing at wan.video. The tool can produce photorealistic images, various art styles, and diagrams. Image editing works via voice commands, such as changing colors or combining different concepts.

Screenshot of a video editor interface: drop-down menu with functions (text-to-video selected) and bar with format and duration settings.
The wan.video interface, including its drop-down menus, looks almost identical to OpenAI's Sora. | Image: Screenshot by THE DECODER

Access and Pricing

Wan2.5-Preview is not open source, unlike earlier Alibaba models. Alibaba has not responded to requests for a code release, and there are no signs this will change.

The service is available on wan.video with monthly subscriptions starting at $6.50, or with pay-as-you-go credits. Depending on the plan, each clip costs between 13 and 25 cents. API pricing is between 5 and 15 cents per second, which is well below Veo 3's API cost of 15 to 40 cents per second.
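To put the per-second API rates in perspective, the following minimal sketch converts the quoted price ranges into the cost of a single clip. It assumes flat per-second billing with no minimum charge or other fees, and it applies the same 10-second clip length to both models purely for comparison; neither assumption is confirmed by Alibaba or Google. The names `PRICE_RANGES_CENTS_PER_SECOND` and `clip_cost_range` are illustrative and do not belong to any real API.

```python
# Per-second API prices in US cents, as quoted in the article.
PRICE_RANGES_CENTS_PER_SECOND = {
    "Wan2.5-Preview": (5, 15),
    "Veo 3": (15, 40),
}

# Clip length reported for Wan2.5-Preview; applied to both models
# here only to make the comparison like-for-like (an assumption).
CLIP_SECONDS = 10


def clip_cost_range(model: str, seconds: int = CLIP_SECONDS) -> tuple[float, float]:
    """Return the (low, high) cost in US dollars for one clip of the given length."""
    low, high = PRICE_RANGES_CENTS_PER_SECOND[model]
    return low * seconds / 100, high * seconds / 100


for model in PRICE_RANGES_CENTS_PER_SECOND:
    low, high = clip_cost_range(model)
    print(f"{model}: ${low:.2f} to ${high:.2f} per {CLIP_SECONDS}-second clip")
```

Under these assumptions, a 10-second clip would cost $0.50 to $1.50 through the Wan2.5-Preview API and $1.50 to $4.00 through Veo 3's, so the most expensive Wan2.5-Preview rate matches the cheapest Veo 3 rate.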

Alibaba's previous model, Wan2.2, was open source under the Apache 2.0 license and could generate 720p videos on consumer GPUs like the RTX 4090.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • Alibaba has launched Wan2.5-Preview, a new video generation model that creates short clips with synchronized audio and combines text, images, video, and audio in a single architecture.
  • The model can produce ten-second videos at 1080p resolution, but there are still noticeable issues with syncing audio and visuals, as well as keeping faces consistent across frames.
  • Wan2.5-Preview isn't open source and is only available for a fee through the wan.video platform or via API. Its API pricing is well below Veo 3's.
Jonathan writes for THE DECODER about how AI tools can improve both work and creative projects.