Scientists at Alibaba Group have introduced VACE, a general-purpose AI model designed to handle a broad range of video generation and editing tasks within a single system.

The model’s backbone is an enhanced diffusion transformer architecture, but the headline here is the new input format: the "Video Condition Unit" (VCU). The VCU is Alibaba’s answer to the perennial mess of multimodal inputs—it takes everything from text prompts to sequences of reference images or videos, plus spatial masks, and distills them into a unified representation. The team engineered dedicated mechanisms to get these disparate inputs working together, rather than clashing with each other.
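
To make that idea a bit more concrete, here is a minimal sketch of what a unified condition unit could look like as a plain data structure. The class and field names are illustrative assumptions for this sketch, not the paper's actual interface.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class VideoConditionUnit:
    """Illustrative container bundling the multimodal inputs a model like
    VACE conditions on. Field names are assumptions for this sketch."""
    prompt: str                                              # text prompt
    frames: List[np.ndarray] = field(default_factory=list)   # reference/context frames, each H x W x 3
    masks: List[np.ndarray] = field(default_factory=list)    # spatial masks, each H x W, 1 = editable

    def validate(self) -> None:
        # If masks are supplied, expect one mask per frame.
        if self.masks and len(self.masks) != len(self.frames):
            raise ValueError("expected one mask per frame")

# Pure text-to-video: only a prompt, no frames or masks needed.
vcu = VideoConditionUnit(prompt="an anime character surfing at sunset")
vcu.validate()
```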

VACE uses a technique called "concept decoupling" to split each image into editable and fixed regions, giving the model fine-grained control over what gets changed and what stays put. | Image: Jiang et al.

The process starts with masks dividing the image into "reactive" areas—targets for modification—and "inactive" zones that are left untouched. All of this visual information is embedded into a shared feature space and combined with the corresponding text input.
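
A rough sketch of that split, assuming binary masks where 1 marks a reactive pixel; the function name and the exact masking convention are assumptions made for illustration, not VACE's actual implementation.

```python
import numpy as np

def decouple_frame(frame: np.ndarray, mask: np.ndarray):
    """Split a frame into a reactive part (free to be regenerated) and an
    inactive part (kept as-is). `frame` is H x W x 3, `mask` is H x W with
    1 marking editable pixels -- an assumed convention for this sketch."""
    mask3 = mask[..., None].astype(frame.dtype)   # broadcast mask over color channels
    reactive = frame * mask3                      # pixels the model may change
    inactive = frame * (1.0 - mask3)              # pixels that must stay untouched
    return reactive, inactive

# Example: a 64x64 frame where the left half is marked editable.
frame = np.random.rand(64, 64, 3).astype(np.float32)
mask = np.zeros((64, 64), dtype=np.float32)
mask[:, :32] = 1.0
reactive, inactive = decouple_frame(frame, mask)
```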

To keep the generated video consistent from frame to frame, VACE maps these features into a latent space built to match the structure of the diffusion transformer. Time-embedding layers ensure that the model’s understanding of the sequence doesn’t fall apart as it moves through each frame. An attention mechanism then ties together features from different modalities and timesteps, so the system can handle everything as a cohesive whole—whether it’s producing new video content or editing existing footage.
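
As a toy illustration of that last step (not VACE's actual architecture): per-frame tokens get a learned time embedding keyed by frame index, are concatenated with text tokens into one sequence, and a self-attention layer mixes information across modalities and timesteps. All dimensions and layer choices here are assumptions.

```python
import torch
import torch.nn as nn

class JointAttentionSketch(nn.Module):
    """Toy sketch: frame tokens receive a time embedding by frame index,
    then attention runs over text and frame tokens as one joint sequence."""
    def __init__(self, dim: int = 256, max_frames: int = 64):
        super().__init__()
        self.time_embed = nn.Embedding(max_frames, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, text_tokens: torch.Tensor, frame_tokens: torch.Tensor):
        # text_tokens: (B, T_text, dim); frame_tokens: (B, n_frames, dim)
        n_frames = frame_tokens.shape[1]
        idx = torch.arange(n_frames, device=frame_tokens.device)
        frame_tokens = frame_tokens + self.time_embed(idx)       # keep frame order intact
        tokens = torch.cat([text_tokens, frame_tokens], dim=1)   # one joint sequence
        mixed, _ = self.attn(tokens, tokens, tokens)              # mix across modalities and time
        return mixed

# Example shapes: 8 text tokens and 16 frame tokens of width 256.
model = JointAttentionSketch()
out = model(torch.randn(1, 8, 256), torch.randn(1, 16, 256))
```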

Text-to-video, reference-to-video and video editing

VACE’s toolkit covers four core tasks: generating videos from text prompts, synthesizing new footage from reference images or clips, performing video-to-video edits, and applying masks for targeted edits. This one-model-fits-most approach unlocks a pretty broad set of use cases.

In practice, the demos are all over the map—VACE can animate a person walking out of frame, conjure up an anime character surfing, swap penguins for kittens, or expand a background to keep things visually seamless. If you want to see the breadth of what it can do, there are more examples on the project’s official website.

From referencing and animation to object rearrangement and scene expansion, VACE demonstrates a broad range of visual synthesis capabilities. | Image: Jiang et al.

Training started with the basics: the team focused first on inpainting and outpainting to shore up the text-to-video pipeline, then layered in reference images and moved on to more advanced editing tasks. For data, they pulled from internet videos—automatically filtering, segmenting, and enriching them with depth and pose annotations.

Benchmarking VACE across twelve video editing tasks

To actually measure how VACE stacks up, the researchers put together a dedicated benchmark: 480 examples covering a dozen video editing tasks, including inpainting, outpainting, stylization, depth control, and reference-guided generation. According to their results, VACE outperforms specialized open-source models across the board in both quantitative metrics and user studies. That said, there’s still a gap on reference-to-video generation, where commercial models like Vidu and Kling have the edge.

Alibaba’s researchers pitch VACE as an important step towards universal, multimodal video models, and the next move is pretty predictable—scaling up with bigger datasets and more compute. Some parts of the model are set to land as an open-source release on GitHub.

VACE fits into the bigger picture of Alibaba’s AI ambitions, alongside a string of recent large language model releases—especially the Qwen series. Other Chinese tech giants like ByteDance are pushing hard on video AI as well, sometimes matching or beating Western offerings like OpenAI’s Sora or Google's Veo 2.

Summary
  • Alibaba Group researchers have introduced VACE, a universal AI model that unites multiple video generation and editing tasks in one system by enhancing the diffusion transformer architecture with a "Video Condition Unit" for multimodal inputs.
  • VACE handles four main tasks—text-to-video, reference-to-video, video-to-video editing, and masked video editing—supporting applications like character animation, object replacement, and background extension.
  • In tests on a benchmark of 480 examples across 12 tasks, VACE outperformed specialized open-source models, and the team plans to release parts of the model as open source to advance universal, multimodal video models.