
StableVideo brings video editing capabilities to Stable Diffusion, such as applying new styles or changing backgrounds.


Generating realistic and temporally coherent videos from text prompts remains a challenge for AI, with even state-of-the-art systems such as those from RunwayML still showing significant inconsistencies.

While there is still much work to be done on this frontier, some research, such as StableVideo, is exploring how generative AI can build on existing videos. Instead of generating videos from scratch, StableVideo uses diffusion models such as Stable Diffusion to edit videos frame by frame.

According to the researchers, this ensures consistency across frames through a technique that transfers information between keyframes. This allows objects and backgrounds to be modified semantically while maintaining continuity, similar to VideoControlNet.


StableVideo introduces inter-frame propagation for better consistency

StableVideo does this by introducing "inter-frame propagation" to diffusion models. This propagates object appearances between keyframes, allowing for consistent generation across the video sequence.

Specifically, StableVideo first selects keyframes and edits them with a standard diffusion model such as Stable Diffusion based on text prompts. The model takes visual structure into account to preserve shapes. It then transfers information from one edited keyframe to the next via the content the frames share in the video, which guides the model to generate subsequent frames consistently.
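To make the pipeline concrete, here is a minimal Python sketch of such a keyframe editing loop, built on the depth-conditioned Stable Diffusion pipeline from Hugging Face's diffusers library to preserve visual structure. The propagate helper, the keyframe spacing, file names, and prompt are illustrative assumptions, not the authors' implementation.

import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

# Depth-conditioned Stable Diffusion preserves the visual structure
# (shapes, layout) of each frame while restyling its appearance.
pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

def propagate(edited: Image.Image, src: int, dst: int) -> Image.Image:
    # Hypothetical placeholder: map the edited appearance of keyframe
    # `src` into keyframe `dst` via the content the two frames share.
    # StableVideo derives this mapping from the video itself; an
    # identity warp stands in here for illustration only.
    return edited

keyframe_ids = [0, 30, 60, 90]                    # assumed spacing
keyframes = [Image.open(f"frame_{i:04d}.png") for i in keyframe_ids]
prompt = "the same scene as an oil painting"      # example edit

edited = []
for i, frame in enumerate(keyframes):
    # The first keyframe is edited from scratch; later ones start from
    # the propagated appearance of the previous result, which keeps
    # the edit consistent across the sequence.
    init = frame if not edited else propagate(edited[-1], i - 1, i)
    out = pipe(prompt=prompt, image=init, strength=0.6).images[0]
    edited.append(out)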

Image: Chai et al.

Finally, an aggregation step combines the edited keyframes to create edited foreground and background video layers. Compositing these layers produces the final, coherent result.
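As a rough illustration of that final compositing step, the following sketch alpha-blends per-frame foreground and background layers with NumPy. The file layout and the source of the alpha masks are assumptions made for the example, not the paper's actual pipeline.

import numpy as np
from PIL import Image

def composite(fg: Image.Image, bg: Image.Image, alpha: Image.Image) -> Image.Image:
    # Standard alpha blend: out = a * foreground + (1 - a) * background.
    a = np.asarray(alpha.convert("L"), dtype=np.float32)[..., None] / 255.0
    fg_arr = np.asarray(fg.convert("RGB"), dtype=np.float32)
    bg_arr = np.asarray(bg.convert("RGB"), dtype=np.float32)
    out = a * fg_arr + (1.0 - a) * bg_arr
    return Image.fromarray(out.astype(np.uint8))

# Assumed file layout: one edited foreground, background, and alpha
# mask per frame, produced by the earlier aggregation stage.
final_frames = [
    composite(
        Image.open(f"edited_fg/{i:04d}.png"),
        Image.open(f"edited_bg/{i:04d}.png"),
        Image.open(f"alpha/{i:04d}.png"),
    )
    for i in range(120)
]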

Video: Chai et al.


In experiments, the team demonstrates StableVideo's ability to perform various text-based edits, such as changing object attributes or applying artistic styles, while maintaining strong visual continuity throughout.

However, there are limitations: performance still depends on the capabilities of the underlying diffusion model, according to the researchers. Consistency also breaks down for complex deforming objects, a problem that requires further research.

More information and the code are available on the StableVideo GitHub.

Summary
  • StableVideo is a research project exploring how generative AI can edit existing videos frame by frame rather than generating videos from scratch, potentially increasing realism and consistency.
  • The system uses diffusion models such as Stable Diffusion to process keyframes based on text prompts, and then transfers information to other keyframes using an inter-frame propagation technique, maintaining visual continuity.
  • Although the project demonstrates promising results, researchers acknowledge its limitations, such as performance dependency on the underlying diffusion model and inconsistency in dealing with complex deforming objects.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.