
Google has unveiled VideoPoet, a new generative AI system that can create and edit videos from text and other inputs.

According to Google, VideoPoet is a large language model designed for a variety of video generation tasks, including text-to-video, image-to-video, video stylization, video inpainting and outpainting, and video-to-audio. Unlike competing models, VideoPoet integrates many capabilities into a single model, rather than relying on separately trained components for each task.

Video: Google

VideoPoet uses multiple tokenizers (MAGVIT V2 for video and images, SoundStream for audio) to train an autoregressive language model across video, image, audio, and text modalities. Once the model has generated tokens conditioned on some context, the tokenizer decoders convert them back into viewable video or audible audio.
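To make this token-based pipeline more concrete, here is a minimal, purely illustrative Python sketch. The classes, shapes, and token counts are placeholders invented for this example (Google has not released VideoPoet's code or an API); only the overall flow, encoding the input into discrete tokens, generating a continuation autoregressively, and decoding the new tokens back into frames, follows the description above.

```python
# Illustrative sketch of a VideoPoet-style pipeline: tokenizers map each modality
# to discrete tokens, a single autoregressive LM models the shared token sequence,
# and tokenizer decoders turn generated tokens back into output frames.
# All classes here are hypothetical stand-ins, not Google's actual API.

import numpy as np


class VideoTokenizer:
    """Stand-in for a MAGVIT-V2-style video/image tokenizer (hypothetical interface)."""

    def __init__(self, vocab_size: int = 8192):
        self.vocab_size = vocab_size

    def encode(self, frames: np.ndarray) -> np.ndarray:
        # A real tokenizer compresses spatio-temporal patches into discrete codes;
        # here we simply hash pixel values into token ids for illustration.
        patches = frames.reshape(frames.shape[0], -1)
        return patches.sum(axis=1).astype(np.int64) % self.vocab_size

    def decode(self, tokens: np.ndarray) -> np.ndarray:
        # Placeholder "decoder": a real one reconstructs frames from the codes.
        return np.zeros((len(tokens), 64, 64, 3), dtype=np.uint8)


class AutoregressiveLM:
    """Toy autoregressive model over the shared token vocabulary."""

    def __init__(self, vocab_size: int):
        self.vocab_size = vocab_size
        self.rng = np.random.default_rng(0)

    def generate(self, context: np.ndarray, num_tokens: int) -> np.ndarray:
        out = list(context)
        for _ in range(num_tokens):
            # A trained model samples from learned next-token probabilities;
            # uniform sampling here only demonstrates the loop structure.
            out.append(int(self.rng.integers(self.vocab_size)))
        return np.array(out[len(context):])


if __name__ == "__main__":
    tokenizer = VideoTokenizer()
    lm = AutoregressiveLM(tokenizer.vocab_size)

    # Condition on tokens from an input clip (e.g. image-to-video),
    # generate continuation tokens, then decode them back into frames.
    input_frames = np.random.randint(0, 256, (4, 64, 64, 3), dtype=np.uint8)
    context_tokens = tokenizer.encode(input_frames)
    new_tokens = lm.generate(context_tokens, num_tokens=16)
    generated_frames = tokenizer.decode(new_tokens)
    print(generated_frames.shape)  # (16, 64, 64, 3)
```

The key design point this sketch reflects is that all modalities share one token space, so a single language model can handle text-to-video, image-to-video, and video-to-audio without task-specific components.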


Video: Google

VideoPoet can generate videos with variable length and a range of motions and styles, depending on the text content. It can also take an input image and animate it with a prompt, predict optical flow and depth information for video stylization, and generate audio. By default, the model generates videos in portrait orientation to tailor its output towards short-form content.

Video: Google

Camera movements can also be controlled by describing them in the text prompt.

Video: Google

VideoPoet can also create videos with sound, like this cat playing the piano.

Video: Google

VideoPoet is a step towards "any-to-any" generation

According to Google, VideoPoet was evaluated on several benchmarks, and its generated videos were compared with those of other models. On average, human raters judged 24 to 35% of VideoPoet's examples to follow the prompt better than the output of competing models such as Phenaki, VideoCrafter, and Show-1.

According to Google, the framework could support "any-to-any" generation in the future and be extended to text-to-audio, audio-to-video and video captioning, "among many others".


Using Bard as a scriptwriter, Google has also produced a short film with VideoPoet:

The company has not revealed whether it has any plans to make the model available, but integration into a planned Bard Advanced at some point seems possible. More examples in full resolution can be found on the VideoPoet project page.

Summary
  • Google presents VideoPoet, an AI system that can generate and edit video from text and other inputs, including text-to-video, image-to-video, and video stylization.
  • VideoPoet is a large language model trained with multiple tokenizers for video, image, audio, and text modalities, allowing it to integrate many capabilities into a single model.
  • In the future, the framework could support any-to-any generation and be extended to text-to-audio, audio-to-video, and video captioning to enable even more versatile applications.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.