
Google introduces a new text-to-video model that outperforms alternative models and could set a new standard.

Google researchers have developed a new text-to-video (T2V) diffusion model called Lumiere that is capable of generating realistic AI videos that overcome many of the problems of alternative approaches.

Lumiere uses a new Space-Time U-Net (STUNet) architecture that enables the generation of videos with coherent motion and high quality. The method is fundamentally different from previous approaches based on a cascade of models that can only process parts of the video at a time.

Lumiere can also be used for other applications such as video inpainting, image-to-video generation, and stylized video generation. The model was trained on 30 million videos with associated text captions and shows competitive results in terms of video quality and text matching compared to other methods. The generated videos are 80 frames long at 16 frames per second (fps), i.e. five seconds each. The model is based on a pre-trained, frozen text-to-image model, which was extended with additional layers for video-relevant aspects such as the temporal dimension.
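To illustrate the idea of extending a frozen text-to-image backbone with new temporal layers, here is a minimal PyTorch sketch. It is not Google's code: the module names and the toy spatial block are hypothetical placeholders, and the real model is far more complex.

```python
# Minimal sketch (not Google's code): a frozen text-to-image block
# extended with a new, trainable temporal layer, as described above.
import torch
import torch.nn as nn


class TemporalLayer(nn.Module):
    """New, trainable 1D convolution over the time axis."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        b, c, t, h, w = x.shape
        y = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t)
        y = self.conv(y)
        return y.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)


class InflatedBlock(nn.Module):
    """Frozen spatial (image) block followed by a trainable temporal layer."""

    def __init__(self, spatial_block: nn.Module, channels: int):
        super().__init__()
        self.spatial = spatial_block
        for p in self.spatial.parameters():   # keep the T2I weights frozen
            p.requires_grad = False
        self.temporal = TemporalLayer(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape
        # apply the pre-trained image layer to every frame independently
        frames = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        frames = self.spatial(frames)
        x = frames.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        # then let the new temporal layer mix information across frames
        return self.temporal(x)


if __name__ == "__main__":
    block = InflatedBlock(nn.Conv2d(8, 8, 3, padding=1), channels=8)
    video = torch.randn(1, 8, 80, 16, 16)  # 80 frames, as in the paper
    print(block(video).shape)              # torch.Size([1, 8, 80, 16, 16])
```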


Google's Lumiere relies on spatial and temporal down- and up-sampling

Unlike previous T2V models, which first generate keyframes and then use Temporal Super-Resolution (TSR) models to insert missing frames between those keyframes, Lumiere generates the entire video sequence at once. This allows for more coherent and realistic motion throughout the video.

This is made possible by the STUNet architecture, which downsamples and later upsamples not only the spatial resolution, as existing methods do, but also the temporal resolution, i.e. the number of frames. After downsampling, the model processes the video at this reduced temporal resolution but still sees its full length, just with fewer frames. In this way, the model learns how objects and scenes move and change across this reduced number of frames.
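The following PyTorch sketch shows what down- and up-sampling along both the spatial and the temporal axis of a video tensor could look like. It is a simplified illustration, not the actual STUNet: the layer choices and channel sizes are assumptions.

```python
# Minimal sketch (not the actual STUNet): one stage that halves and later
# doubles the height, width and frame count of a video tensor.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpaceTimeDown(nn.Module):
    """Halve height, width and the number of frames."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3,
                              stride=(2, 2, 2), padding=1)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return F.silu(self.conv(x))


class SpaceTimeUp(nn.Module):
    """Double height, width and the number of frames again."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=(2, 2, 2), mode="trilinear",
                          align_corners=False)
        return F.silu(self.conv(x))


if __name__ == "__main__":
    down, up = SpaceTimeDown(4, 8), SpaceTimeUp(8, 4)
    video = torch.randn(1, 4, 80, 64, 64)   # 80 frames at 64x64 pixels
    coarse = down(video)                     # -> (1, 8, 40, 32, 32)
    print(coarse.shape, up(coarse).shape)    # back to (1, 4, 80, 64, 64)
```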

Image: Google

Once the model has learned the basic motion patterns at this reduced resolution, it can build on them to improve the final video quality at full temporal resolution. This process allows for more efficient handling of the video without compromising the quality of the generated motion and scenes.

Once the video has been generated at this lower temporal and spatial resolution, Lumiere uses Multidiffusion for spatial super-resolution (SSR). This involves dividing the video into overlapping segments and enhancing each segment individually to increase resolution. These segments are then stitched together to create a coherent, high-resolution video. This process makes it possible to produce high-quality video without the massive resources required for direct high-resolution production.
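The sketch below illustrates the overlap-and-blend idea in simplified form: a frame is split into overlapping tiles, each tile is processed, and the results are averaged where they overlap. The `enhance` function is a hypothetical stand-in for the super-resolution model, and the real Multidiffusion approach averages predictions inside the diffusion process rather than on finished frames.

```python
# Minimal sketch of overlap-and-blend stitching for spatial super-resolution.
import torch


def enhance(tile: torch.Tensor) -> torch.Tensor:
    # placeholder for the spatial super-resolution model applied per tile
    return tile


def stitch_overlapping(frame: torch.Tensor, tile: int = 64, stride: int = 48):
    """frame: (channels, height, width); tile/stride control the overlap."""
    c, h, w = frame.shape
    out = torch.zeros_like(frame)
    weight = torch.zeros(1, h, w)
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            patch = enhance(frame[:, y:y + tile, x:x + tile])
            out[:, y:y + tile, x:x + tile] += patch
            weight[:, y:y + tile, x:x + tile] += 1.0
    return out / weight.clamp(min=1.0)  # average where tiles overlap


if __name__ == "__main__":
    frame = torch.randn(3, 160, 160)
    print(stitch_overlapping(frame).shape)  # torch.Size([3, 160, 160])
```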

According to Google, Lumiere outperformed existing text-to-video models such as Imagen Video, Pika, Stable Video Diffusion, and Gen-2 in a user study. Despite its strengths, much remains to be done: Lumiere is not designed to generate videos with multiple scenes or transitions between scenes, which remains a challenge for future research.


More examples and information can be found on the Lumiere project page.

Summary
  • Google researchers have developed Lumiere, a new text-to-video (T2V) diffusion model that produces realistic video and overcomes many of the problems of alternative approaches.
  • Lumiere uses a Space-Time U-Net (STUNet) architecture that enables coherent motion and high video quality, unlike previous T2V models that rely on a cascade of models.
  • The model has been trained on 30 million videos and shows competitive results compared to other methods in terms of video quality and text matching.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.