Show-1 could be the best freely available AI video generator yet
Key Points
- Researchers at the National University of Singapore have developed Show-1, an AI system that generates high-quality video from text descriptions.
- Show-1 combines two diffusion model architectures - pixel-based and latent-based - to take advantage of the strengths of both: precise text-to-video alignment and efficient upscaling.
- According to the team, Show-1 requires only 20-25% of the GPU memory of purely pixel-based models while achieving equivalent or better results than state-of-the-art methods such as Imagen Video or Runway's Gen-2.
Show-1 is a generative AI model for text-to-video that uses a hybrid approach to outperform current alternatives, in some cases significantly.
Researchers at the National University of Singapore have developed Show-1, a new AI system that can generate high-quality videos from text descriptions. The Show-1 model is not related to the model of the same name behind the AI-generated South Park episode.
Show-1 relies on two different diffusion model architectures - pixel-based and latent-based - to get the best of both approaches.
Show-1 combines precise text alignment with high-quality output
Pixel-based diffusion models work directly on pixel values and can therefore align the generated video closely with the text prompt, but they require a lot of computing power. Latent-based approaches, on the other hand, compress the input into a latent space before diffusion. They are more efficient, but struggle to preserve the fine details described in the text.
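To make the tradeoff concrete, here is a minimal toy sketch in PyTorch of the two sampling loops. Everything in it (the denoiser, the decoder stand-in, the tensor shapes) is a simplified assumption for illustration, not Show-1's actual code:

```python
import torch
import torch.nn.functional as F

# Toy sketch of the two diffusion families (stand-ins, not Show-1's code).

def toy_denoiser(x, t):
    # Stand-in for a trained noise-prediction U-Net.
    return 0.1 * x

def pixel_diffusion(num_steps=10, shape=(1, 3, 16, 64, 64)):
    """Pixel-based: denoise the raw video tensor (batch, ch, frames, H, W).
    Memory grows with frames x H x W, but the model sees every pixel,
    which helps precise text-to-video alignment."""
    x = torch.randn(shape)
    for t in range(num_steps, 0, -1):
        x = x - toy_denoiser(x, t)  # heavily simplified reverse-diffusion step
    return x

def latent_diffusion(num_steps=10, latent_shape=(1, 4, 16, 8, 8)):
    """Latent-based: diffuse in a compressed latent space, then decode.
    Much cheaper per step, but fine prompt details can be lost to compression."""
    z = torch.randn(latent_shape)
    for t in range(num_steps, 0, -1):
        z = z - toy_denoiser(z, t)
    # Stand-in for a VAE decoder: map latents back up to pixel space.
    return F.interpolate(z[:, :3], scale_factor=(1, 8, 8), mode="nearest")

print(pixel_diffusion().shape)   # torch.Size([1, 3, 16, 64, 64])
print(latent_diffusion().shape)  # torch.Size([1, 3, 16, 64, 64])
```

The point is the tensor each loop carries: the pixel version pushes the full frames x height x width tensor through every denoising step, while the latent version works on a tensor an order of magnitude smaller and only decodes to pixels once at the end.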
The Show-1 model combines these two architectures: pixel-based diffusion generates the keyframes and the low-resolution interpolated frames, capturing motion and content that closely follow the text prompt. Latent-based diffusion then upscales the low-resolution video to high resolution, with the latent model acting as an "expert" that adds realistic details.
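Put together, the three-stage pipeline can be pictured roughly like this. All function names and shapes below are hypothetical stand-ins chosen to show the data flow, not Show-1's real interfaces:

```python
import torch
import torch.nn.functional as F

# Hedged sketch of the three-stage pipeline described above.
# Every function is a toy stand-in; Show-1's real models and shapes differ.

def pixel_keyframes(prompt: str) -> torch.Tensor:
    # Stage 1 stand-in: pixel-space diffusion produces a few low-res
    # keyframes that follow the prompt closely.
    return torch.randn(8, 3, 64, 40)  # (frames, channels, H, W)

def pixel_interpolate(keyframes: torch.Tensor) -> torch.Tensor:
    # Stage 2 stand-in: pixel-space interpolation inserts frames between
    # the keyframes, still at low resolution, to capture the motion.
    return keyframes.repeat_interleave(4, dim=0)

def latent_superres(video: torch.Tensor, prompt: str) -> torch.Tensor:
    # Stage 3 stand-in: a latent-diffusion "expert" upscales the low-res
    # video and adds realistic detail.
    return F.interpolate(video, scale_factor=8, mode="nearest")

def show1_style_pipeline(prompt: str) -> torch.Tensor:
    low_res = pixel_interpolate(pixel_keyframes(prompt))
    return latent_superres(low_res, prompt)

print(show1_style_pipeline("a panda eating bamboo").shape)
# torch.Size([32, 3, 512, 320])
```

The expensive, prompt-faithful pixel models only ever touch small, low-resolution tensors; the full-resolution work is delegated to the cheaper latent upscaler, which is where the memory savings come from.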
This hybrid approach combines the best of both worlds - the precise text-to-video alignment of pixel models and the efficient upscaling of latent models.
Video: Zhang, Wu, Liu et al.
According to the team, Show-1 matches or exceeds state-of-the-art methods such as Imagen Video or Runway's Gen-2 in realism and text-to-video alignment, while using only 20-25% of the GPU memory that purely pixel-based models need to generate video. This efficiency could also make Show-1 attractive for open-source applications.
More information and examples are available on the Show-1 project page, where the code and the model will also be released soon.