
Show-1 is a generative AI model for text-to-video that uses a hybrid approach to outperform current alternatives, in some cases significantly.

Researchers at the National University of Singapore have developed Show-1, a new AI system that can generate high-quality videos from text descriptions. The Show-1 model is not related to the model of the same name behind the AI-generated South Park episode.

Show-1 relies on a combination of two different architectures for diffusion models - pixel-based and latent-based - to combine the best of both approaches.

Show-1 combines strong text alignment with high-quality output

Pixel-based diffusion models operate directly on pixel values and can therefore align generation closely with the text prompt, but they require a lot of computing power. Latent-based approaches, on the other hand, compress the input into a latent space before diffusion. They are more efficient but struggle to preserve the fine details described in the prompt.
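To make the trade-off concrete, here is a minimal PyTorch sketch, assumed for illustration and not Show-1's actual code: tiny convolutions stand in for the denoising U-Net and the VAE, and the point is only where the denoising step operates and how much smaller the latent tensor is.

```python
# Illustrative sketch (not Show-1's code): the core difference between
# pixel-based and latent-based video diffusion is *where* denoising happens.
import torch
import torch.nn as nn

# A toy video batch: (batch, channels, frames, height, width)
video = torch.randn(1, 3, 8, 64, 64)

# --- Pixel-based: the denoiser sees raw pixel values directly. ---
# Good prompt alignment, but cost scales with the full H x W resolution.
pixel_denoiser = nn.Conv3d(3, 3, kernel_size=3, padding=1)  # stand-in for a U-Net
noise_estimate = pixel_denoiser(video)                      # runs on (8, 64, 64)

# --- Latent-based: a VAE-style encoder compresses first (here 8x per side). ---
# Much cheaper (64x64 -> 8x8 spatially), but fine prompt details can be lost.
encoder = nn.Conv3d(3, 4, kernel_size=(1, 8, 8), stride=(1, 8, 8))
decoder = nn.ConvTranspose3d(4, 3, kernel_size=(1, 8, 8), stride=(1, 8, 8))
latent_denoiser = nn.Conv3d(4, 4, kernel_size=3, padding=1)

latents = encoder(video)             # (1, 4, 8, 8, 8): a far smaller tensor
latents = latent_denoiser(latents)   # denoising runs in latent space
reconstruction = decoder(latents)    # map back to pixels only at the end

print(video.shape, latents.shape, reconstruction.shape)
```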


The Show-1 model combines these two architectures: pixel-based diffusion generates the keyframes and low-resolution interpolated frames, capturing motion and content closely aligned with the text prompt. Latent-based diffusion then upscales the low-resolution video to high resolution, acting as an "expert" that adds realistic detail.

This hybrid approach combines the best of both worlds - the precise text-to-video alignment of pixel models and the efficient upscaling of latent models.
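This division of labor can be summarized in a few lines of code. The sketch below is a hypothetical outline of the pipeline, not Show-1's actual API: all function names, resolutions, and placeholder bodies are assumptions made for illustration, with each stage standing in for a full diffusion model.

```python
# Hypothetical outline of the hybrid pipeline described above.
# Function names, resolutions, and bodies are illustrative placeholders.
import torch

def pixel_keyframes(prompt: str, frames: int = 8) -> torch.Tensor:
    """Stage 1 (pixel-based): generate low-res keyframes tightly aligned
    with the prompt. Placeholder: returns random pixels."""
    return torch.rand(frames, 3, 64, 40)   # (F, C, H, W) at low resolution

def pixel_interpolate(keyframes: torch.Tensor, factor: int = 4) -> torch.Tensor:
    """Stage 2 (pixel-based): interpolate between keyframes to raise the
    frame rate. Placeholder: repeats frames instead of real interpolation."""
    return keyframes.repeat_interleave(factor, dim=0)

def latent_upscale(video: torch.Tensor, prompt: str, scale: int = 8) -> torch.Tensor:
    """Stage 3 (latent-based "expert"): super-resolve the low-res video,
    adding fine detail cheaply. Placeholder: plain bilinear resize."""
    return torch.nn.functional.interpolate(
        video, scale_factor=scale, mode="bilinear", align_corners=False
    )

prompt = "a panda surfing a wave at sunset"
low_res = pixel_interpolate(pixel_keyframes(prompt))  # motion/content from pixel stages
high_res = latent_upscale(low_res, prompt)            # detail from the latent stage
print(low_res.shape, high_res.shape)                  # (32, 3, 64, 40) -> (32, 3, 512, 320)
```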

Video: Zhang, Wu, Liu et al.

According to the team, Show-1 matches or exceeds state-of-the-art methods such as Imagen Video or Runway's Gen-2 in realism and text-to-video alignment, while using only 20-25% of the GPU memory that purely pixel-based models need to generate video. This efficiency could also make Show-1 attractive for open-source applications.

More information and examples are available on the Show-1 project page; the code and the model are to be released there soon.

Summary
  • Researchers at the National University of Singapore have developed Show-1, an AI system that generates high-quality video from text descriptions.
  • Show-1 combines two diffusion model architectures, pixel-based and latent-based, to take advantage of both approaches: precise text-to-video alignment and efficient upscaling.
  • Show-1 requires only 20-25% of the GPU memory of purely pixel-based models and achieves equivalent or better results than state-of-the-art methods such as Imagen Video or Runway's Gen-2, the team said.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.