Show-1 is a generative AI model for text-to-video that uses a hybrid approach to outperform current alternatives, in some cases significantly.
Researchers at the National University of Singapore have developed Show-1, a new AI system that can generate high-quality videos from text descriptions. The Show-1 model is not related to the model of the same name behind the AI-generated South Park episode.
Show-1 combines two different diffusion model architectures - pixel-based and latent-based - to get the best of both approaches.
Show-1 combines text alignment with high-quality results
Pixel-based diffusion models work directly on pixel values and therefore keep generation closely aligned with the text prompt, but they require a lot of computing power. Latent-based approaches, on the other hand, compress the input into a latent space before diffusion. They are more efficient but struggle to preserve the fine details described in the text prompt.
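To make the trade-off concrete, here is a minimal sketch of why latent diffusion is cheaper: the denoising steps run on a much smaller tensor. The function names, resolutions, and latent shape below are illustrative assumptions, not Show-1's actual implementation.

```python
import numpy as np

def denoise_step(x: np.ndarray) -> np.ndarray:
    """Stand-in for one diffusion denoising step (a real model would predict noise here)."""
    return x * 0.99  # placeholder update, just to keep the sketch runnable

# Pixel-space diffusion operates on the full (frames, height, width, channels) tensor.
video_pixels = np.random.randn(16, 256, 256, 3)   # ~3.1M values touched per step
pixel_out = denoise_step(video_pixels)

# Latent-space diffusion first compresses each frame (e.g. 8x downsampling, 4 channels),
# so every denoising step works on far fewer values and needs far less compute.
video_latents = np.random.randn(16, 32, 32, 4)    # ~65K values touched per step
latent_out = denoise_step(video_latents)

print(video_pixels.size, "pixel values vs", video_latents.size, "latent values per step")
```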
The Show-1 model combines these two architectures: pixel-based diffusion first generates keyframes and low-resolution interpolated frames, capturing motion and content that closely follow the text prompt. Latent-based diffusion then upscales the low-resolution video to high resolution, acting as an "expert" that adds realistic detail.
This hybrid approach combines the best of both worlds - the precise text-to-video alignment of pixel models and the efficient upscaling of latent models.
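The coarse-to-fine ordering described above can be sketched as a short pipeline. The function names, resolutions, and frame counts here are hypothetical placeholders used only to show the order of the stages, not the released model's real interface.

```python
import numpy as np

def pixel_keyframes(prompt: str) -> np.ndarray:
    """Pixel-space diffusion stage: a few low-res keyframes closely aligned with the prompt."""
    return np.random.randn(8, 64, 64, 3)            # placeholder output

def pixel_interpolate(keyframes: np.ndarray) -> np.ndarray:
    """Pixel-space diffusion again: fill in motion between keyframes, still at low resolution."""
    return np.repeat(keyframes, 4, axis=0)           # placeholder: 8 -> 32 frames

def latent_upscale(video_lowres: np.ndarray, prompt: str, scale: int = 4) -> np.ndarray:
    """Latent-space diffusion as the 'expert' upscaler that adds high-resolution detail."""
    return np.kron(video_lowres, np.ones((1, scale, scale, 1)))  # placeholder upsampling

prompt = "a panda surfing a wave at sunset"
low_res = pixel_interpolate(pixel_keyframes(prompt))  # text-faithful, cheap to generate
final = latent_upscale(low_res, prompt)               # 64x64 -> 256x256 with added detail
print(final.shape)                                    # (32, 256, 256, 3)
```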
According to the team, Show-1 matches or exceeds state-of-the-art methods such as Imagen Video or Runway's Gen-2 in realism and text-to-video alignment, while using only about 20 to 25 percent of the GPU memory that purely pixel-based models need for video generation. This efficiency could also make Show-1 attractive for open-source applications.
More information and examples are available on the Show-1 project page, where the code and model are also due to be released soon.