
Recently, Google introduced two text-to-video models, which it is now combining in a new approach. The result is high-resolution, longer videos generated purely from text.

First, Google showed off Imagen Video, a text-to-video system based on the Imagen image AI that produces short clips from text input. The language understanding of a large language model (T5-XXL) is fundamental to the architecture. Imagen Video was trained jointly on images and videos.

At the same time, another Google team demonstrated Phenaki, a text-to-video AI that is also trained on videos and images. It can generate minutes-long videos from long text. The Phenaki team uses a transformer architecture with time-dependent causal attention that can string events together along a timeline described in a series of prompts.
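To illustrate what causal attention over time means, here is a minimal sketch in plain Python. It is not Phenaki's implementation, which this article does not describe; the function names `causal_mask` and `causal_attention_weights` are made up for this example. The idea shown is only the general one: each time step may look at itself and earlier steps, never at later ones.

```python
import math


def causal_mask(num_steps):
    """mask[t][s] is True when time step t may attend to time step s (s <= t)."""
    return [[s <= t for s in range(num_steps)] for t in range(num_steps)]


def causal_attention_weights(scores):
    """Row-wise softmax over raw attention scores with future steps masked out.

    `scores` is a square list of lists; scores[t][s] is how strongly
    step t would like to attend to step s before masking.
    Illustrative only, not taken from Phenaki.
    """
    n = len(scores)
    mask = causal_mask(n)
    weights = []
    for t in range(n):
        # Keep only the scores for the current and earlier time steps.
        visible = [scores[t][s] for s in range(n) if mask[t][s]]
        peak = max(visible)  # subtract the maximum for numerical stability
        exps = [math.exp(v - peak) for v in visible]
        total = sum(exps)
        # Later (future) time steps receive a weight of exactly 0.0.
        row = [e / total for e in exps] + [0.0] * (n - len(exps))
        weights.append(row)
    return weights


if __name__ == "__main__":
    uniform_scores = [[0.0] * 3 for _ in range(3)]
    for row in causal_attention_weights(uniform_scores):
        print(row)
```

Because every row ignores later steps, a segment generated for one prompt can depend on what came before it but not on what follows, which is what allows events to be chained along a timeline.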

Google merges Phenaki and Imagen Video to create long HD videos out of text

The Google research team had already hinted at a possible merger with Imagen Video when it presented Phenaki. This has now happened, and Google is showing the result as part of an overview of its current AI projects.


First, Phenaki generates a coherent video based on sequential prompts. Imagen Video then takes the output from Phenaki (prompt and video) and upscales it. Compared to other super-resolution systems, a particular strength of Imagen Video is its ability to incorporate text into the super-resolution module, Google writes.
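The two-stage data flow described above can be sketched as a toy pipeline. Everything in it is a hypothetical stand-in: `generate_low_res` only fabricates tiny placeholder frames, and `upscale` uses nearest-neighbour enlargement, whereas the real models are learned networks. The sketch only shows what is passed between the stages: the first stage turns a sequence of prompts into low-resolution video, and the second stage receives both the prompt and the video.

```python
from dataclasses import dataclass
from typing import List

Frame = List[List[int]]  # one grayscale frame as rows of pixel values


@dataclass
class Clip:
    prompt: str
    frames: List[Frame]


def generate_low_res(prompts, frames_per_prompt=2, size=2):
    """Stand-in for stage one: one low-resolution segment per prompt, in order."""
    clips = []
    for index, prompt in enumerate(prompts):
        # Placeholder pixels: every pixel holds the index of its prompt.
        frames = [
            [[index] * size for _ in range(size)]
            for _ in range(frames_per_prompt)
        ]
        clips.append(Clip(prompt, frames))
    return clips


def upscale(clip, factor=2):
    """Stand-in for stage two: takes prompt and video, returns a larger video.

    A real text-conditioned super-resolution model would use clip.prompt;
    this toy version only carries the prompt along unchanged.
    """
    enlarged = []
    for frame in clip.frames:
        rows = []
        for row in frame:
            wide = [pixel for pixel in row for _ in range(factor)]
            rows.extend(list(wide) for _ in range(factor))
        enlarged.append(rows)
    return Clip(clip.prompt, enlarged)


def text_to_long_video(prompts, factor=2):
    """Chain both stages: sequential prompts in, upscaled segments out."""
    return [upscale(clip, factor) for clip in generate_low_res(prompts)]


if __name__ == "__main__":
    result = text_to_long_video(["a cat wakes up", "the cat jumps"])
    for clip in result:
        print(clip.prompt, len(clip.frames), "frames")
```

Keeping the prompt attached to each segment mirrors the point Google highlights: the upscaling stage has access to the text, not just to the pixels.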

Alonso Martinez, a lead AI researcher at Google who is involved in the development of Phenaki, believes that at current rates of progress, the technology could be used to produce a major television show in as little as two years.

The technology is still in its infancy, according to Google. You can watch the combination of Phenaki and Imagen Video in the following video, starting at 28:25.

Imagen comes to Google's AI Test Kitchen

Google's first text-to-image systems are expected to be available soon in the AI Test Kitchen app (Android / iOS). With the image AI Imagen, Google recently presented what is probably the most powerful model of this kind, but has not released it so far, primarily for ethical reasons.

The rollout in the AI Test Kitchen app could indicate a change in strategy, which would make economic sense given the success of DALL-E 2, Midjourney and Stable Diffusion in an emerging market.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • Google reaches the next milestone in AI generation of videos from plain text: coherent videos in HD resolution.
  • To do this, Google is combining two text-to-video systems: Imagen Video can generate high-definition video, and Phenaki can generate temporally consistent image sequences from sequential prompts.
  • In a new demo, Google shows how the combination of Phenaki and Imagen Video produces long, consistent HD videos.
  • Alonso Martinez, a lead AI researcher at Google who is involved in the development of Phenaki, believes that at current rates of progress, the technology could be used to produce a major television show in as little as two years.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.