
Text to coherent HD video: Google merges Phenaki and Imagen Video

Matthias Bastian


Google recently introduced two text-to-video models and is now combining them in a single approach. The result is longer, high-resolution videos generated purely from text.

First, Google showed off Imagen Video, a text-to-video system based on the image AI Imagen that produces short clips from text input. The language understanding of a large language model (T5-XXL) is fundamental to the architecture. Imagen Video was trained jointly on images and videos.

At the same time, another Google team demonstrated Phenaki, a text-to-video AI that is also trained on videos and images. It can generate videos lasting minutes from long text descriptions. The Phenaki team uses a transformer architecture with causal attention along the time axis, which lets the model string together events in the order given by a series of sequential prompts.
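To make the idea of causal attention along the time axis more concrete, here is a minimal, generic sketch. It is not Phenaki's implementation: the function names `causal_mask` and `causal_attention_weights` are made up for this illustration, and the scores are placeholder values. It only shows the core rule that a time step may attend to itself and to earlier steps, never to later ones.

```python
import math

def causal_mask(num_steps):
    """mask[i][j] is True when time step i may attend to time step j (j <= i)."""
    return [[j <= i for j in range(num_steps)] for i in range(num_steps)]

def causal_attention_weights(scores):
    """Row-wise softmax over raw attention scores with future steps masked out.

    `scores` is a square list of lists; scores[i][j] is the raw score of
    step i looking at step j. This is a toy, not any model's actual code.
    """
    n = len(scores)
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        # Only the present and past steps take part in the softmax.
        visible = [scores[i][j] for j in range(n) if mask[i][j]]
        peak = max(visible)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future steps get a weight of exactly zero.
        row = [e / total for e in exps] + [0.0] * (n - len(visible))
        weights.append(row)
    return weights

# Four time steps with identical placeholder scores.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

Because each step only sees what came before it, a sequence built this way can be extended step by step, which fits the article's description of events being strung together along a timeline.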

Google merges Phenaki and Imagen Video to create long HD videos out of text

When it presented Phenaki, the Google research team had already hinted at a possible merger with Imagen Video. That merger has now happened, and Google is showing the result as part of an overview of its current AI projects.

First, Phenaki generates a coherent video based on sequential prompts. Imagen Video then takes the output from Phenaki (prompt and video) and upscales it. According to Google, a particular strength of Imagen Video compared to other super-resolution systems is that its super-resolution module can also take the text into account.
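The two-stage flow described above can be sketched as a toy pipeline. Everything in it is an assumption made for illustration: `Clip`, `generate_base_video`, `upscale_clip` and `text_to_hd_video` are hypothetical names, the "frames" are tiny grids of placeholder numbers, the example prompts are invented, and plain nearest-neighbour upscaling stands in for the actual super-resolution model. Neither system is publicly available, so this shows only the data flow, not either model.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    prompt: str   # text describing this segment of the video
    frames: list  # each frame is a 2D grid (list of rows) of pixel values

def generate_base_video(prompts, frames_per_prompt=2, size=2):
    """Hypothetical stand-in for stage one: one low-resolution clip per prompt."""
    clips = []
    for index, prompt in enumerate(prompts):
        # Placeholder pixels: every pixel holds the index of its prompt.
        frames = [[[index] * size for _ in range(size)]
                  for _ in range(frames_per_prompt)]
        clips.append(Clip(prompt, frames))
    return clips

def upscale_clip(clip, factor=2):
    """Hypothetical stand-in for stage two: nearest-neighbour upscaling.

    The prompt travels with the frames, mirroring the "prompt and video"
    hand-off in the article, but this toy upscaler does not use the text.
    """
    big_frames = []
    for frame in clip.frames:
        big = []
        for row in frame:
            wide = [pixel for pixel in row for _ in range(factor)]
            big.extend(list(wide) for _ in range(factor))
        big_frames.append(big)
    return Clip(clip.prompt, big_frames)

def text_to_hd_video(prompts, factor=2):
    """Run both stages in order: generate low-res clips, then upscale each."""
    return [upscale_clip(clip, factor) for clip in generate_base_video(prompts)]

# Invented prompts, used only to exercise the pipeline.
video = text_to_hd_video(["a teddy bear swims", "the bear walks on the beach"])
print(len(video), "clips,", len(video[0].frames[0]), "pixel rows per frame")
```

Keeping the prompt attached to each clip is the design point worth noting: it is what would allow a text-aware upscaler to consult the description while adding detail.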

Alonso Martinez, a lead AI researcher at Google who is involved in the development of Phenaki, believes that at current rates of progress, the technology could be used to produce a major television show in as little as two years.

According to Google, the technology is still in its infancy. A demonstration of the combined Phenaki and Imagen Video system can be seen in the following video, starting at 28:25.

Imagen comes to Google's AI Test Kitchen

Google's first text-to-image systems are expected to be available soon in the AI Test Kitchen app (Android / iOS). With the image AI Imagen, Google recently presented what is probably the most powerful model of its kind, but it has not yet released the model, primarily for ethical reasons.

The rollout in the AI Test Kitchen app could indicate a change in strategy. Such a change would make economic sense given the success of DALL-E 2, Midjourney and Stable Diffusion in an emerging market.