
AI researchers use Nvidia's StyleGAN2 to manipulate faces in videos believably and consistently. Deepfakes thus become even more versatile.

Generative Adversarial Networks (GANs) form the basis of many current methods for image generation and manipulation. One commonly used network is Nvidia's StyleGAN, which was recently released with some improvements as StyleGAN3.

StyleGAN can generate believable images of faces, animals, or other subjects. With additional tools, the network can also manipulate these images. One example is StyleCLIP, which uses StyleGAN to generate and manipulate images based on text descriptions.

StyleGAN was previously unsuitable for videos - that is now changing

While generating and manipulating individual images with artificial intelligence can produce photo-realistic results, processing video remains a major challenge.


For example, individual images can be generated or manipulated and then combined into a video. But temporal coherence from frame to frame is missing: hairstyles shift, eyes suddenly look in a different direction, or the lighting on the face changes.

To transfer the successes of GANs, for example in editing faces, to videos, GANs could in theory be trained directly on videos - but that approach fails simply because there are not enough high-quality videos of faces. Models like Nvidia's StyleGAN require tens of millions of images for AI training.

New method uses StyleGAN for videos

As a new method from AI researchers at Tel Aviv University now shows, video training isn't necessary at all - at least for face manipulation in short video clips. Instead, the team relies on an extended StyleGAN architecture that exploits the temporal coherence present in the original video.

First, the system splits the video into individual frames, crops the face from each frame, and aligns it. A StyleGAN2 model with an e4e encoder then inverts each face into the network's latent space, creating a copy of it within the network. These copies are then fine-tuned against the originals to correct inaccuracies and ensure coherence.
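Roughly, that inversion and tuning stage can be sketched as follows. This is only a minimal illustration in PyTorch under stated assumptions, not the authors' actual code: the helpers extract_frames, align_face, the e4e_encoder, and the generator are hypothetical placeholders, and the real method uses additional perceptual and regularization losses.

```python
# Minimal sketch of per-frame inversion plus generator fine-tuning.
# extract_frames, align_face, e4e_encoder and generator are assumed
# placeholders, not the released implementation.
import torch
import torch.nn.functional as F

def invert_and_tune(video_path, e4e_encoder, generator, steps=200, lr=3e-4):
    """Invert every aligned face crop with e4e, then briefly fine-tune
    the StyleGAN2 generator so its reconstructions match the originals."""
    frames = extract_frames(video_path)            # list of RGB frame tensors
    crops = [align_face(f) for f in frames]        # cropped, aligned face crops

    # 1) Encode each crop into StyleGAN2's latent space (a "copy" per face).
    with torch.no_grad():
        latents = [e4e_encoder(c) for c in crops]

    # 2) Fine-tune the generator weights against the original crops so the
    #    copies stay faithful and temporally coherent across frames.
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    for _ in range(steps):
        for crop, w in zip(crops, latents):
            recon = generator(w)
            loss = F.l1_loss(recon, crop)          # perceptual terms added in practice
            opt.zero_grad()
            loss.backward()
            opt.step()

    return latents, generator
```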

Video: Tzaban et al


Next, the copies are edited as desired - a smile is added, a person is made younger or older. In the penultimate step, the edited faces are stitched back into their original backgrounds and finally merged into a new video.
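Conceptually, such an edit is just a shift of each frame's latent code along a learned direction, after which the regenerated face is blended back into the original frame. A minimal sketch, assuming a precomputed smile_direction vector and a hypothetical paste_face blending helper:

```python
# Sketch of the editing and stitching stage. smile_direction and paste_face
# are hypothetical placeholders for illustration only.
import torch

def edit_and_stitch(latents, generator, frames, smile_direction, strength=1.5):
    """Shift each inverted latent along an edit direction (e.g. 'smile'),
    regenerate the face, and paste it back into the original frame."""
    edited_frames = []
    for frame, w in zip(frames, latents):
        w_edited = w + strength * smile_direction      # move latent toward "smile"
        face = generator(w_edited)                     # regenerate the edited face
        edited_frames.append(paste_face(frame, face))  # blend into the background
    return edited_frames  # these frames are then re-encoded into a video
```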

Video: Tzaban et al

The results are impressive, as is the performance: a single video can be processed in about 1.5 hours on an Nvidia RTX 2080. In the future, the researchers want to fix remaining small errors, such as missing pigtails or unstable facial features, for example by using StyleGAN3.

Video: Tzaban et al


More information, examples, and soon the code are available on the project page of "Stitch it in Time".


Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.