
AI researchers use Nvidia's StyleGAN2 to manipulate faces in videos believably and consistently. Deepfakes thus become even more versatile.

Generative Adversarial Networks (GANs) form the basis of many current methods for image generation and manipulation. One commonly used network is Nvidia's StyleGAN, which was recently released with some improvements as StyleGAN3.

StyleGAN can generate believable images of faces, animals, or other subjects. With additional tools, the network can also manipulate these images. One example is StyleCLIP, which uses StyleGAN to generate and manipulate images based on text descriptions.

StyleGAN was previously unsuitable for videos - that is now changing

While generating and manipulating individual images with artificial intelligence can produce photo-realistic results, processing video remains a major challenge.


For example, individual images can be generated or manipulated and then combined into a video. But temporal coherence from frame to frame is missing: hairstyles shift, eyes suddenly look in a different direction, or the lighting on the face changes.

To transfer the successes of GANs, for example in editing faces, to videos, GANs could in theory be trained directly on videos - but that approach fails simply because there are not enough high-quality videos of faces. Models like Nvidia's StyleGAN require tens of millions of images for AI training.

New method uses StyleGAN for videos

As a new method from AI researchers at Tel Aviv University now shows, video training isn't necessary at all - at least for face manipulation in short video clips. Instead, the team relies on an extended StyleGAN architecture that exploits the temporal coherence present in the original video.

First, the system splits the video into individual frames, crops the face from each frame, and aligns it. A StyleGAN2 model with an e4e encoder then inverts each face into the network's latent space, creating a copy of it within the network. These copies are then fine-tuned against the originals to correct inaccuracies and ensure coherence.
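Roughly, that inversion and tuning stage can be sketched as follows. This is only a minimal illustration in PyTorch under stated assumptions, not the authors' actual code: the helpers extract_frames, align_face, the e4e_encoder, and the generator are hypothetical placeholders, and the real method uses additional perceptual and regularization losses.

```python
# Minimal sketch of per-frame inversion plus generator fine-tuning.
# extract_frames, align_face, e4e_encoder and generator are assumed
# placeholders, not the released implementation.
import torch
import torch.nn.functional as F

def invert_and_tune(video_path, e4e_encoder, generator, steps=200, lr=3e-4):
    """Invert every aligned face crop with e4e, then briefly fine-tune
    the StyleGAN2 generator so its reconstructions match the originals."""
    frames = extract_frames(video_path)            # list of RGB frame tensors
    crops = [align_face(f) for f in frames]        # cropped, aligned face crops

    # 1) Encode each crop into StyleGAN2's latent space (a "copy" per face).
    with torch.no_grad():
        latents = [e4e_encoder(c) for c in crops]

    # 2) Fine-tune the generator weights against the original crops so the
    #    copies stay faithful and temporally coherent across frames.
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    for _ in range(steps):
        for crop, w in zip(crops, latents):
            recon = generator(w)
            loss = F.l1_loss(recon, crop)          # perceptual terms added in practice
            opt.zero_grad()
            loss.backward()
            opt.step()

    return latents, generator
```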

Video: Tzaban et al


Next, the copies are edited as desired - a smile is added, a person is made younger or older. In the penultimate step, the edited faces are stitched back into their original backgrounds and finally merged into a new video.
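Conceptually, such an edit is just a shift of each frame's latent code along a learned direction, after which the regenerated face is blended back into the original frame. A minimal sketch, assuming a precomputed smile_direction vector and a hypothetical paste_face blending helper:

```python
# Sketch of the editing and stitching stage. smile_direction and paste_face
# are hypothetical placeholders for illustration only.
import torch

def edit_and_stitch(latents, generator, frames, smile_direction, strength=1.5):
    """Shift each inverted latent along an edit direction (e.g. 'smile'),
    regenerate the face, and paste it back into the original frame."""
    edited_frames = []
    for frame, w in zip(frames, latents):
        w_edited = w + strength * smile_direction      # move latent toward "smile"
        face = generator(w_edited)                     # regenerate the edited face
        edited_frames.append(paste_face(frame, face))  # blend into the background
    return edited_frames  # these frames are then re-encoded into a video
```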

Video: Tzaban et al

The results are impressive, as is the performance: a single video can be processed in about 1.5 hours on an Nvidia RTX 2080. In the future, the researchers want to fix remaining small errors, such as missing pigtails or unstable facial features, for example by using StyleGAN3.

Video: Tzaban et al


More information, examples, and soon the code are available on the project page of "Stitch it in Time".


Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.