AI researchers use Nvidia's StyleGAN2 to manipulate faces in videos believably and consistently. That makes deepfakes even more versatile.
Generative Adversarial Networks (GANs) form the basis of many current methods for image generation and manipulation. One commonly used network is Nvidia's StyleGAN, which was recently released with some improvements as StyleGAN3.
StyleGAN can generate believable images of faces, animals, or other subjects. With additional tools, the network can also manipulate these images. One example is StyleCLIP, which uses StyleGAN to generate and manipulate images based on text descriptions.
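To illustrate the idea, text-guided editing of this kind can be approximated by optimizing a latent code against a CLIP similarity loss, roughly as in the sketch below. The `generator` and `w_init` are only toy stand-ins for a real pretrained StyleGAN generator and an inverted latent, and the loss weights are arbitrary; just the CLIP calls correspond to the actual openai/CLIP package.

```python
# Toy sketch of CLIP-guided latent optimization (the idea behind StyleCLIP's
# optimization variant): nudge a latent code so the generated image matches a
# text prompt while staying close to the original.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cpu"  # keep the toy example in fp32 on CPU
clip_model, _ = clip.load("ViT-B/32", device=device)

def generator(w):
    # Stand-in for a pretrained StyleGAN generator; it just reshapes the latent
    # into a 224x224 image so the example runs end to end.
    return torch.tanh(w.view(1, 3, 224, 224))

with torch.no_grad():
    text_features = clip_model.encode_text(clip.tokenize(["a smiling face"]).to(device))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

w_init = torch.randn(1, 3 * 224 * 224)  # stand-in for the inverted latent of the input face
w = w_init.clone().requires_grad_(True)
optimizer = torch.optim.Adam([w], lr=0.05)

for step in range(100):
    image_features = clip_model.encode_image(generator(w))
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    clip_loss = 1.0 - (image_features * text_features).sum()   # pull the image toward the prompt
    latent_loss = ((w - w_init) ** 2).mean()                   # keep the edit close to the original
    loss = clip_loss + 0.1 * latent_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```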
StyleGAN was previously not suitable for videos - this is now changing
While generating and manipulating individual images with artificial intelligence can produce photo-realistic results, processing video remains a major challenge.
For example, individual frames can be generated or manipulated and then combined into a video. But temporal coherence from frame to frame is missing: hairstyles shift, eyes suddenly look in a different direction, or the lighting on the face changes.
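The problem can be illustrated with a minimal sketch: the loop below edits every frame in isolation, so no information is shared between neighboring frames and small differences accumulate into flicker. `edit_face` is a hypothetical stand-in for any single-image editing model; only the video I/O uses OpenCV's real API.

```python
# Naive per-frame editing: each frame is manipulated in isolation, so small
# differences in the model's output show up as flicker (shifting hair,
# wandering gaze, changing lighting) once the frames are reassembled.
import cv2

def edit_face(frame):
    # Hypothetical stand-in for a single-image face editing model
    # (e.g. GAN inversion plus a latent edit). Returns the edited frame.
    return frame

cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("edited.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    out.write(edit_face(frame))  # no information is shared between frames

cap.release()
out.release()
```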
To transfer the successes of GANs, for example in face editing, to video, GANs could in theory be trained directly on videos - but that approach fails simply for lack of high-quality face videos. Models like Nvidia's StyleGAN require tens of thousands of images for AI training.
New method uses StyleGAN for videos
As a new method from AI researchers at Tel Aviv University now shows, video training isn't necessary at all - at least for face manipulation in short video clips. Instead, the team relies on an extended StyleGAN architecture that exploits the temporal coherence present in the original video.
To do this, the AI system first splits the video into individual frames, from which the face is cropped and aligned horizontally. A StyleGAN2 model with an e4e encoder then generates a copy of each face inside the network. These copies are then fine-tuned against the originals to correct inaccuracies and ensure coherence.
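A rough sketch of this preparation stage, under heavy assumptions, might look as follows. `detect_and_align`, `e4e_encoder`, and `generator` are placeholder callables for the respective pretrained models; only the overall flow mirrors the description above, and the loss and learning rate are illustrative.

```python
# Sketch of the preparation stage: crop and align the face in every frame,
# invert each face into StyleGAN2's latent space with an e4e encoder, then
# fine-tune the generator so its reconstructions match the originals.
# All model objects are passed in as placeholders.
import torch
import torch.nn.functional as F

def prepare_video(frames, detect_and_align, e4e_encoder, generator, tuning_steps=100):
    # 1) Crop and horizontally align the face in each frame.
    faces = [detect_and_align(frame) for frame in frames]

    # 2) Invert every aligned face into a latent code (the "copy" inside the network).
    with torch.no_grad():
        latents = [e4e_encoder(face) for face in faces]

    # 3) Fine-tune the single generator against the original faces so that the
    #    reconstructions become accurate and stay coherent across frames.
    optimizer = torch.optim.Adam(generator.parameters(), lr=3e-4)
    for _ in range(tuning_steps):
        for face, w in zip(faces, latents):
            loss = F.mse_loss(generator(w), face)  # a perceptual loss would be added in practice
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    return faces, latents, generator
```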
Next, the copies are edited as desired - a smile is added, or a person is made younger or older. In the penultimate step, the edited faces are stitched back into their backgrounds and finally merged into a new video.
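Sketched under the same assumptions, the editing and stitching stage could look like this: each latent is shifted along a semantic edit direction, the face is regenerated, and the crop is blended back into its frame before the frames are written out as a video. `direction`, `generator`, and `paste_back` are illustrative placeholders; only the OpenCV writer is a real API.

```python
# Sketch of the editing and stitching stage: shift each latent along a semantic
# direction (e.g. a "smile" or "age" direction), regenerate the face crop, and
# blend it back into the original frame before writing the video.
import cv2
import numpy as np

def edit_and_stitch(frames, latents, generator, direction, paste_back, strength=2.0):
    edited_frames = []
    for frame, w in zip(frames, latents):
        w_edited = w + strength * direction                   # apply the edit in latent space
        edited_face = generator(w_edited)                     # regenerate the edited face crop
        edited_frames.append(paste_back(frame, edited_face))  # blend the crop into the background
    return edited_frames

def write_video(frames, path, fps=25.0):
    # Reassemble the edited frames into a video file (real OpenCV API).
    height, width = frames[0].shape[:2]
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))
    for frame in frames:
        writer.write(frame.astype(np.uint8))
    writer.release()
```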
The results are impressive, as is the performance: a single video can be processed in about 1.5 hours on an Nvidia RTX 2080. The researchers plan to fix remaining small errors, such as missing pigtails or unstable facial features, in the future, for example by switching to StyleGAN3.
More information, examples, and soon the code are available on the project page of "Stitch it in Time".