Deepfakes are now even more versatile

AI researchers manipulate faces in videos believably and consistently with Nvidia's StyleGAN2. Deepfakes thus become even more versatile.

Generative Adversarial Networks (GANs) form the basis of many current methods for image generation and manipulation. One commonly used network is Nvidia's StyleGAN, which was recently released with some improvements as StyleGAN3.

StyleGAN can generate believable images of faces, animals, or other subjects. With additional tools, the network can also manipulate these images. One example is StyleCLIP, which uses StyleGAN to generate and manipulate images based on text descriptions.

StyleGAN previously not suitable for videos - this is now changing

While generating and manipulating individual images with artificial intelligence can produce photo-realistic results, processing video remains a major challenge.

For example, individual images can be generated or manipulated one at a time and then combined into a video. But temporal coherence from frame to frame is lost: hairstyles shift, eyes suddenly look in a different direction, or the lighting on the face changes.
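One rough way to quantify this kind of flicker is to measure how much each frame differs from the one before it. The sketch below is an illustration only, not a metric used by the researchers; frames are assumed to be flat lists of grayscale values between 0 and 1, and the function name `temporal_jitter` is hypothetical.

```python
import random


def temporal_jitter(frames):
    """Mean absolute per-pixel change between consecutive frames."""
    if len(frames) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(frames, frames[1:]):
        total += sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)
    return total / (len(frames) - 1)


rng = random.Random(0)
base = [rng.random() for _ in range(64)]  # stand-in for one source frame

# A coherent clip: every frame drifts only slightly from the previous one.
coherent = [base]
for _ in range(9):
    coherent.append(
        [min(1.0, max(0.0, p + rng.uniform(-0.01, 0.01))) for p in coherent[-1]]
    )

# Frames produced independently of each other, with no shared state.
independent = [[rng.random() for _ in range(64)] for _ in range(10)]

print(temporal_jitter(coherent) < temporal_jitter(independent))
```

A clip assembled from independently generated frames scores far higher on this measure than one whose frames evolve smoothly, which is the effect viewers perceive as shifting hair or flickering light.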

To transfer the successes of GANs, for example in editing faces, to video, GANs could in theory be trained on videos. In practice, this fails for lack of high-quality video footage of faces. Models like Nvidia's StyleGAN require tens of thousands of images for AI training.

New method uses StyleGAN for videos

As a new method from AI researchers at Tel Aviv University now shows, video training isn't necessary at all - at least for face manipulation in short video clips. Instead, the team relies on an extended StyleGAN architecture that exploits the temporal coherence present in the original video.

First, the AI system splits the video into individual frames, crops the face from each frame, and aligns it horizontally. Then a StyleGAN2 model with an e4e encoder generates a copy of each face within the network. The copies are then fine-tuned against the originals to correct inaccuracies and ensure coherence.
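The alignment step can be illustrated with a small geometric sketch: given two eye positions, compute the in-plane tilt and rotate the landmarks so that the eyes sit on a horizontal line. The landmark coordinates and the helper names `eye_tilt_degrees` and `rotate_point` are assumptions for illustration; the article does not specify how the researchers detect or align faces.

```python
import math


def eye_tilt_degrees(left_eye, right_eye):
    """Angle of the line through both eyes, relative to the horizontal image axis."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


def rotate_point(point, center, degrees):
    """Rotate `point` around `center` by `degrees` (standard 2D rotation matrix)."""
    rad = math.radians(degrees)
    x, y = point[0] - center[0], point[1] - center[1]
    return (
        center[0] + x * math.cos(rad) - y * math.sin(rad),
        center[1] + x * math.sin(rad) + y * math.cos(rad),
    )


left, right = (100.0, 120.0), (160.0, 140.0)  # made-up pixel coordinates
tilt = eye_tilt_degrees(left, right)
center = ((left[0] + right[0]) / 2, (left[1] + right[1]) / 2)

# Rotating by the negative tilt levels the eye line.
left_aligned = rotate_point(left, center, -tilt)
right_aligned = rotate_point(right, center, -tilt)
print(abs(left_aligned[1] - right_aligned[1]) < 1e-9)
```

In a real pipeline the same rotation would be applied to the image crop itself, and it would have to be inverted later when the edited face is placed back into the frame.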

Video: Tzaban et al

Next, the copies are edited as desired - a smile is added, a character is rejuvenated or aged. In the penultimate step, the resulting faces and their backgrounds are stitched together and finally merged into a new video.
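The stitching step can be approximated by a per-pixel blend between the edited face crop and the original frame, weighted by a soft mask. This is a deliberately simplified stand-in, not the researchers' stitching procedure; the mask values, the flat-list image format, and the function name `blend_face` are assumptions.

```python
def blend_face(original, edited, mask):
    """Per-pixel blend: mask 1.0 takes the edited pixel, 0.0 keeps the original."""
    if not (len(original) == len(edited) == len(mask)):
        raise ValueError("original, edited and mask must have the same length")
    return [m * e + (1.0 - m) * o for o, e, m in zip(original, edited, mask)]


original = [0.2, 0.2, 0.2, 0.2]  # one row of the source frame (grayscale)
edited = [0.8, 0.8, 0.8, 0.8]    # the same row after the face edit
mask = [0.0, 0.5, 1.0, 0.0]      # soft edge at index 1, face interior at index 2

blended = blend_face(original, edited, mask)
print(blended)
```

Background pixels keep their original values, the face interior takes the edited values, and the soft edge mixes both, which hides the seam between the two.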

The results are impressive, as is the performance: a single video can be processed in about 1.5 hours on an Nvidia RTX 2080. The researchers plan to fix the remaining small errors, such as missing pigtails or unstable facial features, in future work, for example by using StyleGAN3.

More information, along with examples and, soon, the code, is available on the project page of "Stitch it in Time".
