One image, one audio track - and your lip-synchronized deepfake, almost indistinguishable from the real thing, is ready. Alibaba's latest AI model "EMO" is not yet available, but the implications are clear.

Researchers at Alibaba Group have developed a novel framework called EMO that improves the realism and expressiveness of talking head videos. EMO uses a direct audio-to-video synthesis approach built on Stable Diffusion, eliminating the need for intermediate 3D models or facial landmarks. The method ensures seamless transitions between frames and a consistent character identity throughout the video.

Image: Tian et al.
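To make the idea of direct audio-to-video synthesis more concrete, here is a minimal, hypothetical PyTorch sketch of a denoiser conditioned only on per-frame audio features and a single reference image, with no 3D model or facial landmarks in between. All module names, dimensions, and the conditioning scheme are simplifying assumptions for illustration, not the researchers' actual implementation.

```python
# Hypothetical sketch: a diffusion-style denoiser conditioned directly on
# audio features and a single reference image (no 3D model, no landmarks).
import torch
import torch.nn as nn

class AudioConditionedDenoiser(nn.Module):
    def __init__(self, latent_dim=64, audio_dim=32):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, latent_dim)  # project audio features into latent space
        self.ref_proj = nn.Linear(latent_dim, latent_dim)   # identity features from the reference image
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads=4, batch_first=True)
        self.denoise = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, noisy_latents, audio_feats, ref_feat):
        # noisy_latents: (B, T, D) noisy latents for T video frames
        # audio_feats:   (B, T, A) per-frame features from a speech encoder
        # ref_feat:      (B, 1, D) features of the single reference image
        cond = self.audio_proj(audio_feats) + self.ref_proj(ref_feat)  # broadcast identity over time
        attended, _ = self.cross_attn(noisy_latents, cond, cond)       # frames attend to the conditioning
        return self.denoise(attended)                                  # predicted noise for this step

model = AudioConditionedDenoiser()
out = model(torch.randn(1, 16, 64), torch.randn(1, 16, 32), torch.randn(1, 1, 64))
print(out.shape)  # torch.Size([1, 16, 64])
```

In a real diffusion pipeline this step would be repeated over many denoising iterations and decoded back to pixels; the sketch only shows how audio and identity can condition the generator without any 3D intermediate.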

EMO can create both talking and singing videos in a variety of styles - from cartoon to anime to live action - and in different languages, significantly outperforming existing methods. Whether it's Audrey Hepburn, the Mona Lisa, or even the Japanese woman from a famous Sora demo video, a single image and an audio track are all it takes to make lips and facial muscles move.

Video: Tian et al.


One of the biggest challenges for the researchers was the instability of the generated videos, which often showed up as facial distortions or jittering between frames. To address this, they integrated control mechanisms such as a speed controller and a face region controller. These act as hyperparameters, providing subtle control signals that do not reduce the variety and expressiveness of the final videos.

Here, the researchers generated 14 different video clips based on the same eight-second audio clip. Image: Tian et al.
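To illustrate the idea of such weak control signals, here is a small, hypothetical sketch in the same spirit: a coarse motion-speed bucket and a face-region mask are embedded and added to the existing conditioning with a small weight, nudging the output toward stability without dictating the exact motion. Names, shapes, and the weighting are assumptions, not the paper's implementation.

```python
# Hypothetical sketch: "weak" control signals (motion speed, face region)
# blended into the conditioning with a small weight so they stabilize the
# output without constraining its expressiveness.
import torch
import torch.nn as nn

class WeakControls(nn.Module):
    def __init__(self, latent_dim=64, num_speed_buckets=8):
        super().__init__()
        self.speed_embed = nn.Embedding(num_speed_buckets, latent_dim)  # coarse head-motion speed
        self.mask_proj = nn.Linear(1, latent_dim)                       # face-region mask channel

    def forward(self, cond, speed_bucket, face_mask, strength=0.1):
        # cond:         (B, T, D) existing audio/reference conditioning
        # speed_bucket: (B,) integer bucket for the desired motion speed
        # face_mask:    (B, T, 1) fraction of each frame the face may occupy
        ctrl = self.speed_embed(speed_bucket).unsqueeze(1) + self.mask_proj(face_mask)
        return cond + strength * ctrl  # a small weight keeps the signal "subtle"

controls = WeakControls()
steered = controls(torch.randn(1, 16, 64), torch.tensor([3]), torch.rand(1, 16, 1))
print(steered.shape)  # torch.Size([1, 16, 64])
```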

Custom training dataset with 250 hours of footage

EMO was trained on a custom audio-video dataset containing more than 250 hours of material and well over 150 million images. This extensive dataset covers a wide range of content, including speeches, movie and TV clips, and voice recordings in multiple languages. The researchers do not reveal the exact source of the data in their paper.

In extensive experiments on the High-Definition Talking Face (HDTF) dataset, the EMO approach outperformed the current best competing methods on several metrics, including Fréchet Inception Distance (FID), Fréchet Video Distance (FVD), SyncNet lip-sync confidence, and facial similarity (F-SIM).

Image: Tian et al.
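For context on what these metrics measure: FID and FVD both rely on the Fréchet distance between feature distributions of real and generated samples (image features for FID, video features for FVD). The core formula can be sketched in a few lines; in practice the features come from a pretrained Inception or I3D network, which is omitted here.

```python
# The Fréchet distance used by FID/FVD: compare the mean and covariance of
# real vs. generated feature embeddings (features from a pretrained network
# are replaced by random toy data here).
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)  # matrix square root of the covariance product
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_f
    return diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean)

# Toy usage with random 16-dimensional "features" for 64 samples each:
print(frechet_distance(np.random.randn(64, 16), np.random.randn(64, 16)))
```

Lower values mean the generated distribution is closer to the real one; SyncNet scores, by contrast, measure how well lip movements match the audio.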

While EMO represents a significant advance in the field of video generation, the researchers acknowledge some limitations. The method is more time-consuming than others that do not rely on diffusion models. Also, without explicit control signals to guide the figure's movement, the model may inadvertently generate other body parts. Future work will likely address these challenges.

It is not clear whether and where Alibaba Group intends to apply this research or make it available to users. Such technologies offer enormous potential for abuse if they make it possible to create realistic deepfake videos in a very short time and, combined with one of the numerous audio deepfake tools, to put all kinds of words into the mouths of politicians, for example. Especially in the US election year 2024, experts are concerned about the risks of deepfakes.

But these developments are also exciting for entertainment and could be used in social media applications such as TikTok or in the film industry for more realistic dubbing.

Several companies have recently been working on AI lip-syncing, including AI startup Pika Labs, which just added lip-syncing to its promising AI video model.


Another prominent player in the field is HeyGen, which made a splash last year with a convincing combination of voice cloning and lip-syncing.

Summary
  • Alibaba researchers have developed EMO, an AI model that can create realistic lip-synced videos from a single image and audio track. It outperforms existing methods in terms of realism and expressiveness.
  • EMO uses a direct audio-to-video synthesis approach based on Stable Diffusion and was trained on a custom dataset with over 250 hours of diverse video content. Control mechanisms help stabilize the generated videos.
  • While EMO has limitations and potential for misuse, such as creating convincing deepfakes, it also offers exciting possibilities for entertainment, social media, and dubbing. Other companies like Pika Labs and HeyGen are also working on similar AI lip-syncing technologies.