
Microsoft researchers have developed a system called VASA-1 that generates lifelike talking faces for virtual AI avatars from a single photo and an audio clip. The generated videos show natural mouth movements, facial expressions, and head motion in real time.

Microsoft researchers have developed a new way to generate audio-driven talking faces with realistic lip synchronization, vivid facial expressions, and natural head movements. The system, called VASA-1, requires only a single photo and an audio file as input.

The model is trained on a large amount of face video data. After training, an encoder processes the input image and extracts aspects such as identity, 3D appearance, and facial dynamics. A diffusion model, conditioned on the audio, then generates the face and head movements. A decoder converts these movements, together with the appearance information, into the final video frames of the speaking face.
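
For illustration only, here is a rough sketch of that pipeline in Python. VASA-1's code is not public, so every class, method, and dimension below is a hypothetical placeholder that mirrors the steps described above, not Microsoft's actual implementation.

```python
import numpy as np

class FaceEncoder:
    """Extracts identity, 3D appearance, and facial-dynamics codes from one photo (placeholder)."""
    def encode(self, photo: np.ndarray) -> dict:
        return {"identity": np.zeros(128), "appearance": np.zeros(256), "dynamics": np.zeros(64)}

class MotionDiffusion:
    """Generates per-frame face and head motion latents conditioned on the audio (placeholder)."""
    def generate(self, audio: np.ndarray, sample_rate: int = 16_000,
                 fps: int = 40, controls: dict | None = None) -> np.ndarray:
        num_frames = int(len(audio) / sample_rate * fps)
        return np.zeros((num_frames, 64))  # one motion latent per video frame

class FrameDecoder:
    """Renders 512x512 frames from motion latents plus the static appearance codes (placeholder)."""
    def render(self, motion: np.ndarray, codes: dict) -> list[np.ndarray]:
        return [np.zeros((512, 512, 3), dtype=np.uint8) for _ in motion]

def talking_face_video(photo: np.ndarray, audio: np.ndarray,
                       controls: dict | None = None) -> list[np.ndarray]:
    codes = FaceEncoder().encode(photo)                            # single image -> latent codes
    motion = MotionDiffusion().generate(audio, controls=controls)  # audio -> motion sequence
    return FrameDecoder().render(motion, codes)                    # motion + appearance -> frames
```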

In experiments on two datasets, VASA-1 significantly outperforms previous methods in audio synchronization of lip and head movements and in video quality. The method delivers 512x512-pixel video at up to 40 fps in real time, with a latency of only 170 ms on an Nvidia RTX 4090 GPU.
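
As a quick back-of-the-envelope check (not a figure from the paper), those numbers imply a per-frame compute budget of roughly 25 ms and about seven frames' worth of up-front latency:

```python
fps = 40
latency_ms = 170

frame_budget_ms = 1000 / fps                       # time available per frame: 25 ms
frames_of_latency = latency_ms / frame_budget_ms   # ~6.8 frames before playback starts

print(f"{frame_budget_ms:.0f} ms per frame, ~{frames_of_latency:.1f} frames of preceding latency")
```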


Video: Microsoft

Control signals can be used to adjust the viewing direction, head distance, and emotion of the generated face.
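
In code terms, such signals could be imagined as an optional parameter set handed to the generator. The field names below are hypothetical and only echo the controls mentioned in the caption, reusing the talking_face_video sketch from above:

```python
# Hypothetical control signals for a VASA-1-style generator (illustrative only).
controls = {
    "viewing_direction": (0.1, -0.2),  # horizontal / vertical gaze offset
    "head_distance": 0.9,              # relative distance of the head from the camera
    "emotion": "happy",                # desired emotional tone
}

# video_frames = talking_face_video(photo, audio, controls=controls)
```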

Microsoft keeps VASA-1 under wraps

Microsoft researchers see VASA-1 as an important step toward creating lifelike digital AI avatars that can interact with people as naturally and intuitively as real people. Potential applications range from enriching digital communication and accessibility to interactive AI tutoring and therapeutic support.


Despite the potential for positive applications, the researchers also warn of possible abuses of the technology, such as creating misleading content with real people. As a result, Microsoft will not release VASA-1, but plans to expand the method to include the upper body, a more expressive 3D face model, non-rigid elements such as hair and clothing, and more expressive speech styles and emotions.

More examples can be found on the VASA-1 project page.

Summary
  • Microsoft researchers have developed VASA-1, a method that uses a single photo and an audio file to generate videos of speaking faces with natural mouth movements, facial expressions, and head movements in real time.
  • The model was trained on a large amount of facial video data and, in experiments, significantly outperformed previous methods in audio synchronization of lip and head movements and in video quality. On an Nvidia RTX 4090 GPU, it delivers 512x512-pixel video at up to 40 fps with a latency of just 170 ms.
  • Microsoft researchers see VASA-1 as an important step toward lifelike digital AI avatars for a wide range of applications, but also warn of potential abuse. Microsoft will therefore not release VASA-1, but plans further improvements.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.