Microsoft's VASA-1 generates lifelike avatars in real time
Key Points
- Microsoft researchers have developed VASA-1, a method that uses a single photo and an audio file to generate videos of speaking faces with natural mouth movements, facial expressions, and head movements in real time.
- The model was trained on a large amount of facial video data. In experiments, it significantly outperformed previous methods in how well lip and head movements synchronize with the audio, as well as in video quality. On an Nvidia RTX 4090 GPU, it delivers 512x512-pixel video at up to 40 fps with a latency of just 170 ms.
- Microsoft researchers see VASA-1 as an important step toward lifelike digital AI avatars for a wide range of applications, but they also warn of potential abuse. Microsoft will therefore not release VASA-1, though it plans further improvements.
Microsoft researchers have developed a process called VASA-1 that generates lifelike speaking faces for virtual AI avatars from photos and audio clips. The generated videos show natural mouth movements, facial expressions, and head movements in real time.
The new method generates audio-driven talking faces with realistic lip synchronization, vivid facial expressions, and natural head movements. The system, called VASA-1, requires only a single photo and an audio file as input.
The model is trained on a large corpus of face video data. After training, an encoder processes the input image and disentangles aspects such as identity, 3D appearance, and facial dynamics. A diffusion model then generates the face and head movements from the audio, and a decoder converts these movements into the final video frames of the speaking face.
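To make that three-stage pipeline concrete, here is a minimal PyTorch sketch of the structure described above. This is not Microsoft's code: every module name, tensor shape, the four-value control vector, and the single-step stand-in for the diffusion process are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FaceEncoder(nn.Module):
    """Disentangles a single photo into identity, appearance, and dynamics latents."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=4), nn.ReLU(),   # 256 -> 64
            nn.Conv2d(32, 64, 4, stride=4), nn.ReLU(),  # 64 -> 16
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_identity = nn.Linear(64, latent_dim)    # who the person is
        self.to_appearance = nn.Linear(64, latent_dim)  # 3D appearance (stand-in)
        self.to_dynamics = nn.Linear(64, latent_dim)    # initial facial dynamics

    def forward(self, image):
        h = self.backbone(image)
        return self.to_identity(h), self.to_appearance(h), self.to_dynamics(h)

class MotionDiffusion(nn.Module):
    """Stand-in for the audio-conditioned diffusion model over motion latents.
    A real diffusion model denoises iteratively; one conditioned step is
    sketched here as an MLP for brevity."""
    def __init__(self, latent_dim=128, audio_dim=64, ctrl_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + audio_dim + ctrl_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, noisy_motion, audio_feat, control):
        # control = e.g. [gaze_yaw, gaze_pitch, head_distance, emotion] (assumed)
        return self.net(torch.cat([noisy_motion, audio_feat, control], dim=-1))

class FrameDecoder(nn.Module):
    """Renders a 512x512 video frame from identity/appearance plus a motion latent."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * latent_dim, 64 * 8 * 8),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=4), nn.ReLU(),  # 8 -> 32
            nn.ConvTranspose2d(32, 16, 4, stride=4), nn.ReLU(),  # 32 -> 128
            nn.ConvTranspose2d(16, 3, 4, stride=4),              # 128 -> 512
        )

    def forward(self, identity, appearance, motion):
        return self.net(torch.cat([identity, appearance, motion], dim=-1))

# One generation step with dummy inputs (batch of 1):
photo = torch.randn(1, 3, 256, 256)
audio_feat = torch.randn(1, 64)                    # per-frame audio features (assumed)
control = torch.tensor([[0.1, -0.05, 1.0, 0.5]])   # gaze, distance, emotion (assumed)

encoder, diffusion, decoder = FaceEncoder(), MotionDiffusion(), FrameDecoder()
identity, appearance, dynamics = encoder(photo)
motion = diffusion(torch.randn_like(dynamics), audio_feat, control)
frame = decoder(identity, appearance, motion)
print(frame.shape)  # torch.Size([1, 3, 512, 512])
```

In the actual system, the diffusion model would denoise a whole sequence of motion latents conditioned on the audio, and the decoder would render one frame per latent; the sketch collapses each stage to a single step.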
In experiments on two datasets, VASA-1 significantly outperformed previous methods in how well lip and head movements synchronize with the audio, as well as in video quality. The method delivers 512x512-pixel video at up to 40 fps in real time, with a latency of only 170 ms on an Nvidia RTX 4090 GPU.
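For context, the throughput figure implies a tight per-frame compute budget; the arithmetic below uses only the numbers reported above:

```python
# Budget implied by the reported 40 fps; 170 ms is the one-off latency
# before the first frame on the RTX 4090 setup described above.
fps = 40
frame_budget_ms = 1000 / fps
print(f"compute budget per 512x512 frame: {frame_budget_ms:.0f} ms")  # 25 ms
print("latency before the first frame: 170 ms")
```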
Control signals can be used to adjust the viewing direction, head distance, and emotion of the generated avatar. (Videos: Microsoft)
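Continuing the hypothetical sketch above, steering the output would then amount to changing the conditioning vector; the control layout here is an assumption for illustration, not Microsoft's actual interface:

```python
# Same assumed control layout: [gaze_yaw, gaze_pitch, head_distance, emotion].
look_left = torch.tensor([[-0.4, 0.0, 1.0, 0.0]])
close_up_happy = torch.tensor([[0.0, 0.0, 0.6, 0.9]])
for control in (look_left, close_up_happy):
    motion = diffusion(torch.randn_like(dynamics), audio_feat, control)
    frame = decoder(identity, appearance, motion)  # (1, 3, 512, 512) per step
```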
Microsoft keeps VASA-1 under wraps
Microsoft researchers see VASA-1 as an important step toward creating lifelike digital AI avatars that can interact with people as naturally and intuitively as a real person. Potential applications range from richer digital communication and improved accessibility to interactive AI tutoring and therapeutic support.
Despite this potential, the researchers also warn of possible abuse of the technology, such as creating misleading content that impersonates real people. Microsoft will therefore not release VASA-1, but plans to extend the method to cover the upper body, a more expressive 3D face model, non-rigid elements such as hair and clothing, and more expressive speech styles and emotions.
More examples can be found on the VASA-1 project page.