Microsoft researchers have developed a method called VASA-1 that generates lifelike talking faces for virtual AI avatars from a photo and an audio clip. The generated videos show natural mouth movements, facial expressions, and head motion in real time.
The system, called VASA-1, generates audio-driven talking faces with realistic lip synchronization, vivid facial expressions, and natural head movements, and it requires only a single photo and an audio file as input.
The model is trained on a large corpus of face videos. After training, an encoder disentangles the input image into latent representations of identity, 3D appearance, and facial dynamics. A diffusion model, conditioned on the audio, then generates the facial and head motion sequence, and a decoder renders these motions, together with the identity and appearance latents, into the final video frames of the talking face.
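Microsoft has not published code, so the snippet below is only a minimal sketch of that three-stage data flow. All module names, layer sizes, tensor shapes, and latent dimensions are assumptions chosen for readability; they stand in for the face encoder, the audio-conditioned diffusion model, and the frame decoder described above, not for the actual VASA-1 implementation.

```python
# Minimal sketch of an encoder -> motion diffusion -> decoder pipeline.
# All shapes and dimensions are illustrative assumptions, not VASA-1's.
import torch
import torch.nn as nn


class FaceEncoder(nn.Module):
    """Splits a face image into identity, appearance, and dynamics latents."""

    def __init__(self, latent_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 64 * 64, 512), nn.ReLU()
        )
        self.to_identity = nn.Linear(512, latent_dim)
        self.to_appearance = nn.Linear(512, latent_dim)
        self.to_dynamics = nn.Linear(512, latent_dim)

    def forward(self, image):
        h = self.backbone(image)
        return self.to_identity(h), self.to_appearance(h), self.to_dynamics(h)


class MotionDiffusion(nn.Module):
    """Stand-in for the audio-conditioned diffusion model that produces
    a sequence of facial/head motion latents (one denoising step shown)."""

    def __init__(self, latent_dim=128, audio_dim=80):
        super().__init__()
        self.denoise = nn.Sequential(
            nn.Linear(latent_dim + audio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, noisy_motion, audio_features):
        # noisy_motion: (T, latent_dim), audio_features: (T, audio_dim)
        return self.denoise(torch.cat([noisy_motion, audio_features], dim=-1))


class FrameDecoder(nn.Module):
    """Renders per-frame images from identity/appearance latents plus motion."""

    def __init__(self, latent_dim=128):
        super().__init__()
        self.render = nn.Sequential(nn.Linear(3 * latent_dim, 3 * 64 * 64), nn.Sigmoid())

    def forward(self, identity, appearance, motion):
        num_frames = motion.shape[0]
        cond = torch.cat(
            [identity.expand(num_frames, -1), appearance.expand(num_frames, -1), motion],
            dim=-1,
        )
        return self.render(cond).view(num_frames, 3, 64, 64)


# One forward pass: a single photo plus audio features -> 25 video frames.
image = torch.rand(1, 3, 64, 64)
audio = torch.rand(25, 80)  # e.g. 25 frames of mel-spectrogram features
identity, appearance, _ = FaceEncoder()(image)
motion = MotionDiffusion()(torch.randn(25, 128), audio)
frames = FrameDecoder()(identity, appearance, motion)
print(frames.shape)  # torch.Size([25, 3, 64, 64])
```

In a real diffusion model, the motion latents would be produced by many iterative denoising steps per window of audio rather than the single pass shown here.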
In experiments on two datasets, VASA-1 significantly outperforms previous methods in how well lip and head movements are synchronized with the audio, as well as in video quality. The method generates 512x512-pixel video at up to 40 fps in real time, with a latency of only 170 ms on an Nvidia RTX 4090 GPU.
Optional control signals can be used to steer the gaze direction, head distance, and emotion of the generated face.
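As a rough illustration of how such controls could be fed into a motion generator like the one sketched above, the snippet below appends a few broadcast control values (gaze yaw and pitch, head distance, an emotion offset) to the per-frame audio features. The signal names, value ranges, and layout are hypothetical, not VASA-1's actual interface.

```python
# Hypothetical conditioning helper: the control names and ranges are
# assumptions for illustration, not VASA-1's real interface.
import torch


def build_condition(audio_features, gaze=(0.0, 0.0), head_distance=1.0, emotion_offset=0.0):
    """Concatenate per-frame audio features with broadcast control signals.

    gaze           -- (yaw, pitch) in radians, steers the viewing direction
    head_distance  -- relative scale of the head in the frame
    emotion_offset -- scalar nudging expressions toward a target emotion
    """
    num_frames = audio_features.shape[0]
    controls = torch.tensor([gaze[0], gaze[1], head_distance, emotion_offset])
    return torch.cat([audio_features, controls.expand(num_frames, -1)], dim=-1)


# 25 audio frames with 80 features each, plus 4 control values per frame.
cond = build_condition(torch.rand(25, 80), gaze=(0.3, -0.1), head_distance=0.9)
print(cond.shape)  # torch.Size([25, 84])
```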
Microsoft keeps VASA-1 under wraps
Microsoft researchers see VASA-1 as an important step toward lifelike digital AI avatars that can interact with people as naturally and intuitively as real humans do. Potential applications range from richer digital communication and improved accessibility to interactive AI tutoring and therapeutic support.
Despite the potential for positive applications, the researchers also warn of possible misuse of the technology, such as creating misleading content that impersonates real people. As a result, Microsoft will not release VASA-1, but plans to extend the method to cover the upper body, a more expressive 3D face model, non-rigid elements such as hair and clothing, and richer speech styles and emotions.
More examples can be found on the VASA-1 project page.