Microsoft researchers have developed a method called VASA-1 that generates lifelike talking faces for virtual AI avatars from a photo and an audio clip. The generated videos show natural mouth movements, facial expressions, and head motion in real time.
The system, called VASA-1, generates audio-driven talking faces with realistic lip synchronization, vivid facial expressions, and natural head movements, and it requires only a single photo and an audio file as input.
The model is trained on a large corpus of face videos. After training, an encoder disentangles the input image into latent representations of identity, 3D appearance, and facial dynamics. A diffusion model, conditioned on the audio, then generates the facial and head motion sequence, and a decoder renders these motions, together with the identity and appearance latents, into the final video frames of the talking face.
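Microsoft has not published code, so the snippet below is only a minimal sketch of that three-stage data flow. All module names, layer sizes, tensor shapes, and latent dimensions are assumptions chosen for readability; they stand in for the face encoder, the audio-conditioned diffusion model, and the frame decoder described above, not for the actual VASA-1 implementation.

```python
# Minimal sketch of an encoder -> motion diffusion -> decoder pipeline.
# All shapes and dimensions are illustrative assumptions, not VASA-1's.
import torch
import torch.nn as nn


class FaceEncoder(nn.Module):
    """Splits a face image into identity, appearance, and dynamics latents."""

    def __init__(self, latent_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 64 * 64, 512), nn.ReLU()
        )
        self.to_identity = nn.Linear(512, latent_dim)
        self.to_appearance = nn.Linear(512, latent_dim)
        self.to_dynamics = nn.Linear(512, latent_dim)

    def forward(self, image):
        h = self.backbone(image)
        return self.to_identity(h), self.to_appearance(h), self.to_dynamics(h)


class MotionDiffusion(nn.Module):
    """Stand-in for the audio-conditioned diffusion model that produces
    a sequence of facial/head motion latents (one denoising step shown)."""

    def __init__(self, latent_dim=128, audio_dim=80):
        super().__init__()
        self.denoise = nn.Sequential(
            nn.Linear(latent_dim + audio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, noisy_motion, audio_features):
        # noisy_motion: (T, latent_dim), audio_features: (T, audio_dim)
        return self.denoise(torch.cat([noisy_motion, audio_features], dim=-1))


class FrameDecoder(nn.Module):
    """Renders per-frame images from identity/appearance latents plus motion."""

    def __init__(self, latent_dim=128):
        super().__init__()
        self.render = nn.Sequential(nn.Linear(3 * latent_dim, 3 * 64 * 64), nn.Sigmoid())

    def forward(self, identity, appearance, motion):
        num_frames = motion.shape[0]
        cond = torch.cat(
            [identity.expand(num_frames, -1), appearance.expand(num_frames, -1), motion],
            dim=-1,
        )
        return self.render(cond).view(num_frames, 3, 64, 64)


# One forward pass: a single photo plus audio features -> 25 video frames.
image = torch.rand(1, 3, 64, 64)
audio = torch.rand(25, 80)  # e.g. 25 frames of mel-spectrogram features
identity, appearance, _ = FaceEncoder()(image)
motion = MotionDiffusion()(torch.randn(25, 128), audio)
frames = FrameDecoder()(identity, appearance, motion)
print(frames.shape)  # torch.Size([25, 3, 64, 64])
```

In a real diffusion model, the motion latents would be produced by many iterative denoising steps per window of audio rather than the single pass shown here.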
In experiments on two datasets, VASA-1 significantly outperforms previous methods in how well lip and head movements are synchronized with the audio, as well as in video quality. The method generates 512x512-pixel video at up to 40 fps in real time, with a latency of only 170 ms on an Nvidia RTX 4090 GPU.
Optional control signals can be used to steer the gaze direction, head distance, and emotion of the generated face.
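As a rough illustration of how such controls could be fed into a motion generator like the one sketched above, the snippet below appends a few broadcast control values (gaze yaw and pitch, head distance, an emotion offset) to the per-frame audio features. The signal names, value ranges, and layout are hypothetical, not VASA-1's actual interface.

```python
# Hypothetical conditioning helper: the control names and ranges are
# assumptions for illustration, not VASA-1's real interface.
import torch


def build_condition(audio_features, gaze=(0.0, 0.0), head_distance=1.0, emotion_offset=0.0):
    """Concatenate per-frame audio features with broadcast control signals.

    gaze           -- (yaw, pitch) in radians, steers the viewing direction
    head_distance  -- relative scale of the head in the frame
    emotion_offset -- scalar nudging expressions toward a target emotion
    """
    num_frames = audio_features.shape[0]
    controls = torch.tensor([gaze[0], gaze[1], head_distance, emotion_offset])
    return torch.cat([audio_features, controls.expand(num_frames, -1)], dim=-1)


# 25 audio frames with 80 features each, plus 4 control values per frame.
cond = build_condition(torch.rand(25, 80), gaze=(0.3, -0.1), head_distance=0.9)
print(cond.shape)  # torch.Size([25, 84])
```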
Microsoft keeps VASA-1 under wraps
Microsoft researchers see VASA-1 as an important step toward lifelike digital AI avatars that can interact with people as naturally and intuitively as real humans do. Potential applications range from richer digital communication and improved accessibility to interactive AI tutoring and therapeutic support.
Despite the potential for positive applications, the researchers also warn of possible misuse of the technology, such as creating misleading content that impersonates real people. As a result, Microsoft will not release VASA-1, but plans to extend the method to cover the upper body, a more expressive 3D face model, non-rigid elements such as hair and clothing, and richer speech styles and emotions.
More examples can be found on the VASA-1 project page.