
Microsoft researchers have developed a system called VASA-1 that generates lifelike talking faces for virtual AI avatars from a single photo and an audio clip. The generated videos show natural mouth movements, facial expressions, and head motion in real time.

Microsoft researchers have developed a new way to generate audio-driven talking faces with realistic lip synchronization, vivid facial expressions, and natural head movements. The system, called VASA-1, requires only a single photo and an audio file as input.

The model is trained on a large amount of face video data. After training, an encoder processes the input image and extracts aspects such as identity, 3D appearance, and facial dynamics. A diffusion model, conditioned on the audio, then generates the face and head movements. A decoder converts these movements, together with the appearance information, into the final video frames of the speaking face.
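
For illustration only, here is a rough sketch of that pipeline in Python. VASA-1's code is not public, so every class, method, and dimension below is a hypothetical placeholder that mirrors the steps described above, not Microsoft's actual implementation.

```python
import numpy as np

class FaceEncoder:
    """Extracts identity, 3D appearance, and facial-dynamics codes from one photo (placeholder)."""
    def encode(self, photo: np.ndarray) -> dict:
        return {"identity": np.zeros(128), "appearance": np.zeros(256), "dynamics": np.zeros(64)}

class MotionDiffusion:
    """Generates per-frame face and head motion latents conditioned on the audio (placeholder)."""
    def generate(self, audio: np.ndarray, sample_rate: int = 16_000,
                 fps: int = 40, controls: dict | None = None) -> np.ndarray:
        num_frames = int(len(audio) / sample_rate * fps)
        return np.zeros((num_frames, 64))  # one motion latent per video frame

class FrameDecoder:
    """Renders 512x512 frames from motion latents plus the static appearance codes (placeholder)."""
    def render(self, motion: np.ndarray, codes: dict) -> list[np.ndarray]:
        return [np.zeros((512, 512, 3), dtype=np.uint8) for _ in motion]

def talking_face_video(photo: np.ndarray, audio: np.ndarray,
                       controls: dict | None = None) -> list[np.ndarray]:
    codes = FaceEncoder().encode(photo)                            # single image -> latent codes
    motion = MotionDiffusion().generate(audio, controls=controls)  # audio -> motion sequence
    return FrameDecoder().render(motion, codes)                    # motion + appearance -> frames
```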

In experiments on two datasets, VASA-1 significantly outperforms previous methods in audio synchronization of lip and head movements and in video quality. The method delivers 512x512-pixel video at up to 40 fps in real time, with a latency of only 170 ms on an Nvidia RTX 4090 GPU.
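
As a quick back-of-the-envelope check (not a figure from the paper), those numbers imply a per-frame compute budget of roughly 25 ms and about seven frames' worth of up-front latency:

```python
fps = 40
latency_ms = 170

frame_budget_ms = 1000 / fps                       # time available per frame: 25 ms
frames_of_latency = latency_ms / frame_budget_ms   # ~6.8 frames before playback starts

print(f"{frame_budget_ms:.0f} ms per frame, ~{frames_of_latency:.1f} frames of preceding latency")
```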


Video: Microsoft

Control signals can be used to adjust the viewing direction, head distance, and emotion of the generated face.
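
In code terms, such signals could be imagined as an optional parameter set handed to the generator. The field names below are hypothetical and only echo the controls mentioned in the caption, reusing the talking_face_video sketch from above:

```python
# Hypothetical control signals for a VASA-1-style generator (illustrative only).
controls = {
    "viewing_direction": (0.1, -0.2),  # horizontal / vertical gaze offset
    "head_distance": 0.9,              # relative distance of the head from the camera
    "emotion": "happy",                # desired emotional tone
}

# video_frames = talking_face_video(photo, audio, controls=controls)
```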

Microsoft keeps VASA-1 under wraps

Microsoft researchers see VASA-1 as an important step toward creating lifelike digital AI avatars that can interact with people as naturally and intuitively as real people. Potential applications range from enriching digital communication and accessibility to interactive AI tutoring and therapeutic support.


Despite the potential for positive applications, the researchers also warn of possible abuses of the technology, such as creating misleading content with real people. As a result, Microsoft will not release VASA-1, but plans to expand the method to include the upper body, a more expressive 3D face model, non-rigid elements such as hair and clothing, and more expressive speech styles and emotions.

More examples can be found on the VASA-1 project page.

Summary
  • Microsoft researchers have developed VASA-1, a method that uses a single photo and an audio file to generate videos of speaking faces with natural mouth movements, facial expressions, and head movements in real time.
  • The model was trained on a large amount of facial video data and, in experiments, significantly outperformed previous methods in audio synchronization of lip and head movements and in video quality. On an Nvidia RTX 4090 GPU, it delivers 512x512-pixel video at up to 40 fps with a latency of just 170 ms.
  • Microsoft researchers see VASA-1 as an important step toward lifelike digital AI avatars for a wide range of applications, but also warn of potential abuse. Microsoft will therefore not release VASA-1, but plans further improvements.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.