Meta and University of Waterloo researchers have built MoCha, an AI system that generates complete character animations with synchronized speech and natural movements.
Unlike previous models that focused only on faces, MoCha renders full-body motion from various camera angles, combining lip synchronization with gestures and interactions between multiple characters. Early demonstrations focus on close-up and semi-close-up shots, where the system generates upper-body movements and gestures that align with the spoken dialogue.
Video: Wei et al.
MoCha runs on a diffusion transformer model with 30 billion parameters. It produces HD video clips around five seconds long at 24 frames per second, putting it on par with current video generation models.
Improving lip sync accuracy
The system introduces a "Speech-Video Window Attention" mechanism to address two persistent problems in AI video generation: video is compressed during processing while the audio stays at full resolution, and generating frames in parallel tends to push lip movements out of sync with the speech.

MoCha tackles both by limiting each frame's attention to a specific window of audio tokens. This mirrors how human speech works: lip movements depend on the sounds immediately around them, while body language follows broader patterns in the text. Adding a few tokens before and after each frame's audio window creates smoother transitions and more accurate lip synchronization.
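The paper does not ship reference code, but the core idea can be sketched as a cross-attention mask. The minimal Python example below is our own illustration under that assumption, not Meta's implementation: each video frame may only attend to the audio tokens aligned with it, plus a small padding window on either side.

```python
import torch

def window_attention_mask(num_frames: int, num_audio_tokens: int,
                          pad: int = 2) -> torch.Tensor:
    """Boolean mask where mask[f, a] is True if video frame f may attend
    to audio token a. Each frame sees only its aligned audio tokens plus
    `pad` extra tokens on either side for smoother transitions.
    (A hypothetical simplification of speech-video window attention.)"""
    tokens_per_frame = num_audio_tokens / num_frames
    mask = torch.zeros(num_frames, num_audio_tokens, dtype=torch.bool)
    for f in range(num_frames):
        start = int(f * tokens_per_frame)
        end = int((f + 1) * tokens_per_frame)
        lo = max(0, start - pad)
        hi = min(num_audio_tokens, end + pad)
        mask[f, lo:hi] = True
    return mask

# Example: 120 frames (5 seconds at 24 fps) against 240 audio tokens.
mask = window_attention_mask(num_frames=120, num_audio_tokens=240, pad=2)
# In a real model, masked positions would be set to -inf before the
# softmax in the cross-attention between video and audio tokens.
```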

The researchers trained the system on 300 hours of carefully filtered video footage, though they haven't disclosed the source material. They supplemented this with video sequences labeled only with text to expand the range of expressions and interactions the model can generate.
Managing multiple characters
For scenes with multiple characters, the team developed a streamlined prompt system. Users can define characters once and refer to them with simple tags like 'Person1' or 'Person2' throughout different scenes, eliminating the need for repeated descriptions.
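As an illustration, a multi-character prompt in this style might look like the following. The wording and layout are our own guess at the format; the paper's examples use tags like 'Person1' in this manner:

```text
Characters:
Person1: A woman with short gray hair, wearing a dark blazer.
Person2: A young man in a denim jacket.

Clip 1: Close-up of Person1 speaking directly to the camera.
Clip 2: Person2 listens, then replies to Person1 with a smile.
```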

In tests across 150 different scenarios, MoCha outperformed comparable systems in both lip synchronization and the naturalness of its movements. Human evaluators rated the generated videos as realistic.

According to the research team, MoCha shows promise for applications ranging from digital assistants and virtual avatars to advertising and educational content. Meta hasn't revealed whether the system will become open source or remain a research prototype.
The timing of MoCha's development is notable, as major social media companies race to advance AI-powered video technology. Meta recently introduced Movie Gen, while TikTok parent ByteDance is developing its own suite of AI animation systems, including INFP, OmniHuman-1, and Goku.