Bytedance has unveiled DreamActor-M1, a new AI system that gives users precise control over facial expressions and body movements in generated videos.


The system uses what the company calls "hybrid guidance" - a combination of multiple control signals working together. DreamActor-M1's architecture has three main components. At its core is a facial encoder that can modify expressions independently of a person's identity or head position. According to Bytedance researchers, this solves a common limitation of earlier systems, which tended to entangle expression with identity and pose.
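
Bytedance hasn't published code, but the disentanglement idea can be sketched roughly: learned query tokens pull expression information out of a face crop via cross-attention, while identity comes from the reference image through a separate path. All module names and dimensions below are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class FaceEncoder(nn.Module):
    """Encodes a cropped face into expression tokens that carry no identity
    or head-pose information (hypothetical sketch)."""
    def __init__(self, in_dim=512, token_dim=768, num_tokens=4):
        super().__init__()
        self.backbone = nn.Linear(in_dim, token_dim)  # stand-in for a real vision backbone
        self.expr_tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.attn = nn.MultiheadAttention(token_dim, num_heads=8, batch_first=True)

    def forward(self, face_features):
        # face_features: (batch, seq, in_dim) patch features of the face crop
        feats = self.backbone(face_features)
        queries = self.expr_tokens.unsqueeze(0).expand(feats.size(0), -1, -1)
        # Learned queries attend to the face features and extract expression
        # only; identity is supplied elsewhere by the reference image, so
        # training pressure routes it there instead of through these tokens.
        expr, _ = self.attn(queries, feats, feats)
        return expr  # (batch, num_tokens, token_dim)
```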

The demo shows facial expressions and audio from one video being mapped onto both an animated character and a real person. | Video: Bytedance

The system manages head movements through a 3D model using colored spheres to direct gaze and head orientation. For body motion, it employs a 3D skeleton system with an adaptive layer that adjusts for different body types to create more natural movement.
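
The adaptive layer itself isn't published, but the underlying retargeting idea is standard: keep the driving pose's bone directions and swap in the target body's bone lengths. A minimal sketch under those assumptions (function and parameter names are hypothetical):

```python
import numpy as np

def retarget_skeleton(driving_joints, target_bone_lengths, parents):
    """Transfer a driving pose onto a body with different proportions:
    keep each bone's direction, replace its length (hypothetical sketch,
    not Bytedance's actual adaptive layer).

    driving_joints: (J, 3) joint positions, ordered parent-before-child
    target_bone_lengths: (J,) length of the bone from each joint to its parent
    parents: parent index per joint, -1 for the root
    """
    out = np.zeros_like(driving_joints)
    for j, p in enumerate(parents):
        if p == -1:
            out[j] = driving_joints[j]  # keep the root where it is
            continue
        bone = driving_joints[j] - driving_joints[p]
        direction = bone / (np.linalg.norm(bone) + 1e-8)
        # Same direction as the driving motion, but the target's bone length
        out[j] = out[p] + direction * target_bone_lengths[j]
    return out
```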

Detailed diagram of the DreamActor-M1 system. On the left, video frames of a dancing person serve as input. The middle shows three parallel processing paths: pose estimation (top), face tracking (middle), and face alignment (bottom), which are encoded into separate latent representations and processed by diffusion transformer (DiT) blocks. On the right: the architecture of a DiT block with its self-attention, reference-attention, and face-attention mechanisms.
The system processes body movements and facial expressions separately before combining them in a diffusion transformer to create more lifelike animations. | Image: Bytedance
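
Going by the block diagram, one DiT block stacks three attention passes: self-attention over the video latents, reference-attention to the appearance image, and face-attention to the expression tokens. A hedged sketch of that wiring (dimensions and residual layout are assumptions):

```python
import torch
import torch.nn as nn

class HybridDiTBlock(nn.Module):
    """One diffusion-transformer block combining the three attention paths
    shown in Bytedance's diagram (wiring here is an assumption)."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ref_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.face_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, ref_tokens, face_tokens):
        # x: noisy video latents; pose/skeleton conditioning is assumed to be
        # merged into x before this block.
        h = self.norms[0](x)
        x = x + self.self_attn(h, h, h)[0]                       # spatio-temporal mixing
        h = self.norms[1](x)
        x = x + self.ref_attn(h, ref_tokens, ref_tokens)[0]      # inject appearance
        h = self.norms[2](x)
        x = x + self.face_attn(h, face_tokens, face_tokens)[0]   # inject expression
        return x + self.mlp(self.norms[3](x))
```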

During the training phase, the model learns from images at various angles. The researchers say this allows it to generate new viewpoints even from a single portrait, filling in missing details like clothing and pose intelligently.
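
One plausible way to set up such training, sketched here as an assumption rather than the paper's actual sampling scheme, is to pair frames from the same clip whose head angles differ enough that the model must synthesize an unseen view:

```python
import random

def sample_view_pair(frames, yaws, min_gap_deg=20.0, max_tries=50):
    """Pick a reference and a target frame from the same clip whose head yaw
    differs by at least min_gap_deg (hypothetical sampling scheme)."""
    for _ in range(max_tries):
        i, j = random.sample(range(len(frames)), 2)
        if abs(yaws[i] - yaws[j]) >= min_gap_deg:
            return frames[i], frames[j]  # (reference, target)
    # Fall back to any pair if the clip has little angular variety
    i, j = random.sample(range(len(frames)), 2)
    return frames[i], frames[j]
```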

Overview diagram: the inference pipeline of the generative AI, which synthesizes videos of animated people from control signals and reference images.
DreamActor-M1 creates multiple views from one reference image, processes facial and body movements separately, then combines them to produce the final animated video. | Image: Bytedance

Training happens in three stages: first the model works on basic body and head movement, then it adds precisely controlled facial expressions, and finally it optimizes everything together for more coordinated results. Bytedance says the model was trained on 500 hours of video, with equal parts full-body and upper-body footage.
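
In pseudocode, such a staged schedule could amount to toggling which sub-modules receive gradients in each phase (the sub-module names below are hypothetical, not DreamActor-M1's actual ones):

```python
# Hypothetical staged-training schedule mirroring the three phases above.
STAGES = [
    {"name": "body_and_head",  "train": ["pose_branch", "dit"]},
    {"name": "facial_control", "train": ["face_encoder", "face_attn"]},
    {"name": "joint_finetune", "train": ["pose_branch", "face_encoder", "face_attn", "dit"]},
]

def apply_stage(model, stage):
    """Enable gradients only for the sub-modules trained in this stage;
    everything else stays frozen."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(prefix) for prefix in stage["train"])
```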

According to the researchers, DreamActor-M1 outperforms similar systems, including commercial products like Runway Act-One, in both visual quality and motion control precision.

Video: Bytedance

The system does have limitations. It cannot handle dynamic camera movements, object interactions, or extreme differences in body proportions between source and target. Complex scene transitions also remain challenging.


Bytedance, which owns TikTok, is developing several AI avatar animation projects simultaneously. Earlier this year, the company launched OmniHuman-1, which is already available as a lip-sync tool on CapCut's Dreamina platform, showing how quickly Bytedance can bring research to users. Other ongoing projects include the Goku video AI series and InfiniteYou portrait generator.

Summary
  • Bytedance is working on DreamActor-M1, an AI that generates videos of people based on single photos, using three separate modules to control facial expressions, head movement, and body posture.
  • The system is trained on 500 hours of video in three steps: learning movements, then facial expressions, and finally how to combine both, achieving better results in tests than similar models.
  • Despite its progress, the technology cannot yet handle camera motion, object interaction, or people with very different body shapes.
Jonathan writes for THE DECODER about how AI tools can make our work and creative lives better.