ByteDance's AI can animate both real people and cartoon characters from a single image
Key Points
- ByteDance has introduced OmniHuman-1, a new framework that can generate videos from image and audio examples.
- In addition to text, audio, and images, OmniHuman's multi-stage training also takes body poses into account. The training dataset consisted of approximately 19,000 hours of video material.
- OmniHuman generates high quality, lifelike animations. It is not yet clear whether ByteDance, the company behind TikTok, will use the technology on its platforms.
Researchers at TikTok parent company ByteDance have presented OmniHuman-1, a new framework for generating videos from image and audio samples.
The system turns still images into videos by adding movement and speech. A demonstration shows Nvidia CEO Jensen Huang appearing to sing, highlighting both the system's capabilities and its potential risks.
https://www.youtube.com/watch?v=XF5vOR7Bpzs
ByteDance researchers developed OmniHuman-1 to solve a key challenge in AI video generation: creating natural human movements at scale. Previous systems didn't benefit from simply adding more training data, because much of it contained irrelevant information that had to be filtered out, and that filtering often discarded valuable movement patterns along the way.
To address this, OmniHuman processes multiple types of input simultaneously: text, images, audio, and body poses. This approach allows the system to use more of its training data effectively. The researchers fed it about 19,000 hours of video material to learn from.
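To make the idea concrete, here is a minimal sketch of what such mixed-condition training could look like. It assumes that missing modalities are replaced with learned placeholder embeddings so partially annotated clips still contribute to training; the `OmniConditioner` class, its dimensions, and the fusion layer are hypothetical illustrations, not ByteDance's actual architecture.

```python
# Hypothetical sketch of mixed-condition training: each clip may carry any
# subset of text, audio, and pose annotations. Instead of discarding
# partially annotated data, absent conditions are stood in for by learned
# "null" embeddings, so the whole corpus remains usable.
import torch
import torch.nn as nn

class OmniConditioner(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Learned placeholder tokens for absent modalities (assumed design).
        self.null_text = nn.Parameter(torch.zeros(1, dim))
        self.null_audio = nn.Parameter(torch.zeros(1, dim))
        self.null_pose = nn.Parameter(torch.zeros(1, dim))
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, text=None, audio=None, pose=None):
        # Infer the batch size from whichever condition is present.
        batch = next(x for x in (text, audio, pose) if x is not None).shape[0]
        text = text if text is not None else self.null_text.expand(batch, -1)
        audio = audio if audio is not None else self.null_audio.expand(batch, -1)
        pose = pose if pose is not None else self.null_pose.expand(batch, -1)
        return self.fuse(torch.cat([text, audio, pose], dim=-1))

# A clip with audio but no pose annotation still yields a usable condition:
cond = OmniConditioner()(text=torch.randn(2, 512), audio=torch.randn(2, 512))
print(cond.shape)  # torch.Size([2, 512])
```

The practical benefit is that a clip lacking, say, pose labels no longer has to be thrown away, which matches the scaling problem the researchers describe.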

From still frames to fluid motion
The system first processes each input type separately, compressing text descriptions, reference images, audio signals, and body-pose data into a compact representation. It then gradually refines this into realistic video output, learning to generate smooth motion by comparing its results with real videos.
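As a rough illustration of that refinement stage, the following sketch trains a generic diffusion-style denoiser to recover the noise added to latents taken from real videos, conditioned on the fused input vector. The `denoiser` network, the linear noise schedule, and all shapes are assumptions for illustration, not OmniHuman's implementation.

```python
# Illustrative diffusion-style training step: corrupt a latent from a real
# video, then train a network to predict the added noise, conditioned on
# the fused text/image/audio/pose vector. All modules and shapes are
# hypothetical, not OmniHuman's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

denoiser = nn.Sequential(
    nn.Linear(512 + 512 + 1, 1024),  # noisy latent + condition + noise level
    nn.SiLU(),
    nn.Linear(1024, 512),
)

def training_step(clean_latent, cond, optimizer):
    t = torch.rand(clean_latent.shape[0], 1)    # random noise level in [0, 1)
    noise = torch.randn_like(clean_latent)
    noisy = (1 - t) * clean_latent + t * noise  # simple linear corruption schedule
    pred = denoiser(torch.cat([noisy, cond, t], dim=-1))
    loss = F.mse_loss(pred, noise)              # learn to predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
loss = training_step(torch.randn(4, 512), torch.randn(4, 512), opt)
```

The "comparing with real videos" described above corresponds here to the loss term: the network is penalized whenever its prediction deviates from what was actually added to the real-video latent.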

The results show natural mouth movements and gestures that match the spoken content well. The system handles body proportions and environments better than previous models, the team reports.

Beyond photos of real people, the system can also animate cartoon characters effectively.
Video: ByteDance
Theoretically unlimited AI videos
The length of generated videos isn't limited by the model itself, but by available memory. The project page shows examples ranging from five to 25 seconds.
This release follows ByteDance's recent introduction of INFP, a similar project focused on animating faces in conversations. With TikTok and the video editor CapCut reaching massive user bases, ByteDance already deploys AI features at scale. The company announced plans to focus heavily on AI development in February 2024.