Date: 2025-01-05
AI research
Jonathan Kemper

ByteDance's new AI model brings still images to life with audio

Jonathan works as a freelance tech journalist for THE DECODER, focusing on AI tools and how GenAI can be used in everyday work.

TikTok's parent company ByteDance has developed an AI system called INFP that can make static portrait photos appear to speak and react to audio input.

What sets INFP (which stands for "Interactive, Natural, Flash and Person-generic") apart is its ability to create realistic conversation videos between two people without needing anyone to manually assign speaking and listening roles. The system figures out these roles automatically as the conversation flows.

The system works in two main steps. In the first step, which ByteDance calls "Motion-Based Head Imitation," the AI learns to pick up on all the little details of how people communicate - things like facial expressions and the way they move their heads during conversations. It takes these movements from videos and turns them into data that it can use later. This motion data can then animate a still photo to match the original person's movements.
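
To make the first step more concrete, here is a minimal sketch in Python. It is not ByteDance's implementation: INFP learns its motion representation with neural networks, while this toy uses plain 2D landmark offsets as a stand-in. The function names `encode_motion` and `animate` and the landmark data are assumptions made purely for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def encode_motion(frames: List[List[Point]]) -> List[List[Point]]:
    """Turn each driving frame into per-landmark offsets from the first frame."""
    reference = frames[0]
    return [
        [(x - rx, y - ry) for (x, y), (rx, ry) in zip(frame, reference)]
        for frame in frames
    ]


def animate(portrait: List[Point], motion: List[List[Point]]) -> List[List[Point]]:
    """Apply the stored offsets to the landmarks of a still portrait."""
    return [
        [(px + dx, py + dy) for (px, py), (dx, dy) in zip(portrait, offsets)]
        for offsets in motion
    ]


# Hypothetical data: two landmarks tracked over two video frames.
driving_video = [
    [(0.0, 0.0), (1.0, 0.0)],  # frame 0: neutral pose
    [(0.0, 0.5), (1.0, 0.5)],  # frame 1: both landmarks shift by 0.5 in y
]
portrait = [(10.0, 10.0), (12.0, 10.0)]  # landmarks of the still photo

animated = animate(portrait, encode_motion(driving_video))
print(animated[1])  # [(10.0, 10.5), (12.0, 10.5)]
```

The point of the toy is the separation of concerns: motion is stored independently of any one face, so the same motion data can drive a different portrait.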

In the second step, "Audio-Guided Motion Generation," the system learns to match sounds with natural-looking movements. The team developed what they call a "motion guider" that creates patterns for both speaking and listening by analyzing the audio from both sides of a conversation. A diffusion transformer then takes these patterns and gradually refines them into smooth, realistic movements that match the audio.
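
The idea of starting from a rough signal and refining it step by step can be sketched as follows. This is a loose illustration, not a diffusion transformer: a real diffusion model uses a trained network to predict what to remove at each step, whereas this toy already knows its target and simply closes part of the remaining gap on every iteration. The `motion_guider` mapping (own audio drives mouth motion, partner audio drives nodding), the `refine` function, and the sample loudness values are all assumptions.

```python
import random
from typing import List, Tuple

Motion = Tuple[float, float]  # (mouth_opening, head_nod), both hypothetical


def motion_guider(own_audio: List[float], partner_audio: List[float]) -> List[Motion]:
    """Hypothetical stand-in: derive a per-frame target from both audio tracks."""
    # Assumption: own loudness drives the mouth, partner loudness drives nodding.
    return [(own, 0.5 * other) for own, other in zip(own_audio, partner_audio)]


def refine(targets: List[Motion], steps: int = 50, rate: float = 0.3,
           seed: int = 0) -> List[Motion]:
    """Start from random noise and close part of the gap to the target each step."""
    rng = random.Random(seed)
    motion = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in targets]
    for _ in range(steps):
        motion = [
            (m0 + rate * (t0 - m0), m1 + rate * (t1 - m1))
            for (m0, m1), (t0, t1) in zip(motion, targets)
        ]
    return motion


# Hypothetical per-frame loudness: the person speaks in frame 0, listens in frame 1.
targets = motion_guider([1.0, 0.0], [0.0, 1.0])
result = refine(targets)
```

Because both audio tracks feed into the target, nothing in the sketch has to label a frame as "speaking" or "listening"; the behavior follows from whichever track is active.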

Technical diagram: two-stage AI architecture for audio-controlled facial animation with motion-based head imitation and audio-guided motion generation.
In the first step, movement patterns are extracted from conversation videos; in the second step, these movements are controlled by audio input. | Image: ByteDance

Teaching AI how real conversations work

To train the system, the team built a new dataset of real-world conversations called DyConv. It contains more than 200 hours of video of people talking to each other, gathered from across the internet.

While there are other conversation databases out there, like ViCo and RealTalk, the team says DyConv offers something special - it captures a wider range of human emotions and expressions, and the video quality is notably better.

Image sequences with audio waveforms: Three rows of recording frames show different expressions during speech synthesis with DIM, GT and own method.
The black and blue audio waveforms represent the speaking parts of the conversation partners, while the image sequences show the corresponding generated motion patterns. | Image: ByteDance

ByteDance says its system outperforms existing tools in several key areas. INFP is particularly good at matching lip movements to speech, preserving the person's unique facial features, and creating a wide variety of natural-looking movements. The team also found that it works well for creating videos of someone just listening to a conversation.

What's next for INFP

Currently, INFP only works with audio, but the team sees many ways to expand its capabilities. They're exploring how to make the system work with images and text, which would open up all kinds of new possibilities. Their next goal is to create realistic animations of people's entire bodies, not just their heads and facial expressions.

The researchers know this kind of technology could be misused to create fake videos and spread false information. That's why they're planning to keep the core technology limited to research institutions - much like Microsoft did last summer with their advanced voice cloning system.

This work is just one piece of ByteDance's bigger AI strategy, which they announced earlier this year. With popular apps like TikTok and CapCut in their portfolio, the company has a massive platform for putting these AI innovations to use.

Summary
  • Researchers at ByteDance, the parent company of TikTok, have developed an AI framework called INFP that can generate lifelike videos of conversations between two people without anyone having to manually assign the roles of "listener" and "speaker".
  • INFP consists of two steps: In the first step, the model learns communicative behavior from videos. In the second step, the model learns to map audio input to movements.
  • To support the research, the authors have also compiled the DyConv dataset, which contains more than 200 hours of high-quality dialog videos.
Sources
arXiv, GitHub