
TikTok's parent company ByteDance has developed an AI system called INFP that can make static portrait photos appear to speak and react to audio input.


What sets INFP (which stands for "Interactive, Natural, Flash and Person-generic") apart is its ability to create realistic conversation videos between two people without needing anyone to manually assign speaking and listening roles. The system figures out these roles automatically as the conversation flows.
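To make that concrete, here is a toy Python sketch. INFP infers the roles implicitly inside the model, so this is not ByteDance's method - it just illustrates the idea by deriving a per-frame role from the energy of the two audio tracks, with every function name and number an assumption of ours.

```python
import numpy as np

# Toy illustration only: INFP decides speaking/listening implicitly in
# latent space. Here we approximate the idea with a simple audio-energy
# comparison, so the roles come from the audio alone - no manual
# speaker/listener assignment appears anywhere in the interface.

def frame_energy(audio, sr=16000, fps=25):
    """Mean squared amplitude of the audio behind each video frame."""
    hop = sr // fps
    n = len(audio) // hop
    return np.array([np.mean(audio[i * hop:(i + 1) * hop] ** 2)
                     for i in range(n)])

def infer_roles(audio_a, audio_b):
    """Per-frame roles for two people, derived from their audio alone."""
    e_a, e_b = frame_energy(audio_a), frame_energy(audio_b)
    return np.where(e_a > e_b, "A speaks, B listens", "B speaks, A listens")

# Two seconds of fake audio: A talks in the first second, B in the second.
sr = 16000
a = np.concatenate([np.random.randn(sr), np.zeros(sr)]) * 0.1
b = np.concatenate([np.zeros(sr), np.random.randn(sr)]) * 0.1
print(infer_roles(a, b)[::10])  # roles flip mid-clip without any labels
```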

The system works in two main steps. In the first step, which ByteDance calls "Motion-Based Head Imitation," the AI learns to pick up on the small details of how people communicate - facial expressions and the way they move their heads during conversations. It extracts these movements from videos and compresses them into compact motion data it can reuse later. That motion data can then animate a still photo to match the original person's movements.
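ByteDance hasn't released INFP's code, so as a rough, hypothetical sketch of this first step, the pipeline might be organized like the following PyTorch snippet; all module names, layer choices, and dimensions are our assumptions for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of step one ("Motion-Based Head Imitation"):
# compress facial motion from real video frames into compact latent
# codes, then use a code to drive a static portrait. Layer choices
# and sizes are assumptions, not ByteDance's architecture.

class MotionEncoder(nn.Module):
    """Maps a video frame to a low-dimensional motion latent
    (expression and head pose), discarding identity and appearance."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, frame):
        return self.net(frame)

class PortraitAnimator(nn.Module):
    """Repaints a static portrait according to a motion latent."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.to_residual = nn.Linear(latent_dim, 3 * 64 * 64)

    def forward(self, portrait, motion):
        # Crude stand-in for a real generator: add a motion-conditioned
        # residual onto the still image.
        delta = self.to_residual(motion).view(-1, 3, 64, 64)
        return portrait + delta

encoder, animator = MotionEncoder(), PortraitAnimator()
driving_frame = torch.randn(1, 3, 64, 64)  # frame from a real video
portrait = torch.randn(1, 3, 64, 64)       # static photo to animate
motion = encoder(driving_frame)            # step-one motion latent
animated = animator(portrait, motion)      # photo mimics the motion
print(motion.shape, animated.shape)
```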

In the second step, "Audio-Guided Motion Generation," the system figures out how to match sounds with natural-looking movements. The team developed what they call a "motion guider" that creates patterns for both speaking and listening by analyzing the audio from both sides of a conversation. Then, a special AI component called a diffusion transformer takes these patterns and gradually refines them into smooth, realistic movements that match the audio.
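The second step could be sketched along the same hypothetical lines; again, every name and dimension below is an assumption, and the short loop at the end is a heavily simplified stand-in for a real diffusion sampler.

```python
import torch
import torch.nn as nn

LATENT, FRAMES = 64, 25  # assumed motion latent size and clip length

class MotionGuider(nn.Module):
    """Builds speaking-and-listening guidance from both audio tracks."""
    def __init__(self, audio_dim=128):
        super().__init__()
        self.proj = nn.Linear(2 * audio_dim, LATENT)

    def forward(self, audio_self, audio_other):
        # Conditioning on BOTH sides of the conversation is what lets
        # the model cover everything from active speech to silent
        # listening reactions.
        return self.proj(torch.cat([audio_self, audio_other], dim=-1))

class DiffusionTransformer(nn.Module):
    """Refines noisy motion latents under the audio guidance."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=LATENT, nhead=4,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, noisy_motion, guidance):
        return self.backbone(noisy_motion + guidance)

guider, denoiser = MotionGuider(), DiffusionTransformer()
audio_a = torch.randn(1, FRAMES, 128)  # features of person A's audio
audio_b = torch.randn(1, FRAMES, 128)  # features of person B's audio
guidance = guider(audio_a, audio_b)

motion = torch.randn(1, FRAMES, LATENT)  # start from pure noise
for _ in range(10):  # crude stand-in for a proper diffusion schedule
    motion = motion - 0.1 * denoiser(motion, guidance)
print(motion.shape)  # motion latents, ready for the step-one animator
```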

Diagram of INFP's two-step architecture: in the first step, movement patterns are extracted from conversation videos; in the second, those movements are controlled by audio input. | Image: ByteDance

Teaching AI how real conversations work

To train their system properly, the team had to build something new: a collection of real-world conversations they called DyConv. They gathered over 200 hours of people talking to each other from videos across the internet.

While there are other conversation databases out there, like ViCo and RealTalk, the team says DyConv stands out by capturing a wider range of human emotions and expressions at notably higher video quality.
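ByteDance hasn't published DyConv's on-disk format, but the two-sided structure such a dataset needs can be sketched in a few lines; every field name here is an assumption meant only to show that each sample carries both partners' video and audio, with no role labels.

```python
from dataclasses import dataclass

# Hypothetical sample layout for a dyadic conversation dataset.
# DyConv's real format is unpublished; the point is the structure:
# paired video and audio for BOTH partners, no speaker/listener labels.

@dataclass
class DyadicClip:
    video_a: str  # path to person A's face-crop video
    video_b: str  # path to person B's face-crop video
    audio_a: str  # person A's audio track
    audio_b: str  # person B's audio track

clip = DyadicClip("a.mp4", "b.mp4", "a.wav", "b.wav")
print(clip)
```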

The black and blue audio waveforms represent the two conversation partners' speech, while the image sequences show the motion generated by the DIM baseline, the ground truth (GT), and ByteDance's own method. | Image: ByteDance

ByteDance says its system outperforms existing tools in several key areas. INFP is particularly good at matching lip movements to speech, preserving the person's unique facial features, and creating a wide variety of natural-looking movements. The team also found that it works well for creating videos of someone just listening to a conversation.

What's next for INFP

Currently, INFP only works with audio, but the team sees several ways to expand its capabilities. They're exploring support for image and text input, and their next goal is to create realistic animations of people's entire bodies, not just their heads and facial expressions.

The researchers know this kind of technology could be misused to create fake videos and spread false information. That's why they're planning to keep the core technology limited to research institutions - much like Microsoft did last summer with its VALL-E 2 voice cloning system.


This work is just one piece of ByteDance's bigger AI strategy, which they announced earlier this year. With popular apps like TikTok and CapCut in their portfolio, the company has a massive platform for putting these AI innovations to use.

Summary
  • Researchers at ByteDance, the parent company of TikTok, have developed an AI framework called INFP that can generate lifelike dialog videos for conversations between two people without having to manually assign the roles of "listener" and "speaker".
  • INFP consists of two steps: in the first, the model learns communicative behavior from videos; in the second, it learns to map audio input to movements.
  • To support the research, the authors also compiled the DyConv dataset, which contains more than 200 hours of high-quality dialog videos.