Adobe Research and University of Michigan researchers have created an AI system that generates Foley sounds—the custom sound effects added to films and videos during post-production.

The system, called MultiFoley, lets users create sounds through text prompts, reference audio, or video examples. In demonstrations, the system transformed a cat's meow into a lion's roar and made typewriter sounds play like piano notes, all while maintaining precise synchronization with the video.
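
To make those input modes concrete, here is a minimal sketch of what a request to a system like MultiFoley could look like. The class and field names are illustrative assumptions for this article, not Adobe's actual API.

```python
# Hedged sketch: a request object for a multimodal Foley generator.
# All names here are hypothetical, not Adobe's real interface.
from dataclasses import dataclass
from typing import Optional

@dataclass
class FoleyRequest:
    video_path: str                        # silent clip that needs a sound effect
    text_prompt: Optional[str] = None      # e.g. "a lion roaring"
    reference_audio: Optional[str] = None  # audio whose character should be imitated
    reference_video: Optional[str] = None  # example clip whose sound should be transferred

# The cat-to-lion demo from the article, expressed as such a request:
request = FoleyRequest(
    video_path="cat_meowing.mp4",
    text_prompt="a lion roaring",
)
```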

Video: Adobe

The system stands out for its ability to generate full-bandwidth, high-quality audio at a 48 kHz sampling rate. The researchers achieved this by training the AI on both internet videos and professional sound effect libraries.

MultiFoley is the first system to combine multiple input methods (text, audio, and video references) in a single model. It keeps video and generated audio tightly synchronized through a specialized mechanism that extracts visual features at 8 frames per second and upsamples them to the 40 Hz frame rate of the internal audio representation.
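
As a rough illustration of that alignment step, the sketch below repeats each 8 fps visual feature five times so the sequence lines up with a 40 Hz audio timeline. The shapes, names, and the plain repetition strategy are assumptions for clarity; the paper's actual conditioning mechanism is more involved.

```python
# Minimal sketch (not the authors' code): aligning 8 fps visual features
# with a 40 Hz audio-feature timeline by repeating each frame's vector.
import numpy as np

VIDEO_FPS = 8       # visual features extracted at 8 frames per second
AUDIO_RATE_HZ = 40  # frame rate of the audio representation

def upsample_visual_features(feats: np.ndarray) -> np.ndarray:
    """Repeat each per-frame feature so the sequence matches the 40 Hz timeline.

    feats: array of shape (num_video_frames, feature_dim) at 8 fps.
    Returns an array of shape (num_video_frames * 5, feature_dim) at 40 Hz.
    """
    factor = AUDIO_RATE_HZ // VIDEO_FPS  # 40 / 8 = 5 audio steps per video frame
    return np.repeat(feats, factor, axis=0)

# Example: 10 seconds of video -> 80 visual frames -> 400 audio-rate steps
visual = np.random.randn(80, 512).astype(np.float32)
aligned = upsample_visual_features(visual)
print(aligned.shape)  # (400, 512)
```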

Figure: Spectrograms for a singing bird and a typewriter, each with three generated variations. MultiFoley can generate sound effects, from birdsong to typewriter clicks, and sync them with video footage by following simple text commands. | Image: Chen et al.

The system achieves an average synchronization offset of 0.8 seconds, significantly better than previous systems, which typically lagged by more than a second.
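
One simple way to read that number: measure the average gap between when a sound starts in the generated audio and when the corresponding event happens on screen. The sketch below shows such a calculation with made-up onset times; it is an assumption about the metric, not the paper's evaluation code.

```python
# Illustrative sketch of a synchronization-offset measurement:
# compare onset times in the generated audio against reference
# onset times taken from the video (e.g. the moment of visual impact).
import numpy as np

def onset_offset_seconds(generated_onsets: np.ndarray,
                         reference_onsets: np.ndarray) -> float:
    """Mean absolute gap (in seconds) between matched onset pairs."""
    n = min(len(generated_onsets), len(reference_onsets))
    return float(np.mean(np.abs(generated_onsets[:n] - reference_onsets[:n])))

# Example: three visual impacts vs. the onsets detected in generated audio
reference = np.array([1.00, 2.50, 4.00])
generated = np.array([1.70, 3.20, 4.90])
print(onset_offset_seconds(generated, reference))  # ~0.77 s average offset
```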

Testing shows major improvements in sound quality and timing

In tests against existing systems, MultiFoley delivered superior audio-video synchronization and a better match between generated sounds and text descriptions. In a user study, 85.8 percent of participants rated MultiFoley's semantic consistency higher than that of the next-best system, and 94.5 percent preferred its synchronization.

The radar chart compares eight audio generation methods across six metrics (FAD@AUD, FAD@VGG, AV-Sync, CLAP, ImageBind, KLD); MultiFoley (blue) comes out ahead on most of them. | Image: Chen et al.

The researchers note some current limitations. The system's training data was relatively small, limiting its range of sound effects. It also struggles with generating multiple simultaneous sounds.

The team plans to release the source code and models soon. While Adobe hasn't announced plans to add MultiFoley to its products, the technology would fit naturally alongside the AI capabilities already present in Adobe's Premiere Pro video editing software. The system could benefit individual creators as well as production companies looking to streamline their sound design process.

Summary
  • Researchers at the University of Michigan and Adobe Research have developed MultiFoley, an AI system that generates movie sounds based on text prompts, reference audio, or video samples.
  • MultiFoley can generate high-quality, full-bandwidth audio and precisely synchronize it with video, with an average offset of 0.8 seconds.
  • In tests and user studies, MultiFoley surpassed existing systems in audio-video synchronization and semantic agreement, showing great potential for use in film production, game development, and other creative fields.