Deepmind's Video-to-Audio (V2A) technology combines video pixels with text prompts to generate audio tracks with dialogue, sound effects, and music for silent videos.

Google Deepmind has introduced a generative AI model for Video-to-Audio (V2A). The technology combines video pixels with natural-language instructions to generate detailed soundtracks for silent videos.

V2A can be used in combination with video generation models such as Deepmind's Veo, or competing models such as Sora, KLING, or Gen 3, to add dramatic music, realistic sound effects, or dialogue that matches the characters and mood of a video. The technology can also add sound to traditional material such as archival footage and silent films. Its real power lies in the ability to create an unlimited number of soundtracks for any video input.

Optional text prompts provide additional control: positive prompts steer the output toward desired sounds, while negative prompts steer it away from unwanted ones, an approach that is also common in image generation.

Prompt for audio: Cars skidding, car engine throttling, angelic electronic music
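
Deepmind has not published how V2A implements this prompt control, but the description resembles the classifier-free guidance used by many image diffusion models. The sketch below illustrates that general idea only; predict_noise and all of its arguments are hypothetical placeholders, not a published V2A API.

```python
def guided_noise_estimate(predict_noise, audio_latent, video_features,
                          positive_emb, negative_emb, guidance_scale=4.0):
    """Illustrative guidance step combining a positive and a negative prompt.
    Every argument is a hypothetical stand-in for a learned component."""
    eps_pos = predict_noise(audio_latent, video_features, positive_emb)
    eps_neg = predict_noise(audio_latent, video_features, negative_emb)
    # Push the sample toward what the positive prompt describes and away
    # from what the negative prompt describes.
    return eps_neg + guidance_scale * (eps_pos - eps_neg)
```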

V2A model not available for the time being

Deepmind's model is diffusion-based, which the team says provides the most realistic and convincing results for synchronizing video and audio.

The V2A system first encodes the video input into a compressed representation. Then the diffusion model gradually refines the audio from random noise, guided by the visual input and text prompts. Finally, the audio output is decoded, converted to an audio waveform, and combined with the video data.
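
As a rough illustration of that pipeline, here is a minimal Python sketch. The three components it assumes (encode_video, denoise_step, decode_to_waveform) are hypothetical placeholders for the learned networks described above; Deepmind's actual model is not public.

```python
import numpy as np

def generate_audio(video_frames, prompt_embedding, encode_video,
                   denoise_step, decode_to_waveform,
                   num_steps=50, latent_dim=256, seed=0):
    """Hedged sketch of the described V2A pipeline, not DeepMind's code."""
    rng = np.random.default_rng(seed)

    # 1. Encode the video input into a compressed representation.
    video_latent = encode_video(video_frames)

    # 2. Start from random noise and iteratively refine the audio latent,
    #    guided by the visual representation and the text prompt.
    audio_latent = rng.standard_normal(latent_dim)
    for step in reversed(range(num_steps)):
        audio_latent = denoise_step(audio_latent, step, video_latent, prompt_embedding)

    # 3. Decode the refined latent into an audio waveform that is then
    #    combined with the video data.
    return decode_to_waveform(audio_latent)
```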

To improve audio quality, Deepmind added additional information to the training process, including AI-generated descriptions of sounds and transcriptions of spoken dialog. In this way, V2A learns to associate certain audio events with different visual scenes and to respond to the information contained in the descriptions or transcripts.
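
One way to picture those enriched training examples is as a record that pairs each clip with its soundtrack, an AI-generated sound caption, and a dialogue transcript. The field names below are invented for illustration; Deepmind has not published its data format.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedClip:
    """Hypothetical training record carrying the extra annotations described above."""
    video_frames: list                  # frames of the video clip
    audio_waveform: list                # ground-truth soundtrack
    sound_description: str = ""         # AI-generated caption of the audio events
    dialogue_transcript: str = ""       # transcription of any spoken dialogue

# Dummy example, for illustration only
clip = AnnotatedClip(
    video_frames=[],
    audio_waveform=[],
    sound_description="tires screeching, engine revving, electronic music",
    dialogue_transcript="",
)
```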

However, there are some limitations. The quality of the audio output depends on the quality of the video input: artifacts or distortions that fall outside the model's training distribution can significantly degrade the generated audio. Lip synchronization for videos containing speech also remains unreliable.

V2A is not yet available. Deepmind is gathering feedback from leading creatives and filmmakers to ensure the technology "can have a positive impact on the creative community." Before wider access is considered, the company says, V2A will undergo rigorous safety assessments and testing.

Summary
  • Google Deepmind has developed a video-to-audio (V2A) AI model that can generate soundtracks of dialogue, sound effects, and music for silent videos by combining video pixels with text instructions.
  • V2A is based on a diffusion model and can be used in conjunction with video generation models to generate an unlimited number of soundtracks for videos. Text instructions can also be used to control the audio output.
  • The system first encodes the video, then the diffusion model gradually refines the audio from noise using the visual data and text prompts. However, the quality of the audio depends on the quality of the video, and lip synchronization is still imperfect. V2A is currently being tested and is not yet publicly available.