DeepMind's Video-to-Audio (V2A) technology combines video pixels with text prompts to generate audio tracks with dialogue, sound effects, and music for silent videos.
Google DeepMind has introduced Video-to-Audio (V2A), a generative AI model that pairs video pixels with natural-language instructions to produce detailed soundtracks for silent videos.
V2A can be combined with video generation models such as DeepMind's Veo, or with competitors like Sora, KLING, or Gen 3, to add dramatic music, realistic sound effects, or dialogue that matches the characters and mood of a video. The technology can also add sound to traditional material such as archival footage and silent films. Its real power lies in its ability to create an unlimited number of soundtracks for any video input.
Additional control comes from optional positive prompts, which steer the output toward desired sounds, and negative prompts, which steer it away from unwanted ones, a pattern familiar from image generation models.
Prompt for audio: Cars skidding, car engine throttling, angelic electronic music
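The article does not describe how V2A combines positive and negative prompts internally. In diffusion models for image generation, this is commonly done with classifier-free guidance: the denoiser is run once conditioned on the positive prompt and once on the negative prompt, and the two predictions are extrapolated away from the negative direction. A minimal sketch of that combination rule, with toy arrays standing in for real model outputs (all names here are hypothetical, not DeepMind's API):

```python
import numpy as np

def guided_noise_estimate(pred_pos, pred_neg, guidance_scale=3.0):
    """Classifier-free-guidance-style combination: push the denoising
    direction toward the positive prompt and away from the negative one.
    A larger guidance_scale means stronger adherence to the positive prompt."""
    return pred_neg + guidance_scale * (pred_pos - pred_neg)

# Toy 1-D "noise predictions" standing in for real denoiser outputs.
pred_pos = np.array([0.8, 0.2, -0.1])  # conditioned on the positive prompt
pred_neg = np.array([0.5, 0.4, 0.3])   # conditioned on the negative prompt

print(guided_noise_estimate(pred_pos, pred_neg))  # → [ 1.4 -0.2 -0.9]
```

Whether V2A uses exactly this mechanism is an assumption; the sketch only illustrates the general technique the prompt interface suggests.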
V2A model not available for the time being
DeepMind's model is diffusion-based, an approach the team says delivers the most realistic and convincing results for synchronizing video and audio.
The V2A system first encodes the video input into a compressed representation. Then the diffusion model gradually refines the audio from random noise, guided by the visual input and text prompts. Finally, the audio output is decoded, converted to an audio waveform, and combined with the video data.
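The three stages above (encode the video, iteratively denoise an audio latent under visual and text conditioning, then decode to a waveform) can be sketched as a toy pipeline. Everything here is a stand-in: the encoder, denoiser, and decoder are trivial placeholder functions, not DeepMind's actual components, and serve only to show how the stages connect:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_video(frames):
    """Stand-in video encoder: compress each frame to one feature value."""
    return frames.reshape(frames.shape[0], -1).mean(axis=1)

def denoise_step(audio_latent, video_features, text_features):
    """Stand-in denoiser: nudge the noisy latent toward the conditioning.
    A real model would predict and remove noise with a learned network."""
    conditioning = 0.5 * video_features.mean() + 0.5 * text_features.mean()
    return audio_latent + 0.1 * (conditioning - audio_latent)

def decode_audio(latent, num_samples=16000):
    """Stand-in decoder: expand the refined latent into a waveform in [-1, 1]."""
    return np.tanh(np.repeat(latent, num_samples // latent.shape[0]))

frames = rng.random((8, 4, 4))        # 8 tiny "video frames"
text_features = rng.random(16)        # embedded text prompt
video_features = encode_video(frames)

latent = rng.standard_normal(8)       # start from pure random noise
for _ in range(50):                   # gradual iterative refinement
    latent = denoise_step(latent, video_features, text_features)

waveform = decode_audio(latent)       # audio output, ready to pair with video
```

The structure mirrors the description in the text; the actual V2A architecture, latent sizes, and step counts are not public.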
To improve audio quality, DeepMind added extra information to the training process, including AI-generated descriptions of sounds and transcriptions of spoken dialogue. In this way, V2A learns to associate specific audio events with particular visual scenes and to respond to the information contained in the descriptions or transcripts.
However, there are some limitations: the quality of the audio output depends on the quality of the video input. Artifacts or distortions in the video that fall outside the model's training distribution can cause significant degradation in audio quality, and lip sync for videos with speech is still erratic.
V2A is not yet available. DeepMind is gathering feedback from leading creatives and filmmakers to ensure V2A "can have a positive impact on the creative community." Before wider access is considered, the company says, V2A will undergo rigorous safety assessments and testing.