DeepMind's Video-to-Audio (V2A) technology combines video pixels with text prompts to generate audio tracks with dialogue, sound effects, and music for silent videos.
Google DeepMind has introduced Video-to-Audio (V2A), a generative AI model that pairs video pixels with natural-language instructions to produce detailed soundtracks for silent videos.
V2A can be combined with video generation models such as DeepMind's Veo, or with competitors like Sora, KLING, or Gen 3, to add dramatic music, realistic sound effects, or dialogue that matches the characters and mood of a video. The technology can also add sound to traditional material such as archival footage and silent films. Its real power lies in its ability to create an unlimited number of soundtracks for any video input.
Additional control comes from optional positive prompts, which steer the output toward desired sounds, and negative prompts, which steer it away from unwanted ones, a pattern familiar from image generation models.
Prompt for audio: Cars skidding, car engine throttling, angelic electronic music
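The article does not describe how V2A combines positive and negative prompts internally. In diffusion models for image generation, this is commonly done with classifier-free guidance: the denoiser is run once conditioned on the positive prompt and once on the negative prompt, and the two predictions are extrapolated away from the negative direction. A minimal sketch of that combination rule, with toy arrays standing in for real model outputs (all names here are hypothetical, not DeepMind's API):

```python
import numpy as np

def guided_noise_estimate(pred_pos, pred_neg, guidance_scale=3.0):
    """Classifier-free-guidance-style combination: push the denoising
    direction toward the positive prompt and away from the negative one.
    A larger guidance_scale means stronger adherence to the positive prompt."""
    return pred_neg + guidance_scale * (pred_pos - pred_neg)

# Toy 1-D "noise predictions" standing in for real denoiser outputs.
pred_pos = np.array([0.8, 0.2, -0.1])  # conditioned on the positive prompt
pred_neg = np.array([0.5, 0.4, 0.3])   # conditioned on the negative prompt

print(guided_noise_estimate(pred_pos, pred_neg))  # → [ 1.4 -0.2 -0.9]
```

Whether V2A uses exactly this mechanism is an assumption; the sketch only illustrates the general technique the prompt interface suggests.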
V2A model not available for the time being
DeepMind's model is diffusion-based, an approach the team says delivers the most realistic and convincing results for synchronizing video and audio.
The V2A system first encodes the video input into a compressed representation. Then the diffusion model gradually refines the audio from random noise, guided by the visual input and text prompts. Finally, the audio output is decoded, converted to an audio waveform, and combined with the video data.
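The three stages above (encode the video, iteratively denoise an audio latent under visual and text conditioning, then decode to a waveform) can be sketched as a toy pipeline. Everything here is a stand-in: the encoder, denoiser, and decoder are trivial placeholder functions, not DeepMind's actual components, and serve only to show how the stages connect:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_video(frames):
    """Stand-in video encoder: compress each frame to one feature value."""
    return frames.reshape(frames.shape[0], -1).mean(axis=1)

def denoise_step(audio_latent, video_features, text_features):
    """Stand-in denoiser: nudge the noisy latent toward the conditioning.
    A real model would predict and remove noise with a learned network."""
    conditioning = 0.5 * video_features.mean() + 0.5 * text_features.mean()
    return audio_latent + 0.1 * (conditioning - audio_latent)

def decode_audio(latent, num_samples=16000):
    """Stand-in decoder: expand the refined latent into a waveform in [-1, 1]."""
    return np.tanh(np.repeat(latent, num_samples // latent.shape[0]))

frames = rng.random((8, 4, 4))        # 8 tiny "video frames"
text_features = rng.random(16)        # embedded text prompt
video_features = encode_video(frames)

latent = rng.standard_normal(8)       # start from pure random noise
for _ in range(50):                   # gradual iterative refinement
    latent = denoise_step(latent, video_features, text_features)

waveform = decode_audio(latent)       # audio output, ready to pair with video
```

The structure mirrors the description in the text; the actual V2A architecture, latent sizes, and step counts are not public.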
To improve audio quality, DeepMind added extra information to the training process, including AI-generated descriptions of sounds and transcriptions of spoken dialogue. In this way, V2A learns to associate specific audio events with particular visual scenes and to respond to the information contained in the descriptions or transcripts.
However, there are some limitations: the quality of the audio output depends on the quality of the video input. Artifacts or distortions in the video that fall outside the model's training distribution can cause significant degradation in audio quality, and lip sync for videos with speech is still erratic.
V2A is not yet available. DeepMind is gathering feedback from leading creatives and filmmakers to ensure V2A "can have a positive impact on the creative community." Before wider access is considered, the company says, V2A will undergo rigorous safety assessments and testing.