Meta brings Segment Anything to audio, letting editors pull sounds from video with a click or text prompt
Key Points
- Meta has introduced SAM Audio, an AI model that separates individual sound sources from audio mixes; users control the process with text commands, clicks on video elements, or timestamps.
- The system combines visual and audio information to selectively isolate voices, instruments, or background noises from real-world recordings.
- While Meta has developed new benchmarks and an automatic evaluation model to assess quality, the system currently lacks audio prompt functionality and still struggles to distinguish between similar-sounding sources.
After images and 3D models, Meta is bringing its "Segment Anything" approach to audio. The new AI model SAM Audio isolates individual sound sources from complex mixtures using text commands, time markers, or visual clicks.
Meta says the system is the first unified model that handles this task across different input methods. Instead of requiring separate tools for each use case, it responds flexibly to whatever type of command users throw at it.
The system offers three control methods that can be combined. Users can type text commands like "dog barking" or "singing voice" to isolate specific sounds. They can click directly on an object or person in a video to pull out the matching audio. Or they can use time markers—called span prompts—to pinpoint segments where a target sound occurs.
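To make the three prompt types concrete, here is a minimal sketch of how they might be represented and passed to a separation call. The class names and the `separate()` stub are illustrative assumptions for this article, not Meta's published SAM Audio API.

```python
from dataclasses import dataclass

# Illustrative only: these types and the separate() stub are assumptions,
# not part of Meta's released SAM Audio code.

@dataclass
class TextPrompt:
    description: str          # e.g. "dog barking" or "singing voice"

@dataclass
class ClickPrompt:
    frame_index: int          # video frame the user clicked on
    x: float                  # click position, normalized to [0, 1]
    y: float

@dataclass
class SpanPrompt:
    start_sec: float          # time range in which the target sound occurs
    end_sec: float

Prompt = TextPrompt | ClickPrompt | SpanPrompt

def separate(audio_mix, prompts: list[Prompt]):
    """Hypothetical entry point: returns (target_audio, residual_audio).

    A real model would combine all given prompts; this stub only shows
    the dispatch structure."""
    for p in prompts:
        if isinstance(p, TextPrompt):
            print(f"conditioning on text: {p.description!r}")
        elif isinstance(p, ClickPrompt):
            print(f"conditioning on click at ({p.x:.2f}, {p.y:.2f}) in frame {p.frame_index}")
        elif isinstance(p, SpanPrompt):
            print(f"conditioning on span {p.start_sec}s-{p.end_sec}s")
    return audio_mix, audio_mix   # placeholder outputs

# Example: isolate a barking dog the user clicked on around the 3-5 second mark.
target, residual = separate(
    audio_mix=None,
    prompts=[TextPrompt("dog barking"), ClickPrompt(90, 0.4, 0.6), SpanPrompt(3.0, 5.0)],
)
```

Because the prompts are independent objects, they can be freely combined, which matches Meta's claim that one unified model handles all input methods.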
Meta sees potential applications in music production, podcasting, and film editing; filtering traffic noise from an exterior shot, for example, or separating instruments in a recording.
Perception Encoder bridges image and sound
SAM Audio runs on a generative modeling framework using a flow-matching diffusion transformer. This processes the audio mix alongside input commands to generate both the desired audio track and the residual sounds.
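For intuition on how flow matching generates audio, here is a minimal sampling loop: a learned velocity field is integrated from Gaussian noise toward the separated audio, conditioned on the mixture and prompt features. The tiny network and Euler integrator below are a generic sketch of the technique, not SAM Audio's actual architecture.

```python
import torch
import torch.nn as nn

# Generic flow-matching sketch, not Meta's model: a velocity network
# v(x_t, t, cond) is integrated with Euler steps from noise (t=0)
# toward the separated-audio latent (t=1).

class TinyVelocityNet(nn.Module):
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x_t, t, cond):
        # t is a scalar in [0, 1]; broadcast it to every sample in the batch.
        t_feat = t.expand(x_t.shape[0], 1)
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

@torch.no_grad()
def sample(model, cond, dim=64, steps=32):
    """Euler integration of dx/dt = v(x_t, t, cond) from noise to the target latent."""
    x = torch.randn(cond.shape[0], dim)      # start from Gaussian noise at t=0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.tensor([[i * dt]])
        x = x + dt * model(x, t, cond)       # one Euler step along the learned flow
    return x                                  # separated-audio latent at t=1

# Conditioning could be, for example, an embedding of the mixture plus the prompt.
model = TinyVelocityNet(dim=64, cond_dim=32)
cond = torch.randn(4, 32)                     # stand-in for mix + prompt features
target_latent = sample(model, cond)
print(target_latent.shape)                    # torch.Size([4, 64])
```

In SAM Audio the same conditioned generation yields both outputs the article mentions: the requested sound and the residual mix without it.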
A key component is the Perception Encoder Audiovisual (PE-AV). This model builds on the Perception Encoder released in April and extends its computer vision capabilities to audio. Meta describes PE-AV as the "ears" that work with SAM Audio's "brain" to tackle complex segmentation tasks.
The system extracts features at the individual frame level and aligns them precisely with audio signals. This tight temporal sync lets SAM Audio separate sound sources that are visually anchored, like a speaker visible on screen. Without this synchronization, the model would lack the fine visual understanding needed to isolate sounds flexibly and accurately. Meta says PE-AV was trained on more than 100 million videos.
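The frame-level alignment can be pictured as matching each audio feature to the video frame closest in time. The rates and feature sizes below are assumptions for illustration, not PE-AV's actual configuration.

```python
import numpy as np

# Minimal alignment sketch: pair per-frame video features with audio features
# by timestamp. Rates and dimensions are illustrative assumptions.

video_fps = 25.0          # one visual feature per video frame
audio_hop_hz = 50.0       # audio features extracted 50 times per second

video_feats = np.random.randn(250, 512)   # 10 s of video features (250 frames)
audio_feats = np.random.randn(500, 512)   # 10 s of audio features (500 steps)

# For each audio step, pick the video frame closest in time, so visually
# anchored cues (e.g. a speaker on screen) line up with the sound they emit.
audio_times = np.arange(audio_feats.shape[0]) / audio_hop_hz
frame_index = np.clip(np.round(audio_times * video_fps).astype(int),
                      0, video_feats.shape[0] - 1)
aligned_video = video_feats[frame_index]           # shape (500, 512)

# Fused audiovisual features, one per audio time step.
fused = np.concatenate([audio_feats, aligned_video], axis=-1)
print(fused.shape)                                  # (500, 1024)
```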
The model scales efficiently and comes in variants ranging from 500 million to 3 billion parameters. According to the developers, it processes audio faster than real time.
New benchmarks for audio separation
Meta is rolling out two new tools to measure model performance: SAM Audio-Bench and SAM Audio Judge. Traditional audio separation metrics often require clean reference tracks for comparison, something that's rarely available in real-world scenarios.
SAM Audio Judge is an automatic evaluation model that assesses segmentation quality without a reference track. It's designed to mimic human perception, scoring criteria like fidelity and precision. This makes it particularly useful for benchmarks that aim to reflect actual listening experiences.
SAM Audio-Bench covers different audio domains, including speech, music, and sound effects. Unlike previous datasets, it uses real audio and video sources instead of purely synthetic mixes, providing a more realistic foundation for evaluation.
Limitations and availability
SAM Audio doesn't accept audio files as prompts yet. Separating very similar sounds, such as isolating a single singer from a choir or picking out one instrument from an orchestra, also remains a challenge, Meta says.
The model is available to try in the Segment Anything Playground, and Meta has published code and weights. The company is also partnering with Starkey, a US hearing aid manufacturer, and the startup incubator 2gether-International to explore accessibility applications.
Meta recently introduced SAM 3, the third generation of its segmentation model, which analyzes images and videos using open text prompts instead of rigid categories. The system features Promptable Concept Segmentation to flexibly isolate visual concepts. Alongside it, SAM 3D was released, reconstructing spatial objects and human poses from simple 2D images and expanding the AI's understanding of the physical world.