Stability AI and Arm have released a compact text-to-audio model that runs on smartphones, capable of generating stereo audio clips up to 11 seconds long in about 7 seconds.
Called Stable Audio Open Small, the model is based on a technique known as "Adversarial Relativistic-Contrastive" (ARC) training, developed by researchers at the University of California, Berkeley and others. On high-end hardware like an Nvidia H100 GPU, it can produce 44 kHz stereo audio in just 75 milliseconds, fast enough for near real-time generation.
The original version of Stable Audio Open launched last year as a free, open-source model with 1.1 billion parameters. This smaller version uses just 341 million parameters, making it significantly easier to run on consumer hardware. Stability AI and Arm first announced their collaboration in March.
Designed for mobile hardware
To make the model work on smartphones, the team overhauled the architecture. The system now consists of three components: an autoencoder that compresses the audio data, an embedding module that interprets text prompts, and a diffusion model that generates the final audio.
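To illustrate how those three components hand data to one another, here is a minimal sketch of the pipeline: text prompt to conditioning embedding, embedding to denoised latent, latent to stereo waveform. All class names, dimensions, and the toy "denoising" math are illustrative assumptions, not Stability AI's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class TextEmbedder:
    """Maps a text prompt to a fixed-size conditioning vector (stand-in for a real text encoder)."""
    def __init__(self, dim=768):
        self.dim = dim
    def __call__(self, prompt: str) -> np.ndarray:
        # Deterministic per-prompt vector; a real model would run a neural text encoder here.
        seed = abs(hash(prompt)) % (2**32)
        return np.random.default_rng(seed).standard_normal(self.dim)

class DiffusionModel:
    """Iteratively refines a noisy latent, conditioned on the text embedding."""
    def __init__(self, latent_shape=(64, 256), steps=8):
        self.latent_shape, self.steps = latent_shape, steps
    def generate(self, cond: np.ndarray) -> np.ndarray:
        latent = rng.standard_normal(self.latent_shape)
        for _ in range(self.steps):
            # A real model predicts and removes noise with a network; this just nudges
            # the latent toward the conditioning vector to show the data flow.
            latent = latent * 0.9 + 0.1 * cond[: self.latent_shape[0], None]
        return latent

class AudioAutoencoder:
    """Decodes the compact latent back into a stereo waveform."""
    def __init__(self, upsample=512):
        self.upsample = upsample
    def decode(self, latent: np.ndarray) -> np.ndarray:
        # Each latent frame expands to `upsample` audio samples; 2 channels = stereo.
        frames = latent.shape[1] * self.upsample
        return np.tanh(rng.standard_normal((2, frames)) * latent.mean())

embed = TextEmbedder()
diffuse = DiffusionModel()
decoder = AudioAutoencoder()

cond = embed("rain on a tin roof")
latent = diffuse.generate(cond)
audio = decoder.decode(latent)
print(audio.shape)  # stereo channels x samples
```

The point of the split is that the diffusion model only ever works in the autoencoder's compressed latent space, which is what keeps memory and compute low enough for a phone.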
This redesigned setup doesn't rely on distillation, but still cuts memory usage nearly in half—from 6.5 GB down to 3.6 GB. That reduction makes it possible to run the model on mobile devices for the first time. During testing, researchers used the Vivo X200 Pro, an Android phone with 12 GB of RAM and a Mediatek Dimensity 9400 chip, released in late 2024.
Best suited for sound effects
Stability AI says the model is especially good at generating sound effects and field recordings. It still struggles with music, particularly with singing voices, and works best with English-language prompts.
The model was trained on roughly 472,000 clips from the Freesound database, using only material licensed under CC0, CC-BY, or CC-Sampling+ terms. To avoid copyright issues, the team filtered the data using a series of automated checks.
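A license filter of the kind described above can be sketched in a few lines. The field names and record layout here are illustrative assumptions, not Freesound's actual metadata schema; only the three allowed license families come from the source.

```python
# Licenses the training pipeline accepts, per the description above.
ALLOWED_LICENSES = {"CC0", "CC-BY", "CC-Sampling+"}

# Hypothetical clip metadata records (schema is illustrative).
clips = [
    {"id": 1, "license": "CC0"},
    {"id": 2, "license": "CC-BY-NC"},   # non-commercial: excluded
    {"id": 3, "license": "CC-Sampling+"},
    {"id": 4, "license": "CC-BY"},
]

def filter_by_license(records, allowed=ALLOWED_LICENSES):
    """Keep only clips whose license is in the allowed set."""
    return [r for r in records if r["license"] in allowed]

kept = filter_by_license(clips)
print([r["id"] for r in kept])  # → [1, 3, 4]
```

In practice such a check would be one stage among several automated filters, alongside deduplication and metadata validation.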
The software is available under the Stability AI Community License for open-source use. Commercial applications are subject to separate terms. The code is on GitHub, and model weights can be accessed via Hugging Face.