Stability AI and Arm have released a compact text-to-audio model that runs on smartphones, capable of generating stereo audio clips up to 11 seconds long in about 7 seconds.
Called Stable Audio Open Small, the model is based on a technique known as "Adversarial Relativistic-Contrastive" (ARC) training, developed by researchers at the University of California, Berkeley and others. On high-end hardware like an Nvidia H100 GPU, it can produce 44 kHz stereo audio in just 75 milliseconds, fast enough for near real-time generation.
The original version of Stable Audio Open launched last year as a free, open-source model with 1.1 billion parameters. This smaller version uses just 341 million parameters, making it significantly easier to run on consumer hardware. Stability AI and Arm first announced their collaboration in March.
Designed for mobile hardware
To make the model work on smartphones, the team overhauled the architecture. The system now consists of three components: an autoencoder that compresses the audio data, an embedding module that interprets text prompts, and a diffusion model that generates the final audio.
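To illustrate how those three components hand data to one another, here is a minimal sketch of the pipeline: text prompt to conditioning embedding, embedding to denoised latent, latent to stereo waveform. All class names, dimensions, and the toy "denoising" math are illustrative assumptions, not Stability AI's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class TextEmbedder:
    """Maps a text prompt to a fixed-size conditioning vector (stand-in for a real text encoder)."""
    def __init__(self, dim=768):
        self.dim = dim
    def __call__(self, prompt: str) -> np.ndarray:
        # Deterministic per-prompt vector; a real model would run a neural text encoder here.
        seed = abs(hash(prompt)) % (2**32)
        return np.random.default_rng(seed).standard_normal(self.dim)

class DiffusionModel:
    """Iteratively refines a noisy latent, conditioned on the text embedding."""
    def __init__(self, latent_shape=(64, 256), steps=8):
        self.latent_shape, self.steps = latent_shape, steps
    def generate(self, cond: np.ndarray) -> np.ndarray:
        latent = rng.standard_normal(self.latent_shape)
        for _ in range(self.steps):
            # A real model predicts and removes noise with a network; this just nudges
            # the latent toward the conditioning vector to show the data flow.
            latent = latent * 0.9 + 0.1 * cond[: self.latent_shape[0], None]
        return latent

class AudioAutoencoder:
    """Decodes the compact latent back into a stereo waveform."""
    def __init__(self, upsample=512):
        self.upsample = upsample
    def decode(self, latent: np.ndarray) -> np.ndarray:
        # Each latent frame expands to `upsample` audio samples; 2 channels = stereo.
        frames = latent.shape[1] * self.upsample
        return np.tanh(rng.standard_normal((2, frames)) * latent.mean())

embed = TextEmbedder()
diffuse = DiffusionModel()
decoder = AudioAutoencoder()

cond = embed("rain on a tin roof")
latent = diffuse.generate(cond)
audio = decoder.decode(latent)
print(audio.shape)  # stereo channels x samples
```

The point of the split is that the diffusion model only ever works in the autoencoder's compressed latent space, which is what keeps memory and compute low enough for a phone.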
This redesigned setup doesn't rely on distillation, but still cuts memory usage nearly in half—from 6.5 GB down to 3.6 GB. That reduction makes it possible to run the model on mobile devices for the first time. During testing, researchers used the Vivo X200 Pro, an Android phone with 12 GB of RAM and a Mediatek Dimensity 9400 chip, released in late 2024.
Best suited for sound effects
Stability AI says the model is especially good at generating sound effects and field recordings. It still struggles with music, particularly with singing voices, and works best with English-language prompts.
The model was trained on roughly 472,000 clips from the Freesound database, using only material licensed under CC0, CC-BY, or CC-Sampling+ terms. To avoid copyright issues, the team filtered the data using a series of automated checks.
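A license filter of the kind described above can be sketched in a few lines. The field names and record layout here are illustrative assumptions, not Freesound's actual metadata schema; only the three allowed license families come from the source.

```python
# Licenses the training pipeline accepts, per the description above.
ALLOWED_LICENSES = {"CC0", "CC-BY", "CC-Sampling+"}

# Hypothetical clip metadata records (schema is illustrative).
clips = [
    {"id": 1, "license": "CC0"},
    {"id": 2, "license": "CC-BY-NC"},   # non-commercial: excluded
    {"id": 3, "license": "CC-Sampling+"},
    {"id": 4, "license": "CC-BY"},
]

def filter_by_license(records, allowed=ALLOWED_LICENSES):
    """Keep only clips whose license is in the allowed set."""
    return [r for r in records if r["license"] in allowed]

kept = filter_by_license(clips)
print([r["id"] for r in kept])  # → [1, 3, 4]
```

In practice such a check would be one stage among several automated filters, alongside deduplication and metadata validation.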
The software is available under the Stability AI Community License for open-source use. Commercial applications are subject to separate terms. The code is on GitHub, and model weights can be accessed via Hugging Face.