
Stability AI and Arm have released a compact text-to-audio model that runs on smartphones, capable of generating stereo audio clips up to 11 seconds long in about 7 seconds.


Called Stable Audio Open Small, the model is based on a technique known as "Adversarial Relativistic-Contrastive" (ARC), developed by researchers at the University of California, Berkeley and others. On high-end hardware like an Nvidia H100 GPU, it can produce 44.1 kHz stereo audio in just 75 milliseconds, fast enough for near real-time generation.
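To put the reported speeds in perspective, the short Python sketch below computes a real-time factor, meaning seconds of audio produced per second of compute. It assumes the 75-millisecond H100 figure refers to a full-length 11-second clip, which the announcement does not state explicitly; the function name is purely illustrative.

```python
def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Seconds of audio produced per second of compute (>1 is faster than real time)."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


CLIP_SECONDS = 11.0  # maximum clip length reported for the model

# About 7 seconds of generation time on a smartphone.
phone_rtf = real_time_factor(CLIP_SECONDS, 7.0)

# 75 ms on an Nvidia H100 (ASSUMPTION: measured for a full-length clip).
h100_rtf = real_time_factor(CLIP_SECONDS, 0.075)

print(f"smartphone: {phone_rtf:.2f}x real time")
print(f"H100: {h100_rtf:.0f}x real time")
```

Under these assumptions, the phone generates audio roughly one and a half times faster than it plays back, while the H100 is faster by about two orders of magnitude.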

The original version of Stable Audio Open launched last year as a free, open-source model with 1.1 billion parameters. This smaller version uses just 341 million parameters, making it significantly easier to run on consumer hardware. Stability AI and Arm first announced their collaboration in March.

Designed for mobile hardware

To make the model work on smartphones, the team overhauled the architecture. The system now consists of three components: an autoencoder that compresses the audio data, an embedding module that interprets text prompts, and a diffusion model that generates the final audio.
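How the three components hand data to one another can be illustrated with a toy Python sketch. This is not Stability AI's code: every class name, dimension, and the placeholder arithmetic are hypothetical stand-ins, chosen only to show the flow from text prompt to embedding, from embedding to a generated latent, and from latent to stereo audio via the decoder half of the autoencoder.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]
StereoFrame = Tuple[float, float]


@dataclass
class TextEmbedder:
    """Hypothetical stand-in for the embedding module."""
    dim: int = 8

    def embed(self, prompt: str) -> Vector:
        rng = random.Random(prompt)  # deterministic per prompt, illustrative only
        return [rng.uniform(-1.0, 1.0) for _ in range(self.dim)]


@dataclass
class LatentDiffusion:
    """Hypothetical stand-in for the diffusion model working in latent space."""
    steps: int = 8

    def sample(self, conditioning: Vector, latent_len: int, seed: int = 0) -> Vector:
        rng = random.Random(seed)
        latent = [rng.gauss(0.0, 1.0) for _ in range(latent_len)]  # start from noise
        target = sum(conditioning) / len(conditioning)
        for _ in range(self.steps):
            # Placeholder "denoising": nudge values toward a prompt-dependent
            # target. A real model would predict this update with a network.
            latent = [x + 0.5 * (target - x) for x in latent]
        return latent


@dataclass
class AutoencoderDecoder:
    """Hypothetical stand-in for the autoencoder's decoder."""
    upsample: int = 4

    def decode(self, latent: Vector) -> List[StereoFrame]:
        frames: List[StereoFrame] = []
        for value in latent:
            # Expand each latent value into several (left, right) samples.
            frames.extend([(value, value)] * self.upsample)
        return frames


def generate(prompt: str, latent_len: int = 16) -> List[StereoFrame]:
    conditioning = TextEmbedder().embed(prompt)
    latent = LatentDiffusion().sample(conditioning, latent_len)
    return AutoencoderDecoder().decode(latent)


audio = generate("rain on a tin roof")
print(len(audio), "stereo frames")
```

The point of the split is that the diffusion model only has to work on the short compressed sequence, while the autoencoder handles the conversion to and from full-resolution audio.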


This redesigned setup doesn't rely on distillation, but still cuts memory usage nearly in half, from 6.5 GB down to 3.6 GB. That reduction makes it possible to run the model on mobile devices for the first time. During testing, researchers used the Vivo X200 Pro, an Android phone with 12 GB of RAM and a MediaTek Dimensity 9400 chip, released in late 2024.
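The size figures are easy to check with a few lines of Python. The sketch below uses only numbers quoted in this article; the helper name is illustrative, and the RAM share ignores whatever memory the operating system and other apps occupy on the test phone.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# Memory footprint: 6.5 GB down to 3.6 GB.
memory_cut = percent_reduction(6.5, 3.6)

# Parameter count: 1.1 billion (original model) down to 341 million.
param_cut = percent_reduction(1100, 341)

# Share of the test phone's 12 GB of RAM taken by a 3.6 GB footprint.
ram_share = 3.6 / 12 * 100

print(f"memory reduction: {memory_cut:.1f}%")
print(f"parameter reduction: {param_cut:.1f}%")
print(f"share of 12 GB RAM: {ram_share:.0f}%")
```

The memory saving works out to about 45 percent, which matches the "nearly in half" description, and the model would occupy just under a third of the test device's RAM.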

Best suited for sound effects

Stability AI says the model is especially good at generating sound effects and field recordings. It still struggles with music, particularly with singing voices, and works best with English-language prompts.

The model was trained on roughly 472,000 clips from the Freesound database, using only material licensed under CC0, CC-BY, or CC-Sampling+ terms. To avoid copyright issues, the team filtered the data using a series of automated checks.

The software is available under the Stability AI Community License for open-source use. Commercial applications are subject to separate terms. The code is on GitHub, and model weights can be accessed via Hugging Face.

Summary
  • Stability AI and Arm have developed an AI model for smartphones that can generate stereo audio files up to eleven seconds long in around seven seconds.
  • The new "Stable Audio Open Small" model uses 341 million parameters and requires only 3.6 gigabytes of memory. Tests were carried out on a Vivo X200 Pro smartphone with 12 GB RAM.
  • The system was trained with 472,000 royalty-free audio recordings and is particularly suitable for sound effects, but still has limitations for music and vocals. The code is available as open source on GitHub.
Jonathan writes for THE DECODER about how AI tools can make our work and creative lives better.