
Stability AI and Arm have optimized the Stable Audio Open model to run on phone processors, enabling offline audio generation directly on mobile devices.


Stable Audio Open, released in summer 2024, generates up to 47 seconds of audio from text prompts. The model specializes in short-form audio such as drum beats, instrumental riffs, ambient sounds and Foley recordings. Unlike the commercial Stable Audio 2, it isn't designed to create complete songs the way services such as Suno do.

The initial version of Stable Audio Open took 240 seconds to generate an 11-second clip on Arm CPUs. Through model distillation and Arm's software stack, generation time for a clip of that length dropped to under 8 seconds on Armv9 processors, a 30x speed improvement.
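The arithmetic behind the 30x figure can be checked with a short sketch. It uses only the numbers quoted above and assumes both timings refer to the same 11-second clip. The function names are illustrative and do not come from Stability AI or Arm. Because the optimized time is reported as "under 8 seconds", the results are bounds rather than exact values.

```python
# Figures quoted in the article; the names below are illustrative only.
BASELINE_SECONDS = 240.0   # generation time before optimization
OPTIMIZED_SECONDS = 8.0    # upper bound ("under 8 seconds") after optimization
CLIP_SECONDS = 11.0        # length of the generated clip (assumed equal in both cases)


def speedup(before: float, after: float) -> float:
    """Return how many times faster the optimized run is."""
    return before / after


def real_time_factor(generation_seconds: float, clip_seconds: float) -> float:
    """Return generation time per second of audio; below 1.0 is faster than playback."""
    return generation_seconds / clip_seconds


if __name__ == "__main__":
    print(f"speedup: at least {speedup(BASELINE_SECONDS, OPTIMIZED_SECONDS):.0f}x")
    print(f"real-time factor before: {real_time_factor(BASELINE_SECONDS, CLIP_SECONDS):.1f}")
    print(f"real-time factor after: under {real_time_factor(OPTIMIZED_SECONDS, CLIP_SECONDS):.2f}")
```

A real-time factor below 1.0 means the clip is generated faster than it takes to play back, which is what makes on-device use practical.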

The implementation uses Arm's KleidiAI libraries to run audio generation on the device's own processor, with no internet connection required. Stability AI's blog post doesn't detail the technical specifics, and no research paper has been published yet. The optimization makes the model accessible to anyone with a compatible Arm-based mobile device.


Stability AI intends to bring its image, video and 3D generation models to mobile devices through the Arm partnership. This focus on mobile development differs from the company's previous strategy of frequent Stable Diffusion image model releases. The London-based startup appointed a new CEO in June 2024 amid financial difficulties and staff departures.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • Stability AI and Arm have adapted the Stable Audio Open AI model for use on mobile phones, allowing users to quickly create audio clips and sound effects directly on their device, even offline.
  • By simplifying the model and utilizing Arm's specialized software, the companies drastically reduced the processing time. Generating an 11-second audio clip now takes less than 8 seconds on Armv9 CPUs, compared to the previous 240 seconds.
  • Stability AI intends to extend this approach to its other advanced image, video, and 3D generation tools.
Jonathan writes for THE DECODER about how AI tools can make our work and creative lives better.