Stability AI's Stable Audio generates high-quality audio from simple text input

London-based startup Stability AI on Wednesday unveiled a new product called Stable Audio, which uses AI to generate custom music tracks and sound effects.

Stable Audio uses a diffusion-based AI model to create custom audio files in seconds from simple text input. Users specify genres, instruments, tuning, and other characteristics. The system then automatically composes matching songs, sound effects, or instrument stems.

Up to 90 seconds at 44.1 kHz

Stability AI tested the tool with inputs such as "Post-Rock, Guitars, Drum Kit, Bass, Strings, Euphoric, Up-Lifting, Moody, Flowing, Raw, Epic, Sentimental, 125 BPM". The result is a fast, atmospheric rock song at 125 BPM. According to Stability, Stable Audio can create songs in a variety of other genres as well, including ambient, techno, and trance.

Unlike previous AI-based music generators, Stable Audio seems to be able to produce musically coherent pieces in professional audio quality of 44.1 kHz over a longer period of up to 90 seconds.

Stable Audio can only be used in the browser, making it accessible to inexperienced users. The older model Dance Diffusion required some expertise and manual fine-tuning. | Image: Ai4lunch/Twitter

The published samples sound authentic and give little hint that no human composer is behind them. According to Stability AI, an Nvidia A100 GPU can generate 95 seconds of audio in less than a second.
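To put the speed claim in perspective, here is a rough back-of-the-envelope sketch of how many samples such a clip contains and what "less than a second" implies. The stereo channel count is an assumption taken from the technical description later in the article, and the one-second render time is only the stated upper bound, not a measured value.

```python
# Figures from the article.
SAMPLE_RATE_HZ = 44_100     # 44.1 kHz
CLIP_SECONDS = 95           # clip length quoted for the A100 benchmark

# Assumptions for illustration only.
CHANNELS = 2                # stereo, per the VAE description in the article
MAX_RENDER_SECONDS = 1.0    # "less than a second" treated as an upper bound


def total_samples(seconds: int, sample_rate_hz: int, channels: int) -> int:
    """Number of individual sample values in the finished clip."""
    return seconds * sample_rate_hz * channels


def min_realtime_factor(clip_seconds: float, render_seconds: float) -> float:
    """Lower bound on how much faster than real time the clip is produced."""
    return clip_seconds / render_seconds


samples = total_samples(CLIP_SECONDS, SAMPLE_RATE_HZ, CHANNELS)
factor = min_realtime_factor(CLIP_SECONDS, MAX_RENDER_SECONDS)

print(f"{samples:,} sample values per clip")
print(f"at least {factor:.0f}x faster than real time")
```

Under these assumptions, the clip holds roughly 8.4 million sample values and is produced at least 95 times faster than it takes to play back.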

The following sample songs and audio effects were generated using prompts such as "People talking in a crowded restaurant" or "Piano solo chord progression in major, upbeat 90 BPM".

You can listen to more Stable Audio demo songs here. Unfortunately, the servers are currently under heavy load, so you will need some luck to test the system yourself.

Artists get a share of the Stable Audio revenue

To achieve this quality, the system was trained on a music library provided by AudioSparx. Stability AI partnered with AudioSparx and promised the music library a cut of Stable Audio's revenue in exchange for using its approximately 800,000 songs, audio effects, and instrument snippets. The creators of the songs used in the training can in turn share in Stable Audio's profits through AudioSparx.

The creators were reportedly asked before the training whether they wanted to make their songs available. This decision may be a response to the massive opposition Stability has faced in the copyright debate surrounding Stable Diffusion's training material.

According to Stability AI, users may use tracks created with Stable Audio for personal use free of charge. Commercial use requires a paid subscription. The company is targeting creative professionals, such as filmmakers or game developers, who need appropriate background music quickly.

Stability AI also plans to release an open-source music model trained on different datasets

Unlike the popular image model Stable Diffusion, Stable Audio is not open source. However, the FAQ states that an open-source model trained on other data will be released soon.

The predecessor of Stable Audio is the music model Dance Diffusion, which was released by Harmonai in 2022 with support from Stability. Stable Audio itself, however, is a model developed from scratch by the audio division of Stability AI, which was founded in April.

Using diffusion models for music is not a new idea. According to the company, however, Stable Audio's strength lies in its ability to produce pieces of varying lengths, which was taken into account during training.

Stability AI explains the underlying technique this way:

Stable Audio is a latent diffusion model with several parts: a Variational Autoencoder (VAE), a text encoder, and a U-net-based diffusion model.
The VAE compresses stereo audio into a lossy, noise-resistant, and invertible latent coding, allowing faster generation and training.
A frozen text encoder of a newly trained CLAP model is used for text prompts.
Timing embeddings are computed during training and used to control the output audio length.
The diffusion model for Stable Audio is a 907 million parameter U-net based on the Moûsai model.
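The role of the timing embeddings can be illustrated with a small sketch. It is not Stability AI's implementation: it assumes that the model always fills a fixed window (set here to the 95 seconds quoted above), that the requested length is passed in as a conditioning value, and that the output is trimmed afterwards. The names `TimingCondition`, `seconds_start`, `seconds_total`, and `trim_to_length` are hypothetical.

```python
from dataclasses import dataclass

SAMPLE_RATE_HZ = 44_100   # from the article
WINDOW_SECONDS = 95.0     # assumption: fixed generation window


@dataclass(frozen=True)
class TimingCondition:
    """Hypothetical timing values passed to the model next to the text prompt."""
    seconds_start: float   # where the chunk starts within the full piece
    seconds_total: float   # requested length of the full piece

    def normalized(self) -> tuple[float, float]:
        """Scale both values to [0, 1] relative to the generation window."""
        if not 0.0 <= self.seconds_start <= self.seconds_total <= WINDOW_SECONDS:
            raise ValueError("timing values must fit inside the generation window")
        return (self.seconds_start / WINDOW_SECONDS,
                self.seconds_total / WINDOW_SECONDS)


def trim_to_length(samples: list[float], cond: TimingCondition) -> list[float]:
    """Keep only the part of the fixed window that the user asked for."""
    keep = round(cond.seconds_total * SAMPLE_RATE_HZ)
    return samples[:keep]


# Stand-in for one mono channel of model output: a full window of silence.
window = [0.0] * int(WINDOW_SECONDS * SAMPLE_RATE_HZ)

cond = TimingCondition(seconds_start=0.0, seconds_total=30.0)
clip = trim_to_length(window, cond)

print(cond.normalized())
print(len(clip) / SAMPLE_RATE_HZ, "seconds kept")
```

In this reading, the model learns during training what a piece of a given total length sounds like, so a 30-second request yields a track that ends after 30 seconds rather than an arbitrary cut from a longer one.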

You can use Stable Audio exclusively through the recently launched web interface. 20 songs per month of up to 45 seconds are free for personal use. For $11.99 per month, you get 500 songs with up to 90 seconds of playback time and a commercial license.
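For a quick comparison of the two tiers, the following sketch works out the maximum audio per month and the price per song from the figures above. The plan labels `free` and `paid` are placeholders chosen for this example, not official plan names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str              # placeholder label, not an official plan name
    price_usd: float       # monthly price
    songs_per_month: int   # generation quota
    max_seconds: int       # maximum length per song

    def max_minutes(self) -> float:
        """Upper bound on generated audio per month, in minutes."""
        return self.songs_per_month * self.max_seconds / 60

    def usd_per_song(self) -> float:
        """Price per song if the whole quota is used."""
        return self.price_usd / self.songs_per_month


# Figures from the article.
FREE = Plan("free", 0.0, 20, 45)
PAID = Plan("paid", 11.99, 500, 90)

for plan in (FREE, PAID):
    print(f"{plan.name}: up to {plan.max_minutes():.0f} min/month, "
          f"${plan.usd_per_song():.3f} per song")
```

With the full quota used, the free tier tops out at 15 minutes of audio per month, while the paid tier allows up to 750 minutes at about 2.4 cents per song.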

No content filter could lead to easy plagiarism

The tool could also be used to fake songs by popular artists. So far, labels have been able to successfully fight such AI creations, but the legal situation is still unclear.

In an interview with Techcrunch, Stability AI insists that it wants to use the technology responsibly. AudioSparx's database does not contain pop songs, but it does contain many tracks labeled as being in the style of well-known artists. Unlike in Google's MusicLM, the names of famous artists are not blocked, at least not yet.

It remains to be seen whether Stable Audio will pay off for Stability AI's business, which has so far been in the red. In any case, the impressive quality of the AI compositions makes you sit up and take notice.
