AudioGen: Meta AI generates audio from text

AudioGen is the next text-to-X project: The AI system generates audio that matches text input.

Researchers from Meta AI and the Hebrew University of Jerusalem introduce AudioGen: a Transformer-based generative AI model that can generate audio from scratch to match text input or extend existing audio input.

Whistling in the woods - while birds sing and dogs bark

According to the researchers, the AI model solves complex audio problems. For example, it can distinguish between different objects and separate them acoustically, such as when several people are speaking at the same time. It can also simulate background noise such as reverberation.

To train the model in these skills, the researchers used an augmentation technique in the learning phase that mixes different audio samples. In this way, the model learned to separate multiple sources. In total, the researchers assembled ten datasets of audio and matching text annotations.

Subsequently, AudioGen can generate new audio compositions that were not part of the training dataset in this compilation, such as a person walking through the forest whistling while birds chirp in the background. All that is needed is a text prompt, as the following video shows.

Video: Kreuk et al.

The system can also generate music and even sing, according to first author Felix Kreuk, but it is not designed to do so and currently offers few control options.

The researchers want to publish their model

The research team had humans evaluate AudioGen's results via Amazon's Mechanical Turk platform. They evaluated audio quality as well as how well the text and audio fit together, i.e., relevance. More than 85 percent of the 100 randomly selected audio samples contained at least two concepts, for example, "A dog barks while a bird sings."

The testers rated the audio samples on a scale of 1 to 100. In total, the research team had four models evaluated, including the CLIP-based DiffSound with 400 million parameters and three T5-based AudioGen with 285 million to one billion parameters.

Recommendation

AI research

The future of AI language models may lie in predicting beyond the next word, study suggests

The largest AudioGen model scored highest in quality (around 72 points) and relevance (around 68 points) with a clear gap to Diffsound (around 66 / 55 points).

An overview of the evaluation results. | Image: Kreuk et al.

The research team sees AudioGen as the first step toward better text-to-audio models in the future. The technology could also enable semantic audio processing or support the separation of audio sources from discrete units, they say. A release of the model is in the works, according to Kreuk.

In mid-September, Google introduced AudioLM, which also uses techniques of large language models to complete spoken sentences, for example, and generate entirely new audio.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

AudioGen: Meta AI generates audio from text

Whistling in the woods - while birds sing and dogs bark

The researchers want to publish their model

The future of AI language models may lie in predicting beyond the next word, study suggests

Midjourney launches powerful AI image editing tool - struggles with regulation

AI search company Perplexity eyes major funding boost

Spirit LM: Meta's AI division paves the way for its own Advanced Voice Mode

Apple's local AI agent framework paves the way for more useful Apple Intelligence

Apple AI researchers question OpenAI's claims about o1's reasoning capabilities

Tesla unveils Cybercab robot taxi, but robot Optimus is the bigger deal

AudioGen: Meta AI generates audio from text

Whistling in the woods - while birds sing and dogs bark

The researchers want to publish their model

Share

Bank details