
Meta's MusicGen generates short new pieces of music from text prompts and can optionally align them to an existing melody.

Like most language models today, MusicGen is based on a Transformer model. Just as a language model predicts the next characters in a sentence, MusicGen predicts the next section in a piece of music.
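To make the analogy concrete, here is a minimal sketch of autoregressive sampling, assuming a hypothetical `model` that maps a token sequence to next-token logits. MusicGen runs the same kind of loop over audio tokens instead of text.

```python
import torch

def sample_tokens(model, prompt: torch.Tensor, steps: int, temperature: float = 1.0) -> torch.Tensor:
    """Autoregressively extend a [batch, time] tensor of token IDs."""
    tokens = prompt
    for _ in range(steps):
        logits = model(tokens)[:, -1, :]                       # logits for the next position
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)   # sample one token per sequence
        tokens = torch.cat([tokens, next_token], dim=-1)
    return tokens
```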

The researchers decompose the audio data into smaller components using Meta's EnCodec audio tokenizer. As a single-stage model that processes tokens in parallel, MusicGen is fast and efficient.
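As an illustration of this tokenization step, the publicly released `encodec` package can compress a waveform into parallel streams of discrete tokens. This is a minimal sketch using the general-purpose 24 kHz checkpoint and an assumed local file; MusicGen itself uses a 32 kHz EnCodec variant with four codebooks, so details differ.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the general 24 kHz EnCodec model (MusicGen uses a 32 kHz variant).
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # bandwidth determines how many codebooks are used

wav, sr = torchaudio.load("input.wav")  # assumed local audio file
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))  # add batch dimension

# Concatenate the frames into a [batch, n_codebooks, time] tensor of token IDs.
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)
print(codes.shape)
```

The resulting tensor contains one row of tokens per codebook; MusicGen's Transformer predicts these parallel streams with an efficient interleaving pattern rather than one stream at a time.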

The team used 20,000 hours of licensed music for training. In particular, they relied on an internal dataset of 10,000 high-quality music tracks, as well as music data from Shutterstock and Pond5.


MusicGen can handle both text and music prompts

In addition to its efficient architecture and fast generation, MusicGen stands out for its ability to handle both text and music prompts. The text prompt sets the basic style; the generated music then follows the melody in the audio file.

For example, if you combine the text prompt "a light and cheerful EDM track with syncopated drums, airy pads and strong emotions, tempo: 130 BPM" with the melody of Bach's world-famous "Toccata and Fugue in D Minor (BWV 565)", the following piece of music can be generated.

Video: Meta

The melody conditioning cannot be controlled precisely, for example to reliably render the same melody in different styles. The melody serves only as a rough guideline for the generation and is not reproduced exactly in the output.
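In code, melody conditioning is exposed through the audiocraft library Meta released alongside the model. This sketch follows the project README at release time; the checkpoint name 'melody' and the file bach_toccata.mp3 are assumptions and may differ from your setup.

```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the melody-capable checkpoint (name as used in the initial release).
model = MusicGen.get_pretrained('melody')
model.set_generation_params(duration=12)  # seconds of audio to generate

description = ["a light and cheerful EDM track with syncopated drums, "
               "airy pads and strong emotions, tempo: 130 BPM"]
melody, sr = torchaudio.load('bach_toccata.mp3')  # assumed local audio file

# The melody guides generation via chroma features; it is not copied verbatim.
wav = model.generate_with_chroma(description, melody[None], sr)
audio_write('edm_toccata', wav[0].cpu(), model.sample_rate, strategy="loudness")
```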

MusicGen just ahead of Google's MusicLM

The authors of the study ran tests on three versions of their model at different sizes: 300 million (300M), 1.5 billion (1.5B), and 3.3 billion (3.3B) parameters. They found that the larger models produced higher quality audio, but the 1.5 billion parameter model was rated best by humans. The 3.3 billion parameter model, on the other hand, is better at accurately matching text input and audio output.


Compared to other music models such as Riffusion, Mousai, MusicLM, and Noise2Music, MusicGen performs better on both objective and subjective metrics that assess how well the music matches the text prompt and how plausible the composition sounds. Overall, the models score just above the level of Google's MusicLM.

Meta has released the code and models as open source on GitHub, and commercial use is permitted. A demo is available on Hugging Face.
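For plain text-to-music generation, usage looks roughly like this, again following the audiocraft README at release time (checkpoint names such as 'small' may have changed in later releases):

```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# 'small' corresponds to the 300M model; 'medium' = 1.5B, 'large' = 3.3B.
model = MusicGen.get_pretrained('small')
model.set_generation_params(duration=8)

descriptions = ['lo-fi hip hop beat with warm piano',
                'driving synthwave at 110 BPM']
wavs = model.generate(descriptions)

# Write one normalized WAV file per prompt.
for idx, one_wav in enumerate(wavs):
    audio_write(f'sample_{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
```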

Objective metrics:
  • Fréchet Audio Distance (FAD): lower values indicate more plausible generated audio.
  • Kullback-Leibler Divergence (KL): a lower score indicates that the generated music shares similar concepts with the reference music.
  • CLAP score: quantifies the alignment between audio and text.

Subjective metrics:
  • Overall Quality (OVL): human raters scored the perceptual quality of the audio samples on a scale of 1 to 100.
  • Relevance to Text Input (REL): human raters scored the match between audio and text on a scale of 1 to 100.

Image: Meta
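For intuition, a CLAP-style score reduces to a cosine similarity between an audio embedding and a text embedding. The sketch below uses random placeholder embeddings purely for illustration; in practice both vectors come from a pretrained CLAP model's encoders.

```python
import torch
import torch.nn.functional as F

def clap_score(audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # Cosine similarity: 1.0 means perfectly aligned directions, 0.0 unrelated.
    return F.cosine_similarity(audio_emb, text_emb, dim=-1)

# Placeholder embeddings; a real evaluation would obtain these from a
# CLAP model's audio and text encoders.
audio_emb = torch.randn(1, 512)
text_emb = torch.randn(1, 512)
print(clap_score(audio_emb, text_emb).item())
```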
Summary
  • Meta's MusicGen is an AI model that generates new pieces of music from text input and can optionally be based on existing melodies.
  • The Transformer-based architecture enables efficient processing of audio and text data. Tests show that MusicGen performs slightly better than Google's MusicLM.
  • Meta has released the model and code as open source for research and commercial use. A demo is available on Hugging Face.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.