
Google introduces MusicLM, a generative text-to-music model. It can generate multi-minute tracks from text prompts.

While generative AI models for images have already reached the visual quality of human artists, models for audio and music still lag far behind. A "DALL-E for music" is difficult to realize. There are approaches like Meta's AudioGen, Riffusion or Google's AudioLM, but no convincing generative music model yet.

In addition to the complicated copyright situation for music, the temporal dimension is a major challenge: images are static, while music changes over time. Depending on the culture, these changes follow certain rules, but those rules can also be broken.

Google's MusicLM generates several minutes of music that sounds decent

AudioLM is a generative AI model for speech, audio, and music. AudioLM uses techniques from large language models: a BERT model specialized for audio (w2v-BERT) constructs semantic tokens from audio waveforms that can capture, for example, the phonetics of speech or local melodies, harmonies, or rhythms. A neural audio codec called SoundStream captures the finer details of audio waveforms in acoustic tokens and is responsible for high-quality audio synthesis.

Now Google is introducing MusicLM, a generative AI system that combines AudioLM with another model. This third component, alongside w2v-BERT and SoundStream, is called MuLan, a joint music-text model trained by Google on pairs of music recordings and matching text. To support further research, Google also published MusicCaps, a dataset of 5,500 ten-second music clips with text descriptions written by ten professional musicians.

After training, MusicLM predicts acoustic tokens, given both MuLan audio tokens and w2v-BERT's semantic tokens. These are then converted to audio by SoundStream. Using this method, Google can generate several minutes of music.
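
The data flow described above can be pictured with a toy sketch. Every function here is a hypothetical stand-in that returns pseudo-random numbers; none of it is Google's code, API, or model, and the token counts and vocabulary size are assumptions chosen only for illustration. It shows the order of the steps: conditioning tokens from the prompt, coarse semantic tokens, fine acoustic tokens conditioned on both, and finally decoding to samples.

```python
import random
from typing import List

VOCAB = 1024  # assumed toy vocabulary size, not taken from the paper


def mulan_tokens(prompt: str, n: int = 12) -> List[int]:
    """Hypothetical stand-in for MuLan: text prompt -> conditioning tokens."""
    rng = random.Random(prompt)  # seeded by the prompt, so the toy is repeatable
    return [rng.randrange(VOCAB) for _ in range(n)]


def semantic_tokens(cond: List[int], n_steps: int) -> List[int]:
    """Hypothetical stand-in for the semantic stage (coarse structure)."""
    rng = random.Random(sum(cond))
    return [rng.randrange(VOCAB) for _ in range(n_steps)]


def acoustic_tokens(cond: List[int], semantic: List[int], per_step: int = 4) -> List[int]:
    """Hypothetical stand-in for the acoustic stage.

    It is conditioned on both token streams and emits several fine-grained
    tokens for each semantic step.
    """
    rng = random.Random(sum(cond) + sum(semantic))
    return [rng.randrange(VOCAB) for _ in range(len(semantic) * per_step)]


def soundstream_decode(acoustic: List[int], samples_per_token: int = 8) -> List[float]:
    """Hypothetical stand-in for the SoundStream decoder: tokens -> samples in [-1, 1]."""
    samples: List[float] = []
    for token in acoustic:
        value = token / (VOCAB - 1) * 2 - 1  # scale the token id to [-1, 1]
        samples.extend([value] * samples_per_token)
    return samples


def generate(prompt: str, n_steps: int = 25) -> List[float]:
    """Chain the stages in the order described in the article."""
    cond = mulan_tokens(prompt)
    sem = semantic_tokens(cond, n_steps)
    aco = acoustic_tokens(cond, sem)
    return soundstream_decode(aco)


waveform = generate("relaxing jazz")
print(len(waveform))  # 25 steps * 4 tokens * 8 samples = 800
```

In this toy, the length of the output is fixed by the number of semantic steps, which illustrates why a longer token sequence yields a longer track.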

MusicLM can be controlled with melodies

The results range from a slow reggae song to an arcade game soundtrack, from relaxing jazz to Gregorian chants. MusicLM can be controlled with a short phrase or with detailed descriptions.

Prompt

The main soundtrack of an arcade game. It is fast-paced and upbeat, with a catchy electric guitar riff. The music is repetitive and easy to remember, but with unexpected sounds, like cymbal crashes or drum rolls.

MusicLM Output

Prompt

We can hear a choir, singing a Gregorian chant, and a drum machine, creating a rhythmic beat. The slow, stately sounds of strings provide a calming backdrop for the fast, complex sounds of futuristic electronic music.

MusicLM Output

MusicLM can also process a combination of melody and text, such as converting the melody of an acoustic guitar piece to a synth lead.

Prompt (Fingerstyle Guitar Melody)

MusicLM Output (electronic synth lead)

MusicLM still has problems with vocals, negations in prompts, and temporal sequences. The team plans to address these issues in the future, and also plans to improve the quality of the generated audio.

More information and examples can be found on the MusicLM project page. According to the paper, there are currently no plans to release the model.

Summary
  • Google is demonstrating MusicLM, a generative AI model for text-to-music.
  • MusicLM generates tracks up to five minutes long in a variety of styles based on text prompts.
  • Unlike previous generative music models, MusicLM's tracks are actually decent.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.