Google introduces MusicLM, a generative text-to-music model. It can generate multi-minute tracks from text prompts.
While generative AI models for images can already produce results that rival the visual quality of human artists' work, models for audio and music still lag far behind. A "DALL-E for music" is difficult to realize. There are approaches like Meta's AudioGen, Riffusion or Google's AudioLM, but no convincing generative music model yet.
In addition to the complicated copyright situation for music, the temporal dimension is a major challenge: images are static, music changes over time. Depending on the culture, these changes follow certain rules, but those rules can also be broken.
Google's MusicLM generates several minutes of music that sounds decent
AudioLM is a generative AI model for speech, audio, and music. AudioLM uses techniques from large language models: A BERT model specialized for audio (w2v-BERT) constructs semantic tokens from audio waveforms that can capture, for example, the phonetics of speech or local melodies, harmonies, or rhythms. A neural audio codec called SoundStream captures the finer details of audio waveforms in acoustic tokens and is responsible for high-quality audio synthesis.
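To make the idea of acoustic tokens more concrete, here is a minimal sketch of residual quantization, the general technique SoundStream's codec builds on: each codebook encodes what the previous one left over, so a single audio frame ends up as a short list of token indices. This is a toy on single numbers with hand-picked codebooks; the function names and codebook values are illustrative assumptions, and the real SoundStream learns its encoder, codebooks, and decoder from data.

```python
# Toy illustration only: real SoundStream quantizes learned embedding vectors,
# not single numbers, and its codebooks are trained rather than hand-picked.

def residual_quantize(value, codebooks):
    """Encode one number as a list of indices, one per codebook.

    Each codebook quantizes the residual left by the previous one,
    so later tokens carry progressively finer detail.
    """
    tokens = []
    residual = value
    for codebook in codebooks:
        # Pick the codebook entry closest to what is still unexplained.
        index = min(range(len(codebook)), key=lambda i: abs(residual - codebook[i]))
        tokens.append(index)
        residual -= codebook[index]
    return tokens


def dequantize(tokens, codebooks):
    """Rebuild the approximate value by summing the selected entries."""
    return sum(codebook[i] for i, codebook in zip(tokens, codebooks))


# Hypothetical codebooks, ordered from coarse to fine.
CODEBOOKS = [
    [-0.5, 0.0, 0.5],
    [-0.2, 0.0, 0.2],
    [-0.05, 0.0, 0.05],
]

frame = 0.63
tokens = residual_quantize(frame, CODEBOOKS)      # [2, 2, 0]
reconstruction = dequantize(tokens, CODEBOOKS)    # close to the original frame
print(tokens, round(reconstruction, 2))
```

Dropping the later tokens gives a coarser reconstruction, which is why a stack of such tokens can trade bitrate against audio detail.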
Now Google is introducing MusicLM, a generative AI system that combines AudioLM with another model. This third component is called MuLan, a model trained by Google on pairs of music and text that maps both into a shared representation. Alongside MusicLM, Google published MusicCaps, a dataset of about 5,500 10-second music clips with matching text descriptions written by ten professional musicians, which the team uses to evaluate the model.
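After training, MusicLM predicts acoustic tokens, given both MuLan audio tokens and w2v-BERT's semantic tokens. At generation time, MuLan tokens computed from the text prompt take the place of the audio-derived ones. The acoustic tokens are then converted to audio by SoundStream. Using this method, Google can generate several minutes of music.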
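The flow of tokens through these components can be sketched as a chain of stages. The sketch below is not Google's code: every function is a hypothetical stand-in that returns pseudo-random tokens, the vocabulary and frame sizes are made-up, and the first stage (semantic tokens predicted from MuLan tokens) follows the description in the MusicLM paper rather than the summary above. It only shows which stage is conditioned on what.

```python
import random

# Assumed toy sizes; the real models use their own vocabularies and frame rates.
MULAN_VOCAB = 1024
SEMANTIC_VOCAB = 1024
ACOUSTIC_VOCAB = 1024


def mulan_tokens_from_text(prompt, n_tokens=12):
    """Stand-in for MuLan: map a text prompt to conditioning tokens."""
    rng = random.Random(prompt)  # seeded so the same prompt gives the same tokens
    return [rng.randrange(MULAN_VOCAB) for _ in range(n_tokens)]


def semantic_stage(mulan_tokens, n_frames):
    """Stand-in for stage 1: semantic tokens conditioned on MuLan tokens."""
    rng = random.Random(repr(("semantic", mulan_tokens)))
    return [rng.randrange(SEMANTIC_VOCAB) for _ in range(n_frames)]


def acoustic_stage(mulan_tokens, semantic_tokens, tokens_per_frame=4):
    """Stand-in for stage 2: acoustic tokens conditioned on MuLan and semantic tokens."""
    rng = random.Random(repr(("acoustic", mulan_tokens, semantic_tokens)))
    return [
        [rng.randrange(ACOUSTIC_VOCAB) for _ in range(tokens_per_frame)]
        for _ in semantic_tokens
    ]


def soundstream_decode(acoustic_tokens, samples_per_frame=8):
    """Stand-in for the SoundStream decoder: tokens to samples in [-1, 1)."""
    waveform = []
    for frame in acoustic_tokens:
        level = sum(frame) / (len(frame) * ACOUSTIC_VOCAB)  # value in [0, 1)
        waveform.extend([2 * level - 1] * samples_per_frame)
    return waveform


def generate(prompt, n_frames=50):
    """Run the stages in order: text -> MuLan -> semantic -> acoustic -> audio."""
    mulan = mulan_tokens_from_text(prompt)
    semantic = semantic_stage(mulan, n_frames)
    acoustic = acoustic_stage(mulan, semantic)
    return soundstream_decode(acoustic)


waveform = generate("relaxing jazz", n_frames=10)
print(len(waveform))  # 10 frames * 8 samples per frame
```

In the real system each stage is a trained sequence model and the decoder produces actual audio; the point here is only the order of the stages and what each one receives as input.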
MusicLM can be controlled with melodies
The results range from a slow reggae song to an arcade game soundtrack, from relaxing jazz to Gregorian chants. MusicLM can be controlled with a short phrase or with detailed descriptions.
Prompt
The main soundtrack of an arcade game. It is fast-paced and upbeat, with a catchy electric guitar riff. The music is repetitive and easy to remember, but with unexpected sounds, like cymbal crashes or drum rolls.
MusicLM Output
Prompt
We can hear a choir, singing a Gregorian chant, and a drum machine, creating a rhythmic beat. The slow, stately sounds of strings provide a calming backdrop for the fast, complex sounds of futuristic electronic music.
MusicLM Output
MusicLM can also process a combination of a melody and a text prompt, for example rendering the melody of an acoustic guitar piece as a synth lead.
Prompt (Fingerstyle Guitar Melody)
MusicLM Output (electronic synth lead)
MusicLM still has problems with vocals, negations in prompts, and temporal sequences. The team plans to address these issues in the future, and also plans to improve the quality of the generated audio.
More information and examples can be found on the MusicLM project page. According to the paper, there are currently no plans to release the model.