Google MusicLM turns language into music

Google introduces MusicLM, a generative text-to-music model. It can generate multi-minute tracks from text prompts.

While generative AI models for images have already reached the visual quality of human artists, models for audio and music still lag far behind. A "DALL-E for music" is difficult to realize. There are approaches like Meta's AudioGen, Riffusion or Google's AudioLM, but no convincing generative music model yet.

In addition to the complicated copyright situation for music, the temporal dimension is a major challenge: images are static, while music changes over time. Depending on the culture, these changes follow certain rules, though those rules can also be broken.

Google's MusicLM generates several minutes of music that sounds decent

AudioLM is a generative AI model for speech, audio, and music. It borrows techniques from large language models: a BERT model specialized for audio (w2v-BERT) constructs semantic tokens from audio waveforms that can capture, for example, the phonetics of speech or local melodies, harmonies, and rhythms. A neural audio codec called SoundStream captures the finer details of the waveform in acoustic tokens and is responsible for high-quality audio synthesis.
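To make the idea of turning a waveform into discrete tokens more concrete, here is a toy sketch. It is not how w2v-BERT or SoundStream work internally (both are learned neural networks); it only assumes a tiny, hand-written codebook and assigns each short frame of audio to its nearest entry. The names `frame`, `nearest`, `tokenize` and `CODEBOOK` are made up for this illustration.

```python
from typing import List, Sequence


def frame(signal: Sequence[float], size: int) -> List[List[float]]:
    """Split a waveform into non-overlapping frames, dropping any remainder."""
    return [list(signal[i:i + size]) for i in range(0, len(signal) - size + 1, size)]


def nearest(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to vec (squared distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, codebook[k])),
    )


def tokenize(signal: Sequence[float], codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame of the signal to the ID of its nearest codebook entry."""
    size = len(codebook[0])
    return [nearest(f, codebook) for f in frame(signal, size)]


# Hypothetical codebook of 2-sample patterns: silence, rising, falling, loud.
CODEBOOK = [[0.0, 0.0], [-0.5, 0.5], [0.5, -0.5], [1.0, 1.0]]

wave = [0.0, 0.1, -0.4, 0.6, 0.9, 1.0, 0.4, -0.6, 0.3]
tokens = tokenize(wave, CODEBOOK)
print(tokens)  # [0, 1, 3, 2]; the trailing sample 0.3 does not fill a frame
```

In the real system, a coarse token stream of this kind would carry structure such as melody and rhythm, while a finer one would carry the detail needed to reconstruct the sound.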

Now Google is introducing MusicLM, a generative AI system that combines AudioLM with another model. This third component, next to w2v-BERT and SoundStream, is MuLan, a model Google trained on pairs of music audio and matching text so that both can be represented in a shared embedding space. Alongside the paper, Google published MusicCaps, a dataset of 5,500 ten-second music clips with text descriptions written by ten professional musicians.

After training, MusicLM predicts acoustic tokens, given both MuLan audio tokens and w2v-BERT's semantic tokens. These acoustic tokens are then converted to audio by SoundStream. Using this method, Google can generate several minutes of music.
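The data flow described above can be sketched roughly as follows. This is a minimal illustration, not Google's implementation: every function is a hypothetical stub that returns pseudo-random token IDs, the vocabulary size and token rates are arbitrary, and it assumes that at generation time the text prompt is what MuLan turns into conditioning tokens.

```python
import random
from typing import List

VOCAB = 1024  # arbitrary token vocabulary size for this sketch


def mulan_tokens(prompt: str, n: int = 12) -> List[int]:
    """Stub for MuLan: text prompt -> conditioning tokens."""
    rng = random.Random(prompt)  # seeded by the prompt, so runs are repeatable
    return [rng.randrange(VOCAB) for _ in range(n)]


def semantic_tokens(cond: List[int], seconds: int, rate: int = 25) -> List[int]:
    """Stub for the stage that predicts w2v-BERT-style semantic tokens."""
    rng = random.Random(sum(cond))
    return [rng.randrange(VOCAB) for _ in range(seconds * rate)]


def acoustic_tokens(cond: List[int], semantic: List[int], per_step: int = 2) -> List[int]:
    """Stub for the stage that predicts acoustic tokens from both inputs."""
    rng = random.Random(sum(cond) + sum(semantic))
    return [rng.randrange(VOCAB) for _ in range(len(semantic) * per_step)]


def soundstream_decode(acoustic: List[int], samples_per_token: int = 4) -> List[float]:
    """Stub for SoundStream decoding: acoustic tokens -> samples in [-1, 1]."""
    return [2 * t / (VOCAB - 1) - 1 for t in acoustic for _ in range(samples_per_token)]


def generate(prompt: str, seconds: int) -> List[float]:
    """Chain the stages: text -> MuLan -> semantic -> acoustic -> audio."""
    cond = mulan_tokens(prompt)
    semantic = semantic_tokens(cond, seconds)
    acoustic = acoustic_tokens(cond, semantic)
    return soundstream_decode(acoustic)


waveform = generate("relaxing jazz", seconds=2)
print(len(waveform))  # 2 s * 25 * 2 * 4 = 400 samples in this toy setup
```

The point of the sketch is the ordering: the coarse semantic tokens are produced first and fix the long-range structure, and the acoustic tokens are then filled in conditioned on both the prompt and that structure.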

MusicLM can be controlled with melodies

The results range from a slow reggae song to an arcade game soundtrack, from relaxing jazz to Gregorian chants. MusicLM can be controlled with a short phrase or with detailed descriptions.

Prompt

The main soundtrack of an arcade game. It is fast-paced and upbeat, with a catchy electric guitar riff. The music is repetitive and easy to remember, but with unexpected sounds, like cymbal crashes or drum rolls.

MusicLM Output

Prompt

We can hear a choir, singing a Gregorian chant, and a drum machine, creating a rhythmic beat. The slow, stately sounds of strings provide a calming backdrop for the fast, complex sounds of futuristic electronic music.

MusicLM Output

MusicLM can also be conditioned on a melody together with a text prompt, for example rendering the melody of an acoustic guitar piece as a synth lead.

Prompt (Fingerstyle Guitar Melody)

MusicLM Output (electronic synth lead)

MusicLM still has problems with vocals, negations in prompts, and temporal sequences. The team plans to address these issues in the future, and also plans to improve the quality of the generated audio.

More information and examples can be found on the MusicLM project page. According to the paper, there are currently no plans to release the model.
