Where is the "DALL-E for music"?

Nov 22, 2022


Enter a line of text and hear a piece of music after a few seconds? There are still some hurdles to clear before that happens, says one analyst.

First came AI-generated text, then images, which have recently grown more sophisticated. HD video and 3D AI generators are also in the works.

That rightly raises the question: Where is a similar service to GPT-3, Midjourney, or DALL-E for the music industry? Cherie Hu of Water and Music, a research and intelligence network for the new music industry, made some arguments in a Twitter thread as to why such a service is a long time coming.

Too little training data, too many copyrights

The first point she raises is the lack of training data. While each of the available text-to-image models has been trained with dozens of terabytes of data, there is not nearly as much public training data for music. To get to that point, Hu says, you'd have to train a model with all published music and also access the private drafts of DAWs like GarageBand, Ableton Live, or Logic.

As with image generators, copyright considerations also play a major role: It is true that millions of music tracks can be pirated from music streaming services and then used for training. But that would immediately bring the major labels and their lawyers onto the scene.

"Lawyers in the music industry have more power than in any other creative industry," Hu says. Some artists and coders are already fighting generative AI that could infringe on copyrights.

Lack of expertise outside academic research

While the open-source community is driving breakthroughs in image and text AI, AI for music is still dominated by academia. "There's less data, so the work is just harder and slower," Hu says. "And the nexus of people who know machine learning, music production, signal processing, etc., is tiny."

According to Hu, this also has to do with the fact that music is more difficult to sift through and, above all, to evaluate than visual art. "It literally takes time to listen to and evaluate a one-minute song. In that same time, you can scan hundreds of images."

Hu summarizes that the best AI models for music currently …

require more specialist technical knowledge to run,
take longer to run,
are more expensive to run,
have only OK output,
and are harder to rally public excitement around.

When will generative AI for music have its Midjourney moment?

However, Hu's conclusion gives the music industry no reason to breathe a sigh of relief: "This is all going to change very soon, given how quickly the creative AI landscape is evolving."

Early examples include startups like Mubert, which recently unveiled a text-to-music model, and Sony's AI division, which is researching neural synthesizers.

The open-source project Harmonai is also worth mentioning. It describes itself as a community-oriented organization that provides open-source tools for generative audio to make music production more accessible to all.

Its current work, "Dance Diffusion," a generative audio model, is already available for testing through the Dance Diffusion Colab. Harmonai is supported by London-based startup Stability AI, which also enabled the open-source Stable Diffusion model.
