
2022 is the year of text-to-X systems. The company Mubert is now venturing into a generative AI system that creates music based on text input. It is still in its infancy.


Founded in 2017, U.S. startup Mubert specializes in generative AI for royalty-free music. Mubert's text-to-music app is a first attempt at an AI system that generates music from text input.

A demo on Hugging Face lets users enter a prompt. The system pulls individual keywords from it and matches them to the internal tags of recorded sound clips, assembling a piece up to 100 seconds long. Humans recorded the sounds, so it's not AI-generated music in the strict sense, but rather AI-generated pieces composed of human-recorded audio clips.

Mubert describes the process as follows: "The input prompt and Mubert API tags are both encoded to latent space vectors of a transformer neural network. Then the closest tags vector is selected for each prompt and corresponding tags are sent to our API for music generation."
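The tag-matching step described here can be illustrated with a minimal sketch. It is not Mubert's code: the tag list is hypothetical, and a simple bag-of-words vector stands in for the transformer encoder so that the example runs without any model download. Only the selection logic (embed the prompt and the tags, then pick the tag with the highest cosine similarity) reflects the description.

```python
import math
from collections import Counter

# Hypothetical tag list; the real Mubert tag vocabulary is not given in the article.
TAGS = ["ambient chill", "funk trumpets", "dark techno", "calm piano"]


def embed(text):
    """Toy stand-in for a transformer encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_tag(prompt, tags=TAGS):
    """Return the tag whose vector is most similar to the prompt vector."""
    prompt_vec = embed(prompt)
    return max(tags, key=lambda tag: cosine(prompt_vec, embed(tag)))


print(closest_tag("joyous funk with bubbly trumpets"))
```

In a setup closer to the one described, `embed` would presumably be replaced by a sentence-embedding model, and the selected tags would be sent to the music API. Because only the nearest tag survives, most of the detail in a prompt is discarded at this step.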


Control via text prompts is therefore not as fine-grained as in common image generators. The demo seems to be more of an alternative front end to the generation interface that Mubert already offers on its website. The following video shows some demos.


From image to prompt to music

Mubert's AI sound service becomes multimedia when combined with images. Twitter user Sylvain Filoni has developed a Hugging Face application for this purpose: it uses CLIP Interrogator to extract a prompt from an image. This prompt is then turned into a short piece of music via the Mubert API. In a successful example, it sounds like this.

Unfortunately, the generated sound does not always match the image. The following clip, which I generated from the cover image of this article, is more melancholic than cheerful and colorful.

The difficulty level is admittedly high: the robot is holding a tuba, so you would probably expect to hear one. However, CLIP Interrogator only identifies a "musical instrument". Nevertheless, phrases like "funk art", "bubbly" or "joyous trumpets" appear in the prompt and could well have been translated into music. A second attempt with the same image produces an entirely different result, which, at least in my opinion, doesn't match the subject either. This is where the Mubert API reaches its limits.

Nevertheless, it is an interesting experiment and an indication of things to come. It was only at the beginning of October that Meta introduced "AudioGen", an AI system that can generate new audio signals from scratch to match a text input. The system is not yet designed for music, but that might only be a matter of time.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
  • Mubert, a startup for AI-generated music, unveils its first text-to-music system.
  • Based on a prompt, it generates tags that it matches with audio clips in its own database.
  • The system isn't yet as accurate as text-to-image, but it's a hint of things to come.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.