Image to text to music with CLIP interrogator and Mubert API

DALL-E 2 prompted by THE DECODER

2022 is the year of text-to-X systems. The company Mubert is now venturing into a generative AI system that creates music based on text input. It is still in its infancy.

Founded in 2017, U.S. startup Mubert specializes in generative AI for royalty-free music. Its text-to-music app is a first attempt at generating music directly from text input.

A demo on Hugging Face lets users enter a prompt. The system pulls individual keywords from the prompt and matches them to the internal tags of recorded sound clips, assembling a piece up to 100 seconds long. Humans recorded the sounds, so it is not AI-generated music in the strict sense, but rather AI-assembled pieces built from human-recorded audio clips.

The input prompt and Mubert API tags are both encoded to latent space vectors of a transformer neural network. Then the closest tags vector is selected for each prompt and corresponding tags are sent to our API for music generation.

Mubert
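
The matching step Mubert describes can be illustrated with a minimal sketch. It is not Mubert's implementation: the tag list is hypothetical, and `embed` is a toy bag-of-words stand-in for the transformer encoder mentioned in the quote. Only the selection logic (embed prompt and tags, pick the tag vector with the highest cosine similarity) follows the description.

```python
import math
from collections import Counter
from typing import Sequence

# Hypothetical tag vocabulary; the real Mubert tag list is not given in the article.
TAGS = [
    "calm piano ambient",
    "energetic techno club",
    "funk brass joyful",
    "dark cinematic drone",
]


def embed(text: str) -> Counter:
    """Toy stand-in for a transformer encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_tag(prompt: str, tags: Sequence[str]) -> str:
    """Return the tag whose vector is most similar to the prompt vector."""
    prompt_vec = embed(prompt)
    return max(tags, key=lambda tag: cosine(prompt_vec, embed(tag)))


if __name__ == "__main__":
    print(closest_tag("a joyful robot playing funk on a tuba", TAGS))
```

With a real sentence encoder in place of `embed`, prompts that share no literal words with a tag could still land near it in the vector space. The selected tags, not the full prompt, would then be what the music service receives, which is consistent with the coarse control described in this article.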

Control via text prompts is therefore less fine-grained than with common image generators. It appears to be an alternative front end to the generation interface that Mubert already offers on its website. The following video shows some demos.

From image to prompt to music

Mubert's AI sound service becomes multimedia when combined with images. Twitter user Sylvain Filoni has developed a Hugging Face application for this purpose: It uses CLIP Interrogator to extract a prompt from an image, and the Mubert API then turns that prompt into a short piece of music. A successful example sounds like this.

Music from the 1818 Wanderer above the Sea of Fog by Caspar David Friedrich ✨#mubert #ImageToMusic pic.twitter.com/IE2Xxk6wfI

- Sylvain Filoni (@fffiloni) October 28, 2022

Unfortunately, the generated sound does not always match the image. The following clip, which I created from the cover image of this article, sounds melancholic rather than cheerful and colorful.

The difficulty level is admittedly high: the robot is holding a tuba, so you probably expect to hear a tuba. However, CLIP Interrogator only identifies a "musical instrument". Nevertheless, words like "funk art", "bubbly" or "joyous trumpets" appear in the prompt, which could well have been translated into music. A second attempt with the same image produces an entirely different result, which doesn't match the subject either, at least in my opinion. This is where the Mubert API reaches its limits.

Nevertheless, it is an interesting experiment and an indication of things to come. It was only at the beginning of October that Meta introduced "AudioGen", an AI system that can generate new audio signals from scratch to match a text input. The system is not yet designed for music, but that might only be a matter of time.
