Multimodal prompting is key for better generative AI

European AI company Aleph Alpha introduces an innovation for generative image models: multimodal prompts. Thanks to the new technique, AI models can be guided more precisely.

Generative AI models like OpenAIs DALL-E 2, Midjourney, or Stable Diffusion process text to generate original images. In contrast, the M-VADER diffusion model developed by Aleph Alpha together with TU Darmstadt can fuse multimodal inputs such as a photo or a sketch, and a textual description into a new image idea.

At the heart of the M-Vader architecture is the S-MAGMA multimodal decoder with 13 billion parameters. It combines the MAGMA image-language model with a Luminous 13B model fine-tuned for semantic search. Both pre-trained models originate from Aleph Alpha. The output of S-MAGMA guides the image generation process with a Stable Diffusion version fine-tuned for multimodal input.

Image: Aleph Alpha / TU Darmstadt - input images blurred due to copyright

According to the team, M-Vader can generate images based on multimodal context, create a new image from two images, or create variations of an image. The following graphic shows some examples of merging multiple images together with text instructions to create a new image.

Until now, AI imaging models mainly process text to generate original images. In contrast, the M-VADER diffusion model developed by Aleph Alpha together with TU Darmstadt can fuse multimodal inputs such as a photo, a sketch, and a textual description into a new image idea.

From the paper

Example of image variation with M-Vader. | Image: Aleph Alpha / TU Darmstadt

Multimodal prompts become part of Aleph Alpha's Luminous models

The researchers see their work on multimodal prompts as a contribution to the development of AI models that can better capture user intentions. There are more image examples in the paper.

With Luminous and Magma, Aleph Alpha already offers two AI base models in different scales for text and image captioning generation. According to Jonas Andrulis, CEO of Aleph Alpha, the multimodal image generation technology now unveiled is a world first. It is soon to become part of the Luminous offering.

"Our knowledge is not just text but multimodal and AI needs to be able to understand language and images together," Andrulis writes.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Multimodal prompting is key for better generative AI

Multimodal prompts become part of Aleph Alpha's Luminous models

Alibaba's new GPT-4o competitor Qwen VLo is no longer open source

Studio Ghibli founder Hayao Miyazaki's viral AI criticism lacks crucial context

Google adds native image generation to Gemini language models

OpenAI launches new ChatGPT agent that automates complex tasks for Pro, Plus, and Team

Kimi-K2 is the next open-weight AI milestone from China after Deepseek

New Energy-Based Transformer architecture aims to bring better "System 2 thinking" to AI models

Multimodal prompting is key for better generative AI

Multimodal prompts become part of Aleph Alpha's Luminous models

Alibaba's new GPT-4o competitor Qwen VLo is no longer open source

Studio Ghibli founder Hayao Miyazaki's viral AI criticism lacks crucial context

Google adds native image generation to Gemini language models