European AI company Aleph Alpha introduces an innovation for generative image models: multimodal prompts. With the new technique, image generation can be guided more precisely.
Generative AI models like OpenAI's DALL-E 2, Midjourney, or Stable Diffusion process text to generate original images. In contrast, the M-VADER diffusion model, developed by Aleph Alpha together with TU Darmstadt, can fuse multimodal inputs such as a photo or a sketch and a textual description into a new image idea.
At the heart of the M-VADER architecture is the S-MAGMA multimodal decoder with 13 billion parameters. It combines the MAGMA image-language model with a Luminous 13B model fine-tuned for semantic search. Both pre-trained models originate from Aleph Alpha. The output of S-MAGMA guides the image generation process of a Stable Diffusion version fine-tuned for multimodal input.
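Neither S-MAGMA nor the fine-tuned Stable Diffusion checkpoint is publicly available, but the wiring can be sketched conceptually: a multimodal encoder fuses images and text into a single embedding sequence, which then replaces the usual CLIP text conditioning of Stable Diffusion. The sketch below uses the standard Hugging Face diffusers API; DummyMultimodalEncoder is a hypothetical stand-in for S-MAGMA, and the embedding shape simply mirrors what a stock Stable Diffusion v1.5 U-Net cross-attends to.

```python
# Conceptual sketch, not Aleph Alpha's code: S-MAGMA is not public, so
# DummyMultimodalEncoder is a hypothetical stand-in that produces
# conditioning embeddings of the shape Stable Diffusion v1.5 expects
# from its CLIP text encoder.
import torch
from diffusers import StableDiffusionPipeline
from PIL import Image


class DummyMultimodalEncoder:
    """Placeholder for S-MAGMA: fuses images and text into one
    embedding sequence the diffusion U-Net can cross-attend to."""

    def __call__(self, images: list[Image.Image], text: str) -> torch.Tensor:
        # A real multimodal decoder would process the interleaved
        # image/text prompt; here we only return a correctly shaped tensor:
        # (batch, sequence_length, hidden_dim) = (1, 77, 768) for SD v1.5.
        return torch.randn(1, 77, 768)


encoder = DummyMultimodalEncoder()
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Fuse a photo, a sketch, and a text instruction into one conditioning signal.
cond = encoder(
    images=[Image.open("photo.png"), Image.open("sketch.png")],
    text="combine the photo's subject with the style of the sketch",
)

# Standard diffusers API: prompt_embeds bypasses the built-in text encoder,
# which is where a fused multimodal embedding would plug in.
image = pipe(prompt_embeds=cond).images[0]
image.save("result.png")
```

In the actual system, according to the paper, Stable Diffusion is additionally fine-tuned to follow this richer multimodal conditioning rather than text embeddings alone.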
According to the team, M-VADER can generate images based on multimodal context, create a new image from two images, or create variations of an image. The following graphic shows some examples of merging multiple images with text instructions to create a new image.
Image: from the paper
Multimodal prompts become part of Aleph Alpha's Luminous models
The researchers see their work on multimodal prompts as a contribution to the development of AI models that can better capture user intentions. There are more image examples in the paper.
With Luminous and MAGMA, Aleph Alpha already offers two AI base models in several sizes for text generation and image captioning. According to Jonas Andrulis, CEO of Aleph Alpha, the multimodal image generation technology now unveiled is a world first. It is soon to become part of the Luminous offering.
"Our knowledge is not just text but multimodal and AI needs to be able to understand language and images together," Andrulis writes.