summary Summary

Systems like ChatGPT or Midjourney are mostly text or image experts. What happens when you combine these skills? Researchers at Microsoft are testing this with Kosmos-1, a model that combines image and text.

Multimodal AI models could develop a better understanding of the world by learning from different data sources, some researchers believe. They can combine knowledge from different modalities to solve tasks such as describing images in natural language.

"Being a basic part of intelligence, multimodal perception is a necessity to achieve artificial general intelligence, in terms of knowledge acquisition and grounding to the real world," writes the Microsoft research team. It presents Kosmos-1, a multimodal large language model (MLLM).

In addition to language and multimodal perception, a possible artificial general intelligence (AGI) would also need the ability to model the world and act, they say.


Kosmos-1 knows about language and images

Microsoft trained Kosmos-1 with partially related image and language data, such as word-image pairs. In addition, the team used large amounts of Internet text, as is common with large language models.

As a result, the model can understand images and text, including describing images in natural language, recognizing text on images, writing captions for images, and answering questions about images. Kosmos-1 can perform these tasks on direct request or, similar to ChatGPT, in a dialog situation.

Kosmos-1 can answer questions about images, either about the content on the image or by processing information in words and numbers on the image. | Image: Microsoft

Because it has the same text capabilities as large language models, it can also use methods such as chain-of-thought prompting to produce better results.

Lässt man die KI Schritt für Schritt zu einer Lösung kommen, ist diese mit höherer Wahrscheinlichkeit korrekt.
If you let the AI arrive at a solution step by step, it is more likely to be correct. | Image: Microsoft

On a visual IQ test, KOSMOS-1 performed about five to nine percent better than chance, which the research team believes shows that KOSMOS-1 can perceive abstract conceptual patterns in a nonverbal context by combining nonverbal reasoning with the perception of linguistic patterns. However, there is still a large performance gap with the average adult level.

The Raven IQ test still overwhelms KOSMOS-1, but the model is better than chance. | Image: Microsoft

The ability of multimodal models to represent implicit connections between different concepts has already been demonstrated in the OpenAI study of CLIP neurons.


Multimodal AI as a possible next step in AI development

Microsoft's approach is not new; the German company Aleph Alpha presented MAGMA, a combined image language model, and M-Vader, a method for multimodal prompting. Google presented the "Future of Google Search" in the spring of 2021 with MUM, which enables multimodal queries and provides more contextual knowledge.

Deepmind's Flamingo, which also combines language and image processing, goes in a similar direction. The Microsoft research team also used Flamingo to benchmark Kosmos-1's performance in tests such as image captioning and answering questions about image content. The Microsoft model performed as well as, and in some cases slightly better than, Kosmos-1.

The researchers also trained a language model (LLM) on the same text data as Kosmos-1 and had the two models compete in language-only tasks.

Here the two models tied in most cases, with Kosmos-1 performing significantly better on visual reasoning tasks that require an understanding of the properties of everyday real-world objects such as color, size, and shape, the team says.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

The reason for KOSMOS-1’s superior performance is that it has modality transferability, which enables the model to transfer visual knowledge to language tasks. On the contrary, LLM has to rely on textual knowledge and clues to answer visual commonsense questions, which limits its ability to reason about object properties.

From the paper

Multimodal large language models combine the best of both worlds, the researchers write: In-context learning and the ability to follow natural language instructions from large language models, and "perception is aligned with language models by training on multimodal corpora."

The results of Kosmos-1 are "promising" across a wide range of language and multimodal tasks, they say. Multimodal models would offer new capabilities and opportunities compared to large language models.

Kosmos-1 has 1.6 billion parameters, which is tiny compared to today's large language models. The team would like to scale up Kosmos-1 to include more modalities, such as speech, in model training. A larger model with more modalities could then overcome many of the current limitations, the researchers write.

Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
  • Kosmos-1 from Microsoft has been trained on text, images, and text-image pairs.
  • It understands images and text and can answer questions about an image, for example.
  • Researchers see these so-called multimodal models as a step toward AGI. They want to scale the model and train it on other types of data, such as speech.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.