Visual ChatGPT: Chatbot can now process images

There is little doubt that the future of AI models is multimodal. However, this does not necessarily require training large new models. Instead, existing solutions can be linked together.

Microsoft researchers have added an important capability to OpenAI's chatbot ChatGPT, which was released in November 2022: image processing. Until now, the language model could only handle text, but Visual ChatGPT can send and receive images as well as text.

According to the researchers, a multimodal conversation model could be trained for this purpose, but this would require a large amount of data and computing resources. In addition, this approach is not very flexible, and the model cannot be extended to other modalities, such as audio or video, without new training.

Linking ChatGPT to 22 image models

Instead of training a new model, the researchers linked ChatGPT to 22 different Visual Foundation Models (VFMs), including Stable Diffusion. These models perform different tasks, such as answering questions about images, generating and editing images, or extracting information such as depth data.

The team bridges the gap between ChatGPT and the VFMs with a Prompt Manager that performs the following tasks:

Explicitly tells ChatGPT the capabilities of each VFM and specifies the input and output formats.
Converts different kinds of visual information, such as PNGs or images with depth information, into a language format that ChatGPT can understand.
Handles the histories, priorities, and conflicts of the different VFMs.
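
The Prompt Manager's tasks can be illustrated with a minimal sketch. This is not the researchers' implementation: the names `VisualTool`, `PromptManager`, and `caption_image` are hypothetical, the "model" is a stub, and only the history part of the third task is covered (priorities and conflict handling are left out).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VisualTool:
    """Hypothetical wrapper around one Visual Foundation Model (VFM)."""
    name: str
    description: str    # capability, stated in plain language for the chatbot
    input_format: str   # e.g. "image file name"
    output_format: str  # e.g. "text" or "image file name"
    run: Callable[[str], str]


class PromptManager:
    """Toy prompt manager; an assumption-laden sketch, not the paper's code."""

    def __init__(self) -> None:
        self.tools: Dict[str, VisualTool] = {}
        self.history: List[str] = []

    def register(self, tool: VisualTool) -> None:
        self.tools[tool.name] = tool

    def system_prompt(self) -> str:
        # Task 1: tell the chatbot what each tool does and its input/output formats.
        lines = ["You can call these visual tools:"]
        for tool in self.tools.values():
            lines.append(
                f"- {tool.name}: {tool.description} "
                f"Input: {tool.input_format}. Output: {tool.output_format}."
            )
        return "\n".join(lines)

    def call(self, tool_name: str, tool_input: str) -> str:
        # Task 2: images are passed around as file names, i.e. as plain text.
        result = self.tools[tool_name].run(tool_input)
        # Task 3 (history only): keep a text log that later turns can refer to.
        self.history.append(f"{tool_name}({tool_input}) -> {result}")
        return result


manager = PromptManager()
manager.register(VisualTool(
    name="caption_image",
    description="Describes the content of an image.",
    input_format="image file name",
    output_format="text",
    run=lambda image_name: f"stub caption for {image_name}",  # stand-in for a real model
))
print(manager.system_prompt())
print(manager.call("caption_image", "cat.png"))
```

In a real system, the text returned by `system_prompt()` would be sent to the language model together with the conversation, and the model's reply would decide which tool to call.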

Visual ChatGPT can generate images, name them correctly, save them, and keep them available for further input. It can also process images supplied by users.

If it is not clear to the conversation model which VFM is best suited to a task, Visual ChatGPT asks the user. It can also chain multiple VFMs to solve a task.
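
How several models could be connected through named image files can be sketched as follows. The naming scheme, the function names `new_image_name` and `run_chain`, and the stub models are all assumptions made for illustration. The idea is only that each step writes a new file whose name records the operation and its source, so the next step, or the chatbot, can refer to it as text.

```python
import uuid
from typing import Callable, List, Tuple

# A step is an operation label plus a model that reads one file and writes another.
Step = Tuple[str, Callable[[str, str], None]]


def new_image_name(source_name: str, operation: str) -> str:
    """Build a unique file name that records the operation and the source image."""
    stem = source_name.rsplit(".", 1)[0]
    return f"{uuid.uuid4().hex[:8]}_{operation}_{stem}.png"


def run_chain(image_name: str, steps: List[Step]) -> List[str]:
    """Run each model on the previous step's output and return all file names."""
    names = [image_name]
    for operation, model in steps:
        output_name = new_image_name(names[-1], operation)
        model(names[-1], output_name)  # a real model would read and write files here
        names.append(output_name)
    return names


def stub_model(source: str, target: str) -> None:
    # Stand-in for a real VFM; it only reports what it would do.
    print(f"would read {source} and write {target}")


chain = run_chain("cat.png", [("depth", stub_model), ("edit", stub_model)])
print(chain)
```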

Although the examples shown by Microsoft with Visual ChatGPT are promising, there are still some limitations. Visual ChatGPT is, of course, completely dependent on ChatGPT and the linked image models.

The maximum number of tokens that ChatGPT can process is also a limiting factor. In addition, a significant amount of prompt engineering is required to express the capabilities of the VFMs in language.

Previous developments laid important foundations

With Visual ChatGPT, Microsoft is integrating several existing methods that give users more control over image models through additional models or prompt engineering. There have been several advances in this area in recent months, such as InstructPix2Pix, ControlNet, and GLIGEN.

The researchers have published their source code on GitHub. A demo is also available on Hugging Face but requires a separate API key from OpenAI.