summary Summary

DetGPT gives a preview of the AI applications that will be possible with multimodal models in the future - and not just the good ones.


At the GPT-4 launch, OpenAI demonstrated some multimodal capabilities, including converting a photographed and scribbled web design into code or the ability to answer questions about images, which is already available through the "Be My Eyes" application. Although these capabilities are not yet widely available, open-source models such as miniGPT-4 have provided a first glimpse of them.

Researchers at the Hong Kong University of Science and Technology and the University of Hong Kong now demonstrate a more powerful miniGPT-4 alternative in DetGPT. The multimodal AI model combines the BLIP-2 visual encoder with a 13-billion-parameter Robin or Vicuna family language model and a pre-trained detector (grounding-DINO).

DetGPT combines language model capabilities with image recognition

DetGPT focuses on implementing a "take a photo and ask" capability that allows users to request specific information or actions related to objects in an image. The model goes beyond existing alternatives such as miniGPT-4 by enabling accurate localization of objects and understanding of complex human commands through a large language model.


In one example, the team shows how DetGPT, in response to the statement "I want a cold drink" in a photo of a kitchen where no drink can be found, identifies the refrigerator as the best option. In response to "I want to wake up in the morning," the system marks the alarm clock on a cluttered table, and in response to "Which fruits help with high blood pressure?" the system marks fruits on a market stall that might help with high blood pressure.

For training, the team created a dataset of 30,000 examples from 5,000 images in the COCO dataset and ChatGPT-generated examples of image instructions.

DetGPT for surveillance?

Models such as DetGPT, miniGPT-4, a multimodal GPT-4, or robotics-specific models such as PaLM-E show the potential of additional modalities in large AI models. How far their use will extend is still unclear, but like language models, they are likely to change many jobs - and possibly simplify government surveillance.

A Reddit post in which one of the researchers involved presents the work includes the following passage:

Moreover, the examples of object categories provided by humans are not always comprehensive. For instance, if monitoring is required to detect behaviors that violate public order relative to public places, humans may only be able to provide a few simple scenarios, such as holding a knife or smoking. However, if the question "detect behaviors that violate public order" is directly posed to the detection model, the model can think and reason based on its own knowledge, thus capturing more unacceptable behaviors and generalizing to more relevant categories that need to be detected.

This would first require a language model that is fine-tuned to the particular society and its norms. The result would then be an AI system that could recognize undesirable behavior much more dynamically, no longer recognizing its occurrence according to a statically labeled pattern, but also inferring it from the occurrence of accompanying factors.


The code and dataset are available on GitHub, as is a demo. The model is not yet available.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
  • Combining computer vision and speech processing, DetGPT can answer questions about images and point directly to objects.
  • For example, DetGPT can recognize fruits that can lower blood pressure or give gamers tips in Zelda.
  • But such multimodal models could also be used in government surveillance, as one example suggests.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.