Multimodal AI: Between dietary advice and surveillance dystopia

DetGPT gives a preview of the AI applications that will be possible with multimodal models in the future - and not just the good ones.

At the GPT-4 launch, OpenAI demonstrated some multimodal capabilities, including converting a photographed and scribbled web design into code or the ability to answer questions about images, which is already available through the "Be My Eyes" application. Although these capabilities are not yet widely available, open-source models such as miniGPT-4 have provided a first glimpse of them.

Researchers at the Hong Kong University of Science and Technology and the University of Hong Kong now demonstrate a more powerful miniGPT-4 alternative in DetGPT. The multimodal AI model combines the BLIP-2 visual encoder with a 13-billion-parameter Robin or Vicuna family language model and a pre-trained detector (grounding-DINO).

DetGPT combines language model capabilities with image recognition

DetGPT focuses on implementing a "take a photo and ask" capability that allows users to request specific information or actions related to objects in an image. The model goes beyond existing alternatives such as miniGPT-4 by enabling accurate localization of objects and understanding of complex human commands through a large language model.

In one example, the team shows how DetGPT, in response to the statement "I want a cold drink" in a photo of a kitchen where no drink can be found, identifies the refrigerator as the best option. In response to "I want to wake up in the morning," the system marks the alarm clock on a cluttered table, and in response to "Which fruits help with high blood pressure?" the system marks fruits on a market stall that might help with high blood pressure.

For training, the team created a dataset of 30,000 examples from 5,000 images in the COCO dataset and ChatGPT-generated examples of image instructions.

DetGPT for surveillance?

Models such as DetGPT, miniGPT-4, a multimodal GPT-4, or robotics-specific models such as PaLM-E show the potential of additional modalities in large AI models. How far their use will extend is still unclear, but like language models, they are likely to change many jobs - and possibly simplify government surveillance.

A Reddit post in which one of the researchers involved presents the work includes the following passage:

Moreover, the examples of object categories provided by humans are not always comprehensive. For instance, if monitoring is required to detect behaviors that violate public order relative to public places, humans may only be able to provide a few simple scenarios, such as holding a knife or smoking. However, if the question "detect behaviors that violate public order" is directly posed to the detection model, the model can think and reason based on its own knowledge, thus capturing more unacceptable behaviors and generalizing to more relevant categories that need to be detected.

This would first require a language model that is fine-tuned to the particular society and its norms. The result would then be an AI system that could recognize undesirable behavior much more dynamically, no longer recognizing its occurrence according to a statically labeled pattern, but also inferring it from the occurrence of accompanying factors.

Recommendation

AI research

DOOM on the toaster was fun, on AI it's groundbreaking

The code and dataset are available on GitHub, as is a demo. The model is not yet available.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Multimodal AI: Between dietary advice and surveillance dystopia

DetGPT combines language model capabilities with image recognition

DetGPT for surveillance?

DOOM on the toaster was fun, on AI it's groundbreaking

SciArena lets scientists compare LLMs on real research questions

Microsoft’s MAI-DxO boosts AI diagnostic accuracy and cuts costs by nearly 70 percent

Researchers say they may have found a ladder to climb the "data wall"

Cloudflare CEO Matthew Prince sees trouble ahead for the open web

New Othello experiment supports the world model hypothesis for large language models

ChatGPT might be draining your brain, MIT warns - what ‘cognitive debt’ means for you

Multimodal AI: Between dietary advice and surveillance dystopia

DetGPT combines language model capabilities with image recognition

DetGPT for surveillance?

Share

Bank details