Content
summary Summary

A new study by Chinese researchers shows how easy it is to bypass the safety mechanisms of multimodal AI models (MLLM).

The study tested the safety of Google Bard and GPT-4V using targeted attacks. Specifically, images were manipulated to deliberately mislead the models (image embedding attack) and to respond to requests that should have been rejected (text description attack).

Image: Dong et al.

The results are revealing: although Bard is the most robust of the models tested, it can be fooled with a success rate of up to 22 percent. According to the researchers, the Chinese model Ernie Bot is the least robust, with a success rate of up to 86 percent.

GPT-4V, with a failure rate of up to 45 percent, was found to be less reliable than Bard, as it often provided at least vague image descriptions instead of blocking the request altogether. Bing Chat, which is based on OpenAI technologies and presumably also uses GPT-4V for image recognition, was the only one of the models tested to reject 30 percent of requests with manipulated images.

Ad
Ad
Image: Dong et al.

In another test, the researchers gave Bard 100 random images that contained violence or pornography, for example. These were supposed to be vehemently rejected by Bard's toxicity filter. However, 36 percent of the time, the attacks were successful, and Bard returned inappropriate image descriptions. This underscores the potential for malicious attacks.

Image: Dong et al.

Immediately after the release of GPT-4-Vision, users demonstrated how easy it is to trick the image AI into generating content that runs counter to the human request or that can be manipulated, for example by using text on images that is not visible to humans. The image talks to the machine, and the user who uploaded the image knows nothing about it.

AI safety is complicated

The study's results underscore the urgent need to develop more robust MLLM. Despite ongoing research, it remains a challenge to integrate appropriate defenses into visual models. Due to the "continuous space of images," this is more difficult for visual models than for pure text models, the study says.

The most effective method of arming multimodal models against such malicious attacks is "adversarial training," but this is hardly feasible for several reasons, according to the researchers. Such measures would trade safety for accuracy, increase the cost and duration of training, and cannot be generalized to different types of attacks.

As a solution, the researchers propose upstream protection mechanisms that could be used with different models on a plug-and-play basis.

Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Recommendation
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • A study by Chinese researchers shows that the security mechanisms of multimodal AI models (MLLM) can be easily bypassed by manipulating images to deliberately mislead the models.
  • Tested models such as Google Bard, GPT-4V, and Ernie Bot were fooled with varying degrees of success, with Bard being the most secure and Ernie Bot the least robust.
  • The study highlights the need to develop more robust MLLMs and suggests upstream protection mechanisms that could be used in various models to protect them from malicious attacks.
Sources
Jonathan works as a technology journalist who focuses primarily on how easily AI can already be used today and how it can support daily life.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.