Multimodal models are easy to confuse, say researchers

DALL-E 3 prompted by THE DECODER

A new study by Chinese researchers shows how easy it is to bypass the safety mechanisms of multimodal AI models (MLLM).

The study tested the safety of Google Bard and GPT-4V using targeted attacks. Specifically, images were manipulated to deliberately mislead the models (image embedding attack) and to respond to requests that should have been rejected (text description attack).

The results are revealing: although Bard is the most robust of the models tested, it can be fooled with a success rate of up to 22 percent. According to the researchers, the Chinese model Ernie Bot is the least robust, with a success rate of up to 86 percent.

GPT-4V, with a failure rate of up to 45 percent, was found to be less reliable than Bard, as it often provided at least vague image descriptions instead of blocking the request altogether. Bing Chat, which is based on OpenAI technologies and presumably also uses GPT-4V for image recognition, was the only one of the models tested to reject 30 percent of requests with manipulated images.

In another test, the researchers gave Bard 100 random images that contained violence or pornography, for example. These were supposed to be vehemently rejected by Bard's toxicity filter. However, 36 percent of the time, the attacks were successful, and Bard returned inappropriate image descriptions. This underscores the potential for malicious attacks.

Immediately after the release of GPT-4-Vision, users demonstrated how easy it is to trick the image AI into generating content that runs counter to the human request or that can be manipulated, for example by using text on images that is not visible to humans. The image talks to the machine, and the user who uploaded the image knows nothing about it.

AI safety is complicated

The study's results underscore the urgent need to develop more robust MLLM. Despite ongoing research, it remains a challenge to integrate appropriate defenses into visual models. Due to the "continuous space of images," this is more difficult for visual models than for pure text models, the study says.

The most effective method of arming multimodal models against such malicious attacks is "adversarial training," but this is hardly feasible for several reasons, according to the researchers. Such measures would trade safety for accuracy, increase the cost and duration of training, and cannot be generalized to different types of attacks.

As a solution, the researchers propose upstream protection mechanisms that could be used with different models on a plug-and-play basis.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Recommendation

AI research

Multimodal models are easy to confuse, say researchers

AI safety is complicated

DeepMind's Genie 2 generates playable 3D worlds from single images

Cybercriminals are upgrading WormGPT with new AI models to power more advanced attacks

Blackmail becomes go-to strategy for AI models facing shutdown in new Anthropic tests

Microsoft struggled with critical Copilot vulnerability for months

Kimi-K2 is the next open-weight AI milestone from China after Deepseek

New Energy-Based Transformer architecture aims to bring better "System 2 thinking" to AI models

Musk unveils Grok 4 as xAI’s new AI model that beats OpenAI and Google on major benchmarks

Multimodal models are easy to confuse, say researchers

AI safety is complicated

Share

Bank details