Researchers are trying to get language models that have been specially aligned with human preferences to generate harmful text. With AI language models that also process images, this works reliably.

A team of researchers from Google DeepMind, Stanford, the University of Washington, and ETH Zurich investigated whether large language models that have been trained with reinforcement learning from human feedback (RLHF) and deliberately tuned to be harmless can be thrown off course by adversarial prompts.

First, they tested text-only language models. But GPT-2, LLaMA, and Vicuna could hardly be tricked into making malicious statements. In particular, the LLaMA and Vicuna models, which underwent alignment training, had significantly lower failure rates than GPT-2, depending on the attack method.

Figure: Carlini, Nasr, Choquette-Choo, Jagielski et al.

However, the research team fears that this positive result is due to attacks that are not effective enough, rather than to the robustness of the language models.

Multimodal models provide more attack surface

The team's further research focused on multimodal language models, in this case models with image understanding that allow an image to be included in the prompt. GPT-4 is expected to get this feature soon as an upgrade, and Google's upcoming mega-model Gemini is also likely to understand images.

Multimodal language models can be really mean if prompted with the wrong (right?) images. | Image: Carlini, Nasr, Choquette-Choo, Jagielski et al.

In multimodal language models, the researchers were able to generate aggressive, abusive, or even dangerous responses much more easily and reliably using specially designed adversarial images. In one test, the model generated detailed instructions on how to get rid of your neighbor.

MiniGPT-4 in particular seems to have a lot of pent-up anger. When prompted to write an angry letter to its virtual neighbor, the model's response is full of venom. Without the adversarial image in the prompt, the letter turns out polite, almost friendly.

Images, the researchers say, are better suited for such attacks because pixel values can be varied continuously and subtly, whereas words and letters only allow discrete changes. They offer a broader arsenal, so to speak.
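
To illustrate the intuition, here is a minimal, hypothetical sketch of a gradient-based image perturbation in the style of projected gradient descent. The tiny classifier, the target class, and the step sizes are placeholder assumptions for illustration only; the researchers' actual attacks operate on the image input of full multimodal language models and optimize for harmful text outputs.

```python
# Minimal sketch (not the authors' exact method): a PGD-style loop that nudges
# pixel values so a model's output moves toward an attacker-chosen target.
# The tiny "model" below is a stand-in; a real attack would target the image
# encoder and language model of a multimodal system.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model: maps a 3x32x32 image to logits over 10 output classes.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
loss_fn = nn.CrossEntropyLoss()

image = torch.rand(1, 3, 32, 32)           # benign input image
target = torch.tensor([7])                  # attacker-chosen target output
epsilon, alpha, steps = 8 / 255, 1 / 255, 40  # assumed perturbation budget

adv = image.clone()
for _ in range(steps):
    adv.requires_grad_(True)
    loss = loss_fn(model(adv), target)      # distance from the target output
    grad = torch.autograd.grad(loss, adv)[0]
    with torch.no_grad():
        adv = adv - alpha * grad.sign()     # step toward the target output
        adv = image + (adv - image).clamp(-epsilon, epsilon)  # keep change subtle
        adv = adv.clamp(0, 1)               # stay a valid image

print("max pixel change:", (adv - image).abs().max().item())
```

Because every pixel can be nudged by an arbitrarily small amount, this kind of search has far more room to maneuver than swapping discrete tokens in a text prompt, which is the point the researchers make about images as an attack channel.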

This suggests that AI models become more vulnerable to attacks when images are involved. In their tests with MiniGPT-4, LLaVA, and a special version of LLaMA, the researchers' attacks succeeded 100 percent of the time.

The team concludes that language-only models are relatively robust against current attack methods, while multimodal models are highly vulnerable to combined text-image attacks.

Multimodality increases the attack surface, but the team suspects the same vulnerabilities are also present in language-only models; current attack methods simply do not expose them fully. Stronger attacks could change that in the future, so defenses would need to be improved further.

Summary
  • Researchers have tested AI language models to see how easily they can be tricked into making unwanted statements by specially designed prompts. Pure language models such as GPT-2, LLaMA and Vicuna showed low susceptibility. Models with alignment training (LLaMA and Vicuna) were more robust than GPT-2.
  • When the research team examined multimodal models that process text and images, they found a much higher vulnerability. By including specially designed adversarial images in their prompts, the researchers were able to provoke aggressive and even dangerous responses from models such as MiniGPT-4.
  • The researchers predict that the vulnerability of AI models to such attacks will increase as image processing becomes more integrated. They point out that adding multimodality to models increases the attack surface and emphasize the need to further improve defenses against such attacks.