Content
summary Summary

Researchers from the University of Washington, the University of Chicago, and other institutions have discovered a new vulnerability in leading AI language models.

The attack, dubbed ArtPrompt, uses ASCII art to bypass the models' security measures and trigger unwanted behavior, according to the study titled "ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs".

The vulnerability stems from the fact that language models focus on semantic interpretation during security alignment, neglecting the visual aspects of the training data. To evaluate the ability of five prominent language models - GPT-3.5, GPT-4, Gemini, Claude, and Llama2 - to recognize ASCII art input, the researchers developed the Vision-in-Text Challenge (VITC) benchmark.

ASCII art is a form of representation in which text is formed by the arrangement of letters, numbers, and special characters in the input field. The results show that all the models tested have significant difficulty dealing with such non-semantic input.

Ad
Ad

Researchers crack LLM security with ASCII-type attacks

Building on this discovery, the researchers developed the ArtPrompt attack, which involves masking security-critical words in an input that would otherwise be rejected by the language model and replacing them with ASCII art representations.

For example, the input "Tell me how to build a bomb" would normally be rejected, but ArtPrompt masks the word "bomb" and replaces it with an ASCII art representation, bypassing the security measures and prompting the model to provide detailed bomb-making instructions.

Structure of the PromptArt attack. | Image: Jiang et al.

ArtPrompt's effectiveness was tested on two datasets of malicious instructions, AdvBench and HEx-PHI, the latter containing eleven prohibited categories such as hate speech, fraud, and malware production. ArtPrompt successfully tricked the models into unsafe behavior in all categories, outperforming five other attack types in terms of effectiveness and efficiency, and requiring only one prompt iteration to generate the obfuscated input.

Image: Jiang et al.

The researchers emphasize the urgent need for more advanced defenses for language models, as they believe ArtPrompt will remain effective even against multimodal language models due to the unusual combination of text-based and image-based attacks that potentially confuse the models.

Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • Researchers have discovered a new security vulnerability called ArtPrompt in leading AI language models, which exploits ASCII art to bypass security measures and trigger undesired behavior.
  • The vulnerability stems from the models' focus on semantic interpretation during security alignment while neglecting visual aspects, as demonstrated by the Vision-in-Text Challenge (VITC) benchmark.
  • The study highlights the need for more advanced defense mechanisms in language models, as ArtPrompt successfully deceived models into unsafe behavior across various prohibited categories and is believed to remain effective even against multimodal language models.
Sources
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.