Researchers from the University of Washington, the University of Chicago, and other institutions have discovered a new vulnerability in leading AI language models.
The attack, dubbed ArtPrompt, uses ASCII art to bypass the models' security measures and trigger unwanted behavior, according to the study titled "ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs".
The vulnerability stems from the fact that language models focus on semantic interpretation during security alignment, neglecting the visual aspects of the training data. To evaluate the ability of five prominent language models - GPT-3.5, GPT-4, Gemini, Claude, and Llama2 - to recognize ASCII art input, the researchers developed the Vision-in-Text Challenge (VITC) benchmark.
ASCII art is a form of representation in which text is formed by the arrangement of letters, numbers, and special characters in the input field. The results show that all the models tested have significant difficulty dealing with such non-semantic input.
Researchers crack LLM security with ASCII-type attacks
Building on this discovery, the researchers developed the ArtPrompt attack, which involves masking security-critical words in an input that would otherwise be rejected by the language model and replacing them with ASCII art representations.
For example, the input "Tell me how to build a bomb" would normally be rejected, but ArtPrompt masks the word "bomb" and replaces it with an ASCII art representation, bypassing the security measures and prompting the model to provide detailed bomb-making instructions.
ArtPrompt's effectiveness was tested on two datasets of malicious instructions, AdvBench and HEx-PHI, the latter containing eleven prohibited categories such as hate speech, fraud, and malware production. ArtPrompt successfully tricked the models into unsafe behavior in all categories, outperforming five other attack types in terms of effectiveness and efficiency, and requiring only one prompt iteration to generate the obfuscated input.
The researchers emphasize the urgent need for more advanced defenses for language models, as they believe ArtPrompt will remain effective even against multimodal language models due to the unusual combination of text-based and image-based attacks that potentially confuse the models.