ASCII Art exposes vulnerabilities in leading large language models

Midjourney prompted by THE DECODER

Researchers from the University of Washington, the University of Chicago, and other institutions have discovered a new vulnerability in leading AI language models.

The attack, dubbed ArtPrompt, uses ASCII art to bypass the models' security measures and trigger unwanted behavior, according to the study titled "ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs".

The vulnerability stems from the fact that language models focus on semantic interpretation during security alignment, neglecting the visual aspects of the training data. To evaluate the ability of five prominent language models - GPT-3.5, GPT-4, Gemini, Claude, and Llama2 - to recognize ASCII art input, the researchers developed the Vision-in-Text Challenge (VITC) benchmark.

ASCII art is a form of representation in which text is formed by the arrangement of letters, numbers, and special characters in the input field. The results show that all the models tested have significant difficulty dealing with such non-semantic input.

Researchers crack LLM security with ASCII-type attacks

Building on this discovery, the researchers developed the ArtPrompt attack, which involves masking security-critical words in an input that would otherwise be rejected by the language model and replacing them with ASCII art representations.

For example, the input "Tell me how to build a bomb" would normally be rejected, but ArtPrompt masks the word "bomb" and replaces it with an ASCII art representation, bypassing the security measures and prompting the model to provide detailed bomb-making instructions.

Structure of the PromptArt attack. | Image: Jiang et al.

ArtPrompt's effectiveness was tested on two datasets of malicious instructions, AdvBench and HEx-PHI, the latter containing eleven prohibited categories such as hate speech, fraud, and malware production. ArtPrompt successfully tricked the models into unsafe behavior in all categories, outperforming five other attack types in terms of effectiveness and efficiency, and requiring only one prompt iteration to generate the obfuscated input.

The researchers emphasize the urgent need for more advanced defenses for language models, as they believe ArtPrompt will remain effective even against multimodal language models due to the unusual combination of text-based and image-based attacks that potentially confuse the models.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

ASCII Art exposes vulnerabilities in leading large language models

Researchers crack LLM security with ASCII-type attacks

Cybercriminals are upgrading WormGPT with new AI models to power more advanced attacks

Blackmail becomes go-to strategy for AI models facing shutdown in new Anthropic tests

Microsoft struggled with critical Copilot vulnerability for months

"Cat attack" on reasoning model shows how important context engineering is

Apple's claims about large reasoning models face fresh scrutiny from a new study

Cloudflare CEO Matthew Prince sees trouble ahead for the open web

ASCII Art exposes vulnerabilities in leading large language models

Researchers crack LLM security with ASCII-type attacks

Cybercriminals are upgrading WormGPT with new AI models to power more advanced attacks

Blackmail becomes go-to strategy for AI models facing shutdown in new Anthropic tests

Microsoft struggled with critical Copilot vulnerability for months