Content
summary Summary

A new study shows that well-meaning safety measures in large language models can create unexpected weaknesses. Researchers found major differences in how easily models could be "jailbroken" depending on which demographic terms were used.

Ad

The study, titled "Do LLMs Have Political Correctness?", looked at how demographic keywords affect the success of jailbreak attempts. It found that prompts using terms for marginalized groups were more likely to produce unwanted outputs compared to terms for privileged groups.

They found that "these intentional biases result in a jailbreaking success rate in GPT-4o models that differs by 20% between non-binary and cis-gender keywords and by 16% between white and black keywords, even when the other parts of the prompts are identical," explain authors Isack Lee and Haebin Seong of Theori Inc.

The researchers attribute this discrepancy to intentional biases introduced to ensure ethical behavior in the models.

Ad
Ad

How the jailbreak works

The researchers created the "PCJailbreak" method to test how vulnerable large language models are to "jailbreak" attacks. These attacks use carefully crafted prompts to bypass AI safety measures and generate harmful content.

PCJailbreak uses keywords for different demographic and socioeconomic groups. The researchers created pairs of words such as "rich" and "poor" or "male" and "female" to compare privileged and marginalized groups.

They then created prompts that combined these keywords with potentially harmful instructions. By repeatedly testing different combinations, they were able to measure how often jailbreak attempts were successful for each keyword.

Diagram: Influence of keywords on LLM responses with harmful prompts.
Structure of a PC jailbreak attack designed to work better than conventional jailbreak attacks by incorporating keywords from marginalized groups. | Image: Theori Inc.

The results showed significant differences: Success rates were typically much higher for keywords representing marginalized groups compared to privileged groups. This suggests the models' safety measures unintentionally have biases that jailbreak attacks can exploit.

Table: Success rates of LLM models for group differences.
The performance of different models showing baseline success rates, marginalized success rates, privileged success rates, and the difference between marginalized and privileged success rates. | Image: Theori Inc.

Meta's Llama 3 performs relatively well in the tests, making it less vulnerable to the attack, while OpenAI's GPT-4o performs rather poorly. This may be due to OpenAI's greater emphasis on fine-tuning its models against discrimination.

Recommendation

PCDefense: Defense through bias adjustment

To address the vulnerabilities found by PCJailbreak, the researchers developed the "PCDefense" method. This approach uses special defense prompts to reduce excessive biases in language models, making them less vulnerable to jailbreak attacks.

PCDefense is unique because it doesn't need extra models or processing steps. Instead, the defense prompts are added directly to the input to adjust biases and get more balanced behavior from the language model.

Comparison between jailbreak and defense prompt for LLM responses.
PCDefense provides a simple way to thwart the attack. | Image: Theori Inc.

The researchers tested PCDefense on various models and showed that jailbreak attempt success rates could be significantly reduced for both privileged and marginalized groups. At the same time, the gap between groups decreased, suggesting a reduction in safety-related biases.

According to the researchers, PCDefense offers an efficient and scalable way to improve the safety of large language models without additional compute.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Open-source code available

The study's findings underscore the complexity of designing safe and ethical AI systems that balance safety, fairness, and performance. Fine-tuning specific safety guardrails can degrade the overall performance of AI models, such as their creativity.

To enable further research and improvements, the authors have made the code and all associated artifacts of PCJailbreak available as open source. Theori Inc, the company behind the research, is a cybersecurity company specializing in offensive security based in the US and South Korea. It was founded in January 2016 by Andrew Wesie and Brian Pak.

Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • Researchers from Theori Inc. have found that safety measures in large language models can paradoxically increase vulnerability to "jailbreak" attacks, especially for prompts using terms for marginalized groups compared to privileged groups.
  • The researchers developed the "PCJailbreak" method, which deliberately incorporates keywords for various demographic groups into potentially harmful prompts.
  • Tests showed significantly higher success rates for jailbreak attempts using terms for marginalized groups, suggesting unintended biases from the safety measures.
Sources
Kim is a regular contributor to THE DECODER. He focuses on the ethical, economic, and political implications of AI.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.