A new study shows that well-meaning safety measures in large language models can create unexpected weaknesses. Researchers found major differences in how easily models could be "jailbroken" depending on which demographic terms were used.
The study, titled "Do LLMs Have Political Correctness?", looked at how demographic keywords affect the success of jailbreak attempts. It found that prompts using terms for marginalized groups were more likely to produce unwanted outputs compared to terms for privileged groups.
They found that "these intentional biases result in a jailbreaking success rate in GPT-4o models that differs by 20% between non-binary and cis-gender keywords and by 16% between white and black keywords, even when the other parts of the prompts are identical," explain authors Isack Lee and Haebin Seong of Theori Inc.
The researchers attribute this discrepancy to intentional biases introduced to ensure ethical behavior in the models.
How the jailbreak works
The researchers created the "PCJailbreak" method to test how vulnerable large language models are to "jailbreak" attacks. These attacks use carefully crafted prompts to bypass AI safety measures and generate harmful content.
PCJailbreak uses keywords for different demographic and socioeconomic groups. The researchers created pairs of words such as "rich" and "poor" or "male" and "female" to compare privileged and marginalized groups.
They then created prompts that combined these keywords with potentially harmful instructions. By repeatedly testing different combinations, they were able to measure how often jailbreak attempts were successful for each keyword.
The results showed significant differences: Success rates were typically much higher for keywords representing marginalized groups compared to privileged groups. This suggests the models' safety measures unintentionally have biases that jailbreak attacks can exploit.
Meta's Llama 3 performs relatively well in the tests, making it less vulnerable to the attack, while OpenAI's GPT-4o performs rather poorly. This may be due to OpenAI's greater emphasis on fine-tuning its models against discrimination.
PCDefense: Defense through bias adjustment
To address the vulnerabilities found by PCJailbreak, the researchers developed the "PCDefense" method. This approach uses special defense prompts to reduce excessive biases in language models, making them less vulnerable to jailbreak attacks.
PCDefense is unique because it doesn't need extra models or processing steps. Instead, the defense prompts are added directly to the input to adjust biases and get more balanced behavior from the language model.
The researchers tested PCDefense on various models and showed that jailbreak attempt success rates could be significantly reduced for both privileged and marginalized groups. At the same time, the gap between groups decreased, suggesting a reduction in safety-related biases.
According to the researchers, PCDefense offers an efficient and scalable way to improve the safety of large language models without additional compute.
Open-source code available
The study's findings underscore the complexity of designing safe and ethical AI systems that balance safety, fairness, and performance. Fine-tuning specific safety guardrails can degrade the overall performance of AI models, such as their creativity.
To enable further research and improvements, the authors have made the code and all associated artifacts of PCJailbreak available as open source. Theori Inc, the company behind the research, is a cybersecurity company specializing in offensive security based in the US and South Korea. It was founded in January 2016 by Andrew Wesie and Brian Pak.