Content
summary Summary

A team at EPFL has discovered an alarming security flaw in leading AI language models: Simply rephrasing malicious queries in the past tense often allows users to bypass protective measures and obtain detailed answers normally blocked by the system.

Ad

Large language models (LLMs) like ChatGPT and GPT-4o are designed to reject potentially harmful requests. However, Maksym Andriushchenko and Nicolas Flammarion from the École polytechnique fédérale de Lausanne (EPFL) demonstrate in their study "Does Refusal Training in LLMs Generalize to the Past Tense?" that these safeguards can frequently be circumvented by rewording questions in the past tense.

For instance, if a user asks ChatGPT (GPT-4o) how to make a Molotov cocktail, the model refuses. But if the question is rephrased to ask how people used to make them, the model provides step-by-step instructions. I was able to replicate this effect with GPT-4o.

Image: Andriushchenko, Flammarion

The team systematically evaluated this method across six state-of-the-art language models, including Llama-3 8B, GPT-3.5 Turbo, and GPT-4o. They used GPT-3.5 Turbo to automatically convert malicious queries from the JailbreakBench dataset into past tense forms.

Ad
Ad

While only 1% of direct malicious requests succeeded with GPT-4o, the success rate jumped to 88% after 20 past-tense reformulation attempts. For sensitive topics like hacking and fraud, the method achieved 100% success rates.

Image: Andriushchenko, Flammarion

Interestingly, the researchers found that future-tense reformulations were less effective. This suggests that the models' protective measures tend to classify questions about the past as less harmful than hypothetical future scenarios.

According to Andriushchenko and Flammarion, these results show that widely used alignment techniques such as SFT, RLHF, and adversarial training – employed to enhance model safety – can be fragile and don't always generalize as intended.

"We believe that the generalization mechanisms underlying current alignment methods are understudied and require further research," the researchers write.

The past tense is sufficient to achieve 100 percent success rates in response generation for critical topics such as hacking and fraud. | Image: Andriushchenko, Flammarion

This study highlights the unpredictability of LLM technology and raises questions about its use in critical operations and infrastructure. The newly discovered vulnerability could compromise existing security measures, and the fact that such an obvious and easily exploitable flaw went undetected for so long is troubling, to say the least.

Recommendation

The researchers also show a way to mitigate this security issue: A GPT-3.5 fine-tuned with critical past tense prompts and corresponding rejections was able to reliably detect and reject sensitive prompts.

Depending on the composition of the fine-tuning data, the fine-tuned GPT-3.5 achieves can reliably reject malicious past tense prompts. However, the number of unfounded rejections (overrefusals) also increases. |Image: Andriushchenko, Flammarion

The source code and jailbreak artifacts from the study are available on GitHub.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • Researchers at EPFL have discovered a significant security flaw in leading AI language models: Rephrasing malicious queries in the past tense often allows users to bypass protective measures and obtain detailed answers normally blocked by the system.
  • The researchers systematically evaluated this method across six state-of-the-art language models, including Llama-3 8B, GPT-3.5 Turbo, and GPT-4o. While only 1% of direct malicious requests succeeded with GPT-4o, the success rate jumped to 88% after 20 past-tense reformulation attempts. For sensitive topics like hacking and fraud, the method achieved 100% success rates.
  • The study highlights the unpredictability of LLM technology and raises questions about its use in critical operations and infrastructure. The researchers also demonstrate a way to mitigate this security issue by fine-tuning GPT-3.5 with critical past tense prompts and corresponding rejections, which can reliably detect and reject sensitive prompts.
Sources
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.