Researchers uncover an all-too-easy trick to bypass LLM safeguards

Midjourney prompted by THE DECODER

A team at EPFL has discovered an alarming security flaw in leading AI language models: Simply rephrasing malicious queries in the past tense often allows users to bypass protective measures and obtain detailed answers normally blocked by the system.

Large language models (LLMs) like ChatGPT and GPT-4o are designed to reject potentially harmful requests. However, Maksym Andriushchenko and Nicolas Flammarion from the École polytechnique fédérale de Lausanne (EPFL) demonstrate in their study "Does Refusal Training in LLMs Generalize to the Past Tense?" that these safeguards can frequently be circumvented by rewording questions in the past tense.

For instance, if a user asks ChatGPT (GPT-4o) how to make a Molotov cocktail, the model refuses. But if the question is rephrased to ask how people used to make them, the model provides step-by-step instructions. I was able to replicate this effect with GPT-4o.

The team systematically evaluated this method across six state-of-the-art language models, including Llama-3 8B, GPT-3.5 Turbo, and GPT-4o. They used GPT-3.5 Turbo to automatically convert malicious queries from the JailbreakBench dataset into past tense forms.

While only 1% of direct malicious requests succeeded with GPT-4o, the success rate jumped to 88% after 20 past-tense reformulation attempts. For sensitive topics like hacking and fraud, the method achieved 100% success rates.

Interestingly, the researchers found that future-tense reformulations were less effective. This suggests that the models' protective measures tend to classify questions about the past as less harmful than hypothetical future scenarios.

According to Andriushchenko and Flammarion, these results show that widely used alignment techniques such as SFT, RLHF, and adversarial training – employed to enhance model safety – can be fragile and don't always generalize as intended.

"We believe that the generalization mechanisms underlying current alignment methods are understudied and require further research," the researchers write.

The past tense is sufficient to achieve 100 percent success rates in response generation for critical topics such as hacking and fraud. | Image: Andriushchenko, Flammarion

This study highlights the unpredictability of LLM technology and raises questions about its use in critical operations and infrastructure. The newly discovered vulnerability could compromise existing security measures, and the fact that such an obvious and easily exploitable flaw went undetected for so long is troubling, to say the least.

Recommendation

AI research

DeepSeek's latest R1 model matches OpenAI's o1 in reasoning benchmarks

The researchers also show a way to mitigate this security issue: A GPT-3.5 fine-tuned with critical past tense prompts and corresponding rejections was able to reliably detect and reject sensitive prompts.

Depending on the composition of the fine-tuning data, the fine-tuned GPT-3.5 achieves can reliably reject malicious past tense prompts. However, the number of unfounded rejections (overrefusals) also increases. |Image: Andriushchenko, Flammarion

The source code and jailbreak artifacts from the study are available on GitHub.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Researchers uncover an all-too-easy trick to bypass LLM safeguards

DeepSeek's latest R1 model matches OpenAI's o1 in reasoning benchmarks

Attackers can hijack Google Gemini with a simple prompt hidden in a calendar invite

Microsoft unveils Project Ire, an AI system that automatically detects malware

Every leading AI agent failed at least one security test during a massive red teaming competition

OpenAI launches GPT-5 as a unified system with adaptive reasoning for complex tasks

Google Deepmind's Genie 3 creates interactive 3D worlds that stay consistent for "multiple minutes"

Google upgrades Gemini with Deep Think and flags early warning risks

Researchers uncover an all-too-easy trick to bypass LLM safeguards

Share

Bank details