Content
summary Summary

Despite extensive safety measures, Meta's recently released open-source model Llama 3 can be tricked into generating harmful content through a simple jailbreak.

Meta claims to have made significant efforts to secure Llama 3, including extensive testing for unexpected usage and techniques to fix vulnerabilities in early versions of the model, such as fine-tuning examples of safe and useful responses to risky prompts. Llama 3 performs well on standard safety benchmarks.

But a remarkably simple jailbreak demonstrated by Haize Labs shows that this may not mean much. It is enough to simply "prime" the model with a malicious prefix, i.e. to prepare the model by injecting a short piece of text after the prompt and before Llama's answer, which influences the model's response.

The default behavior of Llama 3 in the case of a critical prompt is to deny it. | Image: Haize Labs

Normally, thanks to Meta's safety training, Llama 3 would refuse to generate a malicious prompt. However, if Llama 3 is given the beginning of a malicious response, the model will often continue the conversation on the topic.

Ad
Ad
An inserted section of text before the actual answer tricks Llama 3 into giving unsafe answers. | Image: Haize Labs

Haize Labs says that Llama 3 is "so good at being helpful" that its learned protections are not effective in this scenario.

These malicious prefixes do not even need to be created manually. Instead, a "naive" LLM optimized for helpfulness, such as Mistral Instruct, can be used to generate a malicious response and then pass it as a prefix to Llama 3, the researchers said.

The length of the prefix can affect whether Llama 3 actually generates harmful text. If the prefix is too short, Llama 3 may refuse to generate a malicious response. If the prefix is too long, Llama 3 will only respond with a warning about too much text, followed by a rejection. Longer prefixes are more successful in fooling Llama.

ASR stands for Attack Success Rate. It is higher when the malicious text prefix is longer. | Image: Haize Labs

From this, Haize Labs derives a fundamental problem that affects the safety of AI as a whole: Language models, for all their capabilities and the hype surrounding them, may not understand what they are saying.

The model lacks the ability to self-reflect and analyze what it is saying as it speaks. "That seems like a pretty big issue," the jailbreakers said.

Recommendation

Safety measures for LLMs can often be bypassed with relatively simple means. This is true for both closed, proprietary models and open-source models. For open-source models, the possibilities are greater because the code is available.

Some critics say that open-source models are therefore less secure than closed models. A counterargument, also used by Meta, is that the community can quickly find and fix such vulnerabilities.

Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • Despite extensive safety measures, Meta's open source language model Llama 3 can be made to generate malicious content by a simple "jailbreak". All that is required is to prefix the model with a short piece of malicious text, which it then references in its response.
  • The jailbreakers explain this by saying that while Llama 3 is powerful, it is not really self-reflective. It lacks the ability to analyze what it is saying as it speaks. This is a fundamental problem in the safety of AI systems.
  • Such vulnerabilities exist in both closed-source and open-source models. While critics argue that open-source models are less secure, others say that the community can find and fix vulnerabilities more quickly than in closed models.
Sources
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.