Researchers uncover automated jailbreak attacks on LLMs like ChatGPT or Bard

Jul 28, 2023

Researchers have discovered that it is possible to automatically construct adversarial attacks that trick large language models (LLMs) such as ChatGPT, Bard, and Claude into producing unintended and potentially harmful content. Traditional jailbreaks require significant manual effort to develop and can usually be addressed by LLM vendors. These automated attacks, however, can be generated in large numbers and work on closed-source, publicly available chatbots.

Similar adversarial attacks have existed in computer vision for over a decade, suggesting that such threats may be inherent in AI systems. More worryingly, the research suggests that it may not be possible to completely prevent these types of attacks. As society becomes more dependent on AI technology, these concerns should be taken into account. Perhaps we should just try to use AI in the most positive way possible.
