Researchers uncover automated jailbreak attacks on LLMs like ChatGPT or Bard

Researchers have found that adversarial attacks can be constructed automatically to trick large language models (LLMs) such as ChatGPT, Bard, and Claude into producing unintended and potentially harmful content. Traditional jailbreaks take significant manual effort to develop, and LLM vendors can usually patch them. These automated attacks, however, can be generated in large numbers and work on closed-source and publicly available chatbots.

Similar adversarial attacks have existed in computer vision for over a decade, suggesting that such threats may be inherent to AI systems. More worryingly, the research suggests that it may not be possible to prevent these types of attacks completely. As society becomes more dependent on AI technology, these risks should be taken into account. Perhaps we should simply try to use AI in the most positive way possible.

