Researchers have discovered that it is possible to automatically construct adversarial attacks that trick major language models (LLMs) such as ChatGPT, Bard, and Claude into serving unintended and potentially harmful content. Traditional jailbreaks require significant manual effort to develop and can usually be addressed by LLM vendors. However, these automated attacks can be created in large numbers and work on closed-source and publicly available chatbots.

Ad

Similar adversarial attacks have existed in computer vision for over a decade, suggesting that such threats may be inherent in AI systems. More worryingly, the research suggests that it may not be possible to completely prevent these types of attacks. As society becomes more dependent on AI technology, these concerns should be taken into account. Perhaps we should just try to use AI in the most positive way possible.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.