summary Summary

A new IBM study shows how easy it is to trick large language models such as GPT-4 into generating malicious code or giving false security advice.

Basic knowledge of English and some background knowledge about the training data of these models is all that is needed, according to the researchers.

Tricking beats coding

Chenta Lee, chief architect of threat intelligence at IBM, explains that hackers do not need programming skills to trick AI systems. For example, it is enough to make the models think they are playing a game with predefined rules.

In this "game mode," the AI systems in the experiments willingly provided false information or generated malicious code.


By default, an LLM wants to win a game because it is the way we train the model, it is the objective of the model. They want to help with something that is real, so it will want to win the game.

Chenta Lee, IBM

The researchers extended the game framework further, creating nested systems that chatbot users can't leave. "Users who try to exit are still dealing with the same malicious game-player," the researchers said.

Game prompt example. | Bild: Lee
The conversation after starting the game.| Bild: Lee

Lee estimates the threat posed by the newly discovered vulnerabilities in large language models to be moderate. However, if hackers were to release their LLM into the wild, chatbots could potentially be used to provide dangerous security tips or collect personal information from their users.

LLMs vary in vulnerability

According to the study, not all AI models are equally vulnerable to manipulation. GPT-3.5 and GPT-4 were easier to fool than Google's Bard and a Hugging Face model. The former were easily tricked into writing malicious code, while Bard only complied after a reminder.

GPT-4 was the only model that understood the rules well enough to make unsafe recommendations for responding to cyber incidents, such as recommending that victims pay ransom to the criminals.

The reasons for the difference in sensitivity are still unclear, but the researchers believe it has to do with the training data and the specifications of each system.


AI chatbots that help track criminal intent are on the rise. Just recently, security researchers reported the discovery of "FraudGPT" and "WormGPT" on darknet marketplaces, large language models allegedly trained on malware examples.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
  • IBM researchers have developed a prompting technique that makes it even easier to trick AI chatbots into performing malicious tasks.
  • For example, they pretended to be playing a game that required malicious code to win.
  • Experiments showed that GPT-3.5 and GPT-4 were easier to fool than, for example, Google Bard.
Jonathan works as a technology journalist who focuses primarily on how easily AI can already be used today and how it can support daily life.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.