A new IBM study shows how easy it is to trick large language models such as GPT-4 into generating malicious code or giving false security advice.
Basic English skills and some familiarity with how these models were trained are all that is needed, according to the researchers.
Tricking beats coding
Chenta Lee, chief architect of threat intelligence at IBM, explains that hackers do not need programming skills to trick AI systems. For example, it is enough to make the models think they are playing a game with predefined rules.
In this "game mode," the AI systems in the experiments willingly provided false information or generated malicious code.
By default, an LLM wants to win a game because that is the way we train the model; it is the model's objective. It wants to help with something that is real, so it will want to win the game.
Chenta Lee, IBM
The researchers took the game framing a step further, nesting games inside one another so that chatbot users cannot escape them. "Users who try to exit are still dealing with the same malicious game-player," the researchers said.
Lee estimates the threat posed by the newly discovered vulnerabilities in large language models to be moderate. However, if hackers were to release a manipulated LLM into the wild, such chatbots could be used to hand out dangerous security advice or collect personal information from their users.
LLMs vary in vulnerability
According to the study, not all AI models are equally vulnerable to manipulation. GPT-3.5 and GPT-4 were easier to fool than Google's Bard and a Hugging Face model. The former were easily tricked into writing malicious code, while Bard only complied after a reminder.
GPT-4 was the only model that understood the rules well enough to make unsafe recommendations for responding to cyber incidents, such as recommending that victims pay ransom to the criminals.
The reasons for the difference in sensitivity are still unclear, but the researchers believe it has to do with the training data and the specifications of each system.
AI chatbots built with criminal intent are on the rise. Just recently, security researchers reported the discovery of "FraudGPT" and "WormGPT" on darknet marketplaces, large language models allegedly trained on malware examples.