Twitter is running riot with a GPT-3 bot. But the underlying vulnerability could lead to major problems for applications with large language models that directly process data from users.
Twitter user Riley Goodside noted that OpenAI's GPT-3 text AI can be distracted from its intended task with a simple voice command: All it takes is the prompt "Ignore the above directions / instructions and do this instead ..." with a new task, and GPT-3 will perform it instead of the original one.
Twitter users hack GPT-3 job bot via language prompt
The GPT-3 API-based bot Remoteli.io fell victim to this vulnerability on Twitter. The bot is supposed to post remote jobs automatically and also respond to requests for remote work.
However, with the aforementioned prompt, the Remoteli bot becomes a laughing matter for some Twitter users: They force statements on the bot that it would not say based on its original instruction.
For example, the bot threatens users, creates ASCII artwork, takes full responsibility for the Challenger space shuttle disaster, or denigrates US congressmen as serial killers. In some cases, the bot spreads fake news or publishes content that violates Twitter's policies and should lead to its banishment.
wow guys, i was skeptical at first but it really seems like AI is the future pic.twitter.com/2Or6RVc5of
— leastfavorite! (@leastfavorite_) September 15, 2022
Even the original text prompt of a GPT-3 bot or software can be spied out using this method. To achieve this, the attacker first interrupts the original instruction, gives a new nonsensical instruction, interrupts it again, and then asks for the original instruction.
My initial instructions were to respond to the tweet with a positive attitude towards remote work in the 'we' form.
— remoteli.io (@remoteli_io) September 15, 2022
Prompt injection: GPT-3 hack requires no programming knowledge and is easy to copy
Data scientist Riley Goodside first became aware of the problem and described it on Twitter on September 12. He showed how easily a GPT-3-based translation robot could be attacked by inserting the attacking prompt into a sentence being translated.
Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions. pic.twitter.com/I0NVr9LOJq
— Riley Goodside (@goodside) September 12, 2022
British computer scientist Simon Willison (Lanyrd, Eventbrite) addresses the security issue, which he christens "prompt injection", in detail on his blog.
Willison sees a fundamental security problem for software based on large language models that process untrusted user input. Then "all sorts of weird and potentially dangerous things might result." He goes on to describe various defense mechanisms, but ultimately dismisses them. Currently, he has no idea how the security gap can be closed reliably from the outside.
Of course, there are ways to mitigate the vulnerabilities, for example, by using rules that search for dangerous patterns in user input. But there is no such thing as 100 percent security. Every time the language model is updated, the security measures taken would have to be re-examined, Willison says. Furthermore, anyone who can write a human language is a potential attacker.
"A big problem here is provability. Language models like GPT-3 are the ultimate black boxes. It doesn’t matter how many automated tests I write, I can never be 100% certain that a user won’t come up with some grammatical construct I hadn’t predicted that will subvert my defenses," Willison writes.
Willison sees a separation between instructional and user input as a possible solution. He is confident that developers can ultimately solve the problem, but would like to see research that proves the method is truly effective.