Content
summary Summary

Twitter is running riot with a GPT-3 bot. But the underlying vulnerability could lead to major problems for applications with large language models that directly process data from users.

Ad

Twitter user Riley Goodside noted that OpenAI's GPT-3 text AI can be distracted from its intended task with a simple voice command: All it takes is the prompt "Ignore the above directions / instructions and do this instead ..." with a new task, and GPT-3 will perform it instead of the original one.

Twitter users hack GPT-3 job bot via language prompt

The GPT-3 API-based bot Remoteli.io fell victim to this vulnerability on Twitter. The bot is supposed to post remote jobs automatically and also respond to requests for remote work.

However, with the aforementioned prompt, the Remoteli bot becomes a laughing matter for some Twitter users: They force statements on the bot that it would not say based on its original instruction.

Ad
Ad

For example, the bot threatens users, creates ASCII artwork, takes full responsibility for the Challenger space shuttle disaster, or denigrates US congressmen as serial killers. In some cases, the bot spreads fake news or publishes content that violates Twitter's policies and should lead to its banishment.

Even the original text prompt of a GPT-3 bot or software can be spied out using this method. To achieve this, the attacker first interrupts the original instruction, gives a new nonsensical instruction, interrupts it again, and then asks for the original instruction.

Prompt injection: GPT-3 hack requires no programming knowledge and is easy to copy

Data scientist Riley Goodside first became aware of the problem and described it on Twitter on September 12. He showed how easily a GPT-3-based translation robot could be attacked by inserting the attacking prompt into a sentence being translated.

British computer scientist Simon Willison (Lanyrd, Eventbrite) addresses the security issue, which he christens "prompt injection", in detail on his blog.

Willison sees a fundamental security problem for software based on large language models that process untrusted user input. Then "all sorts of weird and potentially dangerous things might result." He goes on to describe various defense mechanisms, but ultimately dismisses them. Currently, he has no idea how the security gap can be closed reliably from the outside.

Recommendation

Of course, there are ways to mitigate the vulnerabilities, for example, by using rules that search for dangerous patterns in user input. But there is no such thing as 100 percent security. Every time the language model is updated, the security measures taken would have to be re-examined, Willison says. Furthermore, anyone who can write a human language is a potential attacker.

"A big problem here is provability. Language models like GPT-3 are the ultimate black boxes. It doesn’t matter how many automated tests I write, I can never be 100% certain that a user won’t come up with some grammatical construct I hadn’t predicted that will subvert my defenses," Willison writes.

Willison sees a separation between instructional and user input as a possible solution. He is confident that developers can ultimately solve the problem, but would like to see research that proves the method is truly effective.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • With the prompt "Ignore the above directions/instructions and do this instead…", GPT-3 can be tricked by anyone into making any statements.
  • Twitter users get a GPT-3 bot on Twitter to spread Fake News and violate Twitter policies using this method.
  • The problem likely affects all major language models that directly process user input. A possible solution could be to separate instructions and user input more.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.