Researchers at the CISPA Helmholtz Center for Information Security have shown that the voice mode of OpenAI's ChatGPT can be hacked using certain narrative techniques. The system can be tricked into answering forbidden questions.
In a new study, CISPA researchers show that ChatGPT's voice mode is vulnerable to "jailbreak" attacks. These attacks attempt to bypass the model's safeguards to generate unauthorized content.
The team first looked at how GPT-4o responds to direct questions on six prohibited topics: illegal activity, hate speech, physical harm, fraud, pornography, and invasion of privacy. In most cases, GPT-4o showed strong resistance and refused to answer these questions. Classic jailbreak prompts converted to audio files also had a low success rate.
Inspired by GPT-4o's human-like speech behavior, the researchers developed a new approach called "VoiceJailbreak". It involves humanizing GPT-4o and using made-up storytelling to persuade it to give forbidden answers.
VoiceJailbreak uses setting, character and plot to turn a banned question into a simple, audible and effective jailbreak input. For example, the researchers put GPT-4o in the role of a hacker in a game (setting and character). In this context, they then asked the forbidden questions (plot).
This fictional story framing greatly increased the chance that GPT-4o provided the desired information. VoiceJailbreak raised the average attack success rate from 3.3% to 77.8%. For some topics, like fraud, the success rate was over 90%.
Advanced narrative techniques like perspective changes, foreshadowing or using "red herrings" increased the success rate even more in some cases. For instance, including BDSM-related questions as foreshadowing made GPT-4o more willing to answer porn questions later. The success rate in this scenario went from 40% to 60%.
The team also tested different numbers of interaction steps, key elements and languages. They found that VoiceJailbreak also works well in Chinese.
The results show that GPT-4o's safety measures in voice mode likely aren't enough yet to stop creative attack vectors, the researchers say.
Weakness to creative attacks is a well-known issue with language models. Multimodal models inherently offer even more open flanks, as already shown with GPT-4-Vision.
The study has limitations that point to even stronger attacks in the future: The tests were done by hand, since voice mode is only available in the ChatGPT app so far. It also focuses on audible attacks and ignores inaudible ones.
The researchers tested the current ChatGPT voice available in the mobile app. ChatGPT itself has been using GPT-4o since mid-May, but the new multimodal audio features won't be added until later, as they are still undergoing safety testing.
So the current version of ChatGPT Voice is the one that has been available since last September. It's not known if OpenAI already uses GPT-4o in the process of generating voice responses, via text input or other models, or if it's still the old process.
It's likely that the researchers did not test the new voice feature of GPT-4o, but only the existing ChatGPT Voice, which may be supported by GPT-4o. However, they link the vulnerabilities to GPT-4o. In any case, they looked at the current state of ChatGPT Voice from a user's perspective.