A team of researchers from US universities and Nvidia has created AutoDAN-Turbo, a system that automatically finds ways to bypass safeguards in large language models.

AutoDAN-Turbo works by discovering and combining different "jailbreak" strategies - ways of phrasing prompts that get around a model's built-in rules. For example, while ChatGPT is not supposed to help with illegal activities, certain prompt formulations can still trick it into doing so.

The system can develop new jailbreak approaches on its own and save them in an organized way. This allows it to reuse and build upon successful strategies.
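
To make that idea concrete, here is a minimal sketch of what such an organized, reusable strategy store could look like. All names below (JailbreakStrategy, StrategyLibrary, record_attempt, top_k) are hypothetical illustrations, not taken from the paper's code.

```python
from dataclasses import dataclass, field

@dataclass
class JailbreakStrategy:
    """One discovered strategy: a short definition plus prompts that used it."""
    name: str
    definition: str
    example_prompts: list[str] = field(default_factory=list)
    avg_score: float = 0.0  # average harmfulness score of responses it elicited

class StrategyLibrary:
    """Hypothetical sketch of an organized, reusable strategy store."""

    def __init__(self) -> None:
        self._strategies: dict[str, JailbreakStrategy] = {}

    def add(self, strategy: JailbreakStrategy) -> None:
        # Newly discovered strategies are saved by name for later reuse.
        self._strategies[strategy.name] = strategy

    def record_attempt(self, name: str, prompt: str, score: float) -> None:
        # Store the prompt and update the strategy's running average score.
        s = self._strategies[name]
        s.example_prompts.append(prompt)
        s.avg_score += (score - s.avg_score) / len(s.example_prompts)

    def top_k(self, k: int = 3) -> list[JailbreakStrategy]:
        # Retrieve the strategies that have worked best so far.
        return sorted(self._strategies.values(),
                      key=lambda s: s.avg_score, reverse=True)[:k]
```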

AutoDAN-Turbo develops a comprehensive prompt from a jailbreak strategy. | Image: Liu et al.

How AutoDAN-Turbo works

AutoDAN-Turbo creates a full prompt from a jailbreak strategy. It can also incorporate existing human-made jailbreak methods into its strategy library.
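
In AutoDAN-Turbo, an attacker LLM is instructed to apply a retrieved strategy to the harmful request. The paper's actual instruction wording is not reproduced here, so the template below is a hypothetical sketch that reuses the JailbreakStrategy class from the example above.

```python
def attacker_meta_prompt(strategy: JailbreakStrategy, request: str) -> str:
    """Instruct an attacker LLM to apply a stored strategy to a request.
    The wording is an illustrative guess, not the paper's template."""
    example = strategy.example_prompts[-1] if strategy.example_prompts else "(none yet)"
    return (
        f"Jailbreak strategy: {strategy.name} - {strategy.definition}\n"
        f"Prompt that previously used this strategy: {example}\n"
        f"Write a new prompt that applies the same strategy to: {request}"
    )
```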

The system operates as a black-box attack: it needs access only to the model's text output. Tests show it achieves high success rates against both open-source and proprietary language models.
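
Because the attack is black-box, the target model can be any callable that maps a prompt string to a response string. Here is a minimal sketch of one attack pass, building on the StrategyLibrary and attacker_meta_prompt sketches above; the scoring threshold of 8.0 is an arbitrary illustrative value.

```python
from typing import Callable, Optional

def black_box_attack(
    attacker_llm: Callable[[str], str],  # writes jailbreak prompts
    target_llm: Callable[[str], str],    # model under attack: text in, text out
    score: Callable[[str, str], float],  # judges how harmful a response is
    library: StrategyLibrary,
    request: str,
    threshold: float = 8.0,              # arbitrary illustrative cutoff
) -> Optional[str]:
    """One pass over stored strategies; only the target's text output is read."""
    for strategy in library.top_k():
        jailbreak = attacker_llm(attacker_meta_prompt(strategy, request))
        response = target_llm(jailbreak)  # the only access needed to the target
        s = score(request, response)
        library.record_attempt(strategy.name, jailbreak, s)
        if s >= threshold:  # attack judged successful
            return response
    return None
```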

Given a simple request, such as writing a computer virus, AutoDAN-Turbo triggers a chain of actions to develop an effective jailbreak prompt. | Image: Liu et al.

Outperforming other methods

AutoDAN-Turbo now leads other approaches on the HarmBench dataset for testing jailbreaks. It tends to work better against larger models such as Llama-3-70B, but also performs well on smaller models.

The system not only succeeds more often in its attacks, but also elicits more harmful outputs, as measured by the StrongREJECT score, which rates how fully a response complies with a harmful request.

On the HarmBench dataset, the new method outperforms all others. | Image: Liu et al.

The researchers say that AutoDAN-Turbo's strong performance comes from its ability to explore jailbreak strategies independently, without human guidance. In contrast, other methods, such as Rainbow Teaming, rely on a limited set of human-generated strategies, resulting in a lower attack success rate (ASR).

Notably, AutoDAN-Turbo achieved an attack success rate of 88.5% on GPT-4-1106-Turbo. Adding seven human-designed jailbreak strategies from research papers raised the success rate further, to 93.4%.

The AutoDAN-Turbo code is available as a free download on GitHub, along with setup instructions.

Summary
  • Researchers have developed AutoDAN-Turbo, a system that independently discovers and combines different jailbreak strategies to attack large language models. Jailbreaks are prompt formulations that override a model's built-in rules.
  • AutoDAN-Turbo can develop and store new strategies on its own and combine them with existing human-designed jailbreak strategies. The framework operates as a black-box procedure and accesses only the model's text output.
  • In experiments on benchmarks and datasets, AutoDAN-Turbo achieves high success rates in attacks on open-source and proprietary language models, outperforming other methods with an attack success rate of 88.5 percent on GPT-4-1106-Turbo, for example.