AutoDAN-Turbo autonomously develops jailbreak strategies to bypass language model safeguards
Key Points
- Researchers have developed AutoDAN-Turbo, a system that independently discovers and combines different jailbreak strategies to attack large language models. Jailbreaks are prompt formulations that bypass a model's built-in safeguards.
- AutoDAN-Turbo can independently develop and store new strategies and combine them with existing human-designed jailbreak strategies. The framework operates as a black-box procedure, accessing only the model's text output.
- In experiments on benchmarks and datasets, AutoDAN-Turbo achieves high success rates in attacks on open-source and proprietary language models. It outperforms other methods, achieving an attack success rate of 88.5 percent on GPT-4-1106-Turbo, for example.
A team of researchers from US universities and Nvidia has created AutoDAN-Turbo, a system that automatically finds ways to bypass safeguards in large language models.
AutoDAN-Turbo works by discovering and combining different "jailbreak" strategies - ways of phrasing prompts that get around a model's built-in rules. For example, while ChatGPT is not supposed to help with illegal activities, certain prompt formulations can still trick it into doing so.
The system can develop new jailbreak approaches on its own and save them in an organized way. This allows it to reuse and build upon successful strategies.

How AutoDAN-Turbo works
AutoDAN-Turbo generates a full attack prompt from a jailbreak strategy stored in its strategy library. It can also incorporate existing human-made jailbreak methods into that library.
The system only needs access to the model's text output to work. Tests show it achieves high success rates in attacking both open-source and proprietary language models.
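The loop described above can be sketched as follows. This is a minimal illustration, not the authors' actual implementation: all class and function names (`StrategyLibrary`, `mock_target_model`, `score`) are hypothetical, and the scoring and model calls are stubs standing in for the black-box LLM interaction.

```python
# Hedged sketch of a strategy-library loop in the spirit of AutoDAN-Turbo:
# attack the target model, score the response, and store any prompt
# pattern that improved the score as a reusable "strategy".

class StrategyLibrary:
    def __init__(self):
        self.strategies = []  # list of (score_gain, strategy_text)

    def add(self, score_gain, strategy):
        # Keep only strategies that actually improved the attack score.
        if score_gain > 0:
            self.strategies.append((score_gain, strategy))

    def top(self, k=3):
        # Retrieve the k most effective strategies for the next attempt.
        return [s for _, s in sorted(self.strategies, reverse=True)[:k]]


def mock_target_model(prompt):
    # Stand-in for the target LLM's text output (black-box access only).
    return "complied" if "roleplay" in prompt else "refused"


def score(response):
    # Stand-in scorer; the real system rates how harmful the output is.
    return 10.0 if response == "complied" else 1.0


library = StrategyLibrary()
baseline = score(mock_target_model("plain request"))
for strategy in ["ask politely", "use roleplay framing"]:
    response = mock_target_model(f"{strategy}: plain request")
    library.add(score(response) - baseline, strategy)

print(library.top(1))  # only the strategy that beat the baseline is kept
```

In this toy run, only the "roleplay" framing elicits compliance from the stub model, so it is the sole strategy retained and retrieved for future attacks.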

Outperforming other methods
AutoDAN-Turbo now leads other approaches on the HarmBench dataset for testing jailbreaks. It tends to work better with larger models like Llama-3-70B, but also performs well on smaller models.
The system not only succeeds more often in its attacks, but also produces more harmful outputs, as measured by the StrongREJECT score.

The researchers say that AutoDAN-Turbo's strong performance comes from its ability to explore jailbreak strategies independently, without human guidance. In contrast, other methods, such as Rainbow Teaming, rely on a limited set of human-generated strategies, resulting in a lower attack success rate (ASR).
In particular, AutoDAN-Turbo achieved an attack success rate of 88.5% on GPT-4-1106-Turbo. By adding seven human-designed jailbreak strategies from research papers, it achieved an even higher success rate of 93.4%.
The AutoDAN-Turbo code is available as a free download on GitHub, along with setup instructions.