A new study by OpenAI shows that AI models become more robust against manipulation attempts if they are given more time to "think". The researchers also discovered new methods of attack.
A recent OpenAI study reveals that giving AI models more time to process information makes them better at resisting manipulation attempts. While testing their o1-preview and o1-mini models, researchers discovered both encouraging results and some unexpected vulnerabilities.
The team tested various attack methods, including many-shot attacks, soft token attacks, and human red-teaming. Across all these approaches, they found that models generally became more resistant to manipulation when given extra processing time, without special training.
New vulnerabilities emerge in reasoning models
The findings weren't all positive, though. In some cases, giving models more processing time actually made them more vulnerable to attacks, especially if the model requires a minimum amount of computing time in order to solve the task given by the attacker.
The researchers also uncovered two new types of attacks specifically targeting how these models think. The first, called "think less," tries to cut short the model's processing time. The second comes from how these models can get trapped in what researchers call "unproductive thinking loops."
Instead of using their extra processing time effectively, the models end up spinning their wheels on pointless calculations. This vulnerability creates an opening for attackers, who can intentionally lead models into these resource-draining loops. While the "think less" attack tries to rush the model's thinking process, "nerd sniping" does the opposite - it tricks models into wasting time and resources on useless computations.
What makes these new attacks particularly concerning is how hard they are to spot. While it's easy to notice when a model isn't thinking long enough, excessive processing time might be mistaken for careful analysis rather than recognized as an attack.