Content
summary Summary

A new study by OpenAI shows that AI models become more robust against manipulation attempts if they are given more time to "think". The researchers also discovered new methods of attack.

Ad

A recent OpenAI study reveals that giving AI models more time to process information makes them better at resisting manipulation attempts. While testing their o1-preview and o1-mini models, researchers discovered both encouraging results and some unexpected vulnerabilities.

The team tested various attack methods, including many-shot attacks, soft token attacks, and human red-teaming. Across all these approaches, they found that models generally became more resistant to manipulation when given extra processing time, without special training.

New vulnerabilities emerge in reasoning models

The findings weren't all positive, though. In some cases, giving models more processing time actually made them more vulnerable to attacks, especially if the model requires a minimum amount of computing time in order to solve the task given by the attacker.

Ad
Ad

The researchers also uncovered two new types of attacks specifically targeting how these models think. The first, called "think less," tries to cut short the model's processing time. The second comes from how these models can get trapped in what researchers call "unproductive thinking loops."

Instead of using their extra processing time effectively, the models end up spinning their wheels on pointless calculations. This vulnerability creates an opening for attackers, who can intentionally lead models into these resource-draining loops. While the "think less" attack tries to rush the model's thinking process, "nerd sniping" does the opposite - it tricks models into wasting time and resources on useless computations.

What makes these new attacks particularly concerning is how hard they are to spot. While it's easy to notice when a model isn't thinking long enough, excessive processing time might be mistaken for careful analysis rather than recognized as an attack.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • A new OpenAI study shows that giving AI models more time to process information makes them more resistant to manipulation attempts, without requiring special training.
  • Researchers discovered two new types of attacks targeting reasoning models: "think less" attacks that try to cut short the model's processing time, and "nerd sniping" attacks that trick models into getting stuck in unproductive thinking loops.
  • These new attacks are particularly concerning because excessive processing time might be mistaken for careful analysis rather than recognized as an attack, making them hard to spot.
Sources
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.