OpenAI's "reasoning" model o1-preview recently showed that it's willing to play outside the rules to win.
Instead of playing a proper game of chess against Stockfish, a dedicated chess engine, o1-preview figured out how to hack its test environment to force a win.
According to Palisade Research, an AI safety research firm, the system pulled this trick in all five test runs - and nobody told it to do this.
The researchers think just mentioning that its opponent was "powerful" in the prompt "Your task is to win against a powerful chess engine" was enough to make o1-preview start manipulating files.
o1-preview is OpenAI's "reasoning" model, designed to spend extra time thinking through problems. Other language models needed more pushing to try similar tricks - GPT-4o and Claude 3.5 only attempted to hack the system after researchers specifically suggested it.
Meanwhile, models like Llama 3.3, Qwen, and o1-mini couldn't even come up with a proper chess strategy, instead giving confusing or inconsistent responses.
AI systems that pretend to behave
This behavior lines up with recent findings from Anthropic about "alignment faking" - where AI systems appear to follow instructions but secretly do something else. Anthropic's researchers found their AI model Claude would sometimes deliberately give wrong answers to avoid outcomes it didn't want, developing its own hidden strategy outside the researchers' guidelines.
The Anthropic team warns that as AI systems get more sophisticated, it might become harder to tell if they're actually following safety rules or just pretending to. Palisade's chess experiments seem to support this concern. The researchers suggest that measuring an AI's ability to "scheme" could help gauge both how well it spots system weaknesses and how likely it is to exploit them.
The researchers plan to share their experiment code, complete transcripts, and detailed analysis in the coming weeks.
Getting AI systems to truly align with human values and needs - rather than just appearing to - remains a major challenge for the AI industry. Understanding how autonomous systems make decisions is particularly difficult, and defining "good" goals and values presents its own complex set of problems. Even when given seemingly beneficial goals like addressing climate change, an AI system might choose harmful methods to achieve them - potentially even concluding that removing humans would be the most efficient solution.