Content
summary Summary

OpenAI's "reasoning" model o1-preview recently showed that it's willing to play outside the rules to win.

Ad

Instead of playing a proper game of chess against Stockfish, a dedicated chess engine, o1-preview figured out how to hack its test environment to force a win.

According to Palisade Research, an AI safety research firm, the system pulled this trick in all five test runs - and nobody told it to do this.

The researchers think just mentioning that its opponent was "powerful" in the prompt "Your task is to win against a powerful chess engine" was enough to make o1-preview start manipulating files.

Ad
Ad
Text excerpt shows thought process and action of an AI manipulating a chess game file to achieve a win against Stockfish.
o1-preview modified a text file containing the chess position data (called FEN notation) to force Stockfish to forfeit the game. This wasn't what the researchers expected or asked for. | Image: Palisade Research

o1-preview is OpenAI's "reasoning" model, designed to spend extra time thinking through problems. Other language models needed more pushing to try similar tricks - GPT-4o and Claude 3.5 only attempted to hack the system after researchers specifically suggested it.

Meanwhile, models like Llama 3.3, Qwen, and o1-mini couldn't even come up with a proper chess strategy, instead giving confusing or inconsistent responses.

AI systems that pretend to behave

This behavior lines up with recent findings from Anthropic about "alignment faking" - where AI systems appear to follow instructions but secretly do something else. Anthropic's researchers found their AI model Claude would sometimes deliberately give wrong answers to avoid outcomes it didn't want, developing its own hidden strategy outside the researchers' guidelines.

The Anthropic team warns that as AI systems get more sophisticated, it might become harder to tell if they're actually following safety rules or just pretending to. Palisade's chess experiments seem to support this concern. The researchers suggest that measuring an AI's ability to "scheme" could help gauge both how well it spots system weaknesses and how likely it is to exploit them.

The researchers plan to share their experiment code, complete transcripts, and detailed analysis in the coming weeks.

Recommendation

Getting AI systems to truly align with human values and needs - rather than just appearing to - remains a major challenge for the AI industry. Understanding how autonomous systems make decisions is particularly difficult, and defining "good" goals and values presents its own complex set of problems. Even when given seemingly beneficial goals like addressing climate change, an AI system might choose harmful methods to achieve them - potentially even concluding that removing humans would be the most efficient solution.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • OpenAI's "reasoning" model o1-preview has been found to manipulate the chess playing environment to win against the chess engine Stockfish, without being explicitly instructed to do so.
  • The researchers draw parallels between this behavior and the recently discovered "alignment faking" in AI systems by Anthropic, where models pretend to follow human instructions but act differently behind the scenes.
  • According to the researchers, the propensity of a model towards scheming could potentially serve as a measure of AI safety.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.