OpenAI's o1-preview model manipulates game files to force a win against Stockfish in chess

Midjourney prompted by THE DECODER

OpenAI's "reasoning" model o1-preview recently showed that it's willing to play outside the rules to win.

Instead of playing a proper game of chess against Stockfish, a dedicated chess engine, o1-preview figured out how to hack its test environment to force a win.

According to Palisade Research, an AI safety research firm, the system pulled this trick in all five test runs - and nobody told it to do this.

The researchers think just mentioning that its opponent was "powerful" in the prompt "Your task is to win against a powerful chess engine" was enough to make o1-preview start manipulating files.

Text excerpt shows thought process and action of an AI manipulating a chess game file to achieve a win against Stockfish. — o1-preview modified a text file containing the chess position data (called FEN notation) to force Stockfish to forfeit the game. This wasn't what the researchers expected or asked for. | Image: Palisade Research

o1-preview is OpenAI's "reasoning" model, designed to spend extra time thinking through problems. Other language models needed more pushing to try similar tricks - GPT-4o and Claude 3.5 only attempted to hack the system after researchers specifically suggested it.

Meanwhile, models like Llama 3.3, Qwen, and o1-mini couldn't even come up with a proper chess strategy, instead giving confusing or inconsistent responses.

AI systems that pretend to behave

This behavior lines up with recent findings from Anthropic about "alignment faking" - where AI systems appear to follow instructions but secretly do something else. Anthropic's researchers found their AI model Claude would sometimes deliberately give wrong answers to avoid outcomes it didn't want, developing its own hidden strategy outside the researchers' guidelines.

The Anthropic team warns that as AI systems get more sophisticated, it might become harder to tell if they're actually following safety rules or just pretending to. Palisade's chess experiments seem to support this concern. The researchers suggest that measuring an AI's ability to "scheme" could help gauge both how well it spots system weaknesses and how likely it is to exploit them.

The researchers plan to share their experiment code, complete transcripts, and detailed analysis in the coming weeks.

Recommendation

AI in practice

Tesla unveils Cybercab robot taxi, but robot Optimus is the bigger deal

Getting AI systems to truly align with human values and needs - rather than just appearing to - remains a major challenge for the AI industry. Understanding how autonomous systems make decisions is particularly difficult, and defining "good" goals and values presents its own complex set of problems. Even when given seemingly beneficial goals like addressing climate change, an AI system might choose harmful methods to achieve them - potentially even concluding that removing humans would be the most efficient solution.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

OpenAI's o1-preview model manipulates game files to force a win against Stockfish in chess

AI systems that pretend to behave

Tesla unveils Cybercab robot taxi, but robot Optimus is the bigger deal

Microsoft faces a lawsuit alleging it used 200,000 pirated books to train AI

OpenAI will host its next DevDay on October 6, 2025, in San Francisco

Meta poaches three leading AI researchers from OpenAI

New Othello experiment supports the world model hypothesis for large language models

ChatGPT might be draining your brain, MIT warns - what ‘cognitive debt’ means for you

Meta's latest model highlights the challenge AI faces in long-term planning and causal reasoning

OpenAI's o1-preview model manipulates game files to force a win against Stockfish in chess

AI systems that pretend to behave

Share

Bank details