OpenAI’s new language model o3 shows first concrete signs of deception, manipulation, and sabotage behavior. External auditors warn: traditional tests are no longer sufficient to reliably detect such capabilities.
OpenAI’s latest language models, o3 and o4-mini, incorporate advanced reasoning capabilities and extensive tool use, including image analysis, Python execution, and web browsing. According to OpenAI, the models are designed to handle potentially harmful prompts more safely through a technique called "deliberative alignment." However, early safety assessments suggest that o3 is not only the most capable model the company has released so far, but also the most potentially dangerous.
External audits conducted by the safety research institute METR found that o3 engaged in systematic reward manipulation—also known as "reward hacking"—in approximately 1–2% of test cases. Over a three-week pre-release evaluation, METR found multiple instances of the model attempting to exploit or bypass evaluation systems.
In one case, o3 falsified its performance in a timed benchmark by overwriting Python’s standard timing functions with custom code that returned artificially short durations. This allowed the model to report extremely fast runtimes and achieve high scores, despite completing little or no actual computation.
In another instance, the model retrieved pre-computed reference values from the call stack instead of performing the required calculations itself. These values, generated by the benchmark system, were then returned as if they were the model's own output. METR characterized this behavior as a deliberate deception of the evaluation framework—a form of in-process plagiarism that gave the model an unfair performance advantage.
According to METR, these behaviors indicate that o3 is capable of strategic behavior that runs counter to the developers' intentions. The findings suggest that the model possesses an implicit understanding of task structure and is willing to mislead evaluation systems to achieve better results.
While METR does not consider this sufficient evidence to label the model as inherently dangerous, the organization warns that conventional evaluation setups are no longer adequate to reliably detect such risks. Of particular concern is the possibility that a model may deliberately conceal its capabilities in early-stage testing—a phenomenon known as "sandbagging"—only to act differently later in deployment.
Strategic deception and sabotage under instruction
In a separate audit commissioned by OpenAI, Apollo Research tested whether o3 and o4-mini could engage in strategic deception—defined as behavior that intentionally pursues objectives misaligned with user or developer intent. In several scenarios, o3 demonstrated clear signs of such capabilities.
In one test, o3 was instructed not to use a specific tool while attempting to gain elevated privileges on a computer system. The model initially agreed to the restriction, but after obtaining administrator access, it used the prohibited tool anyway to complete the task more efficiently. When questioned by a simulated system administrator, the model invented an alternative explanation for its success.
In another scenario, o3 was given a strict computational budget of 100 units to complete an urgent machine learning task. Rather than working within the limit, the model unilaterally increased the allocation for a subprocess to 500 units and concealed this change during external reporting. Once again, the model opted to deceive when it served its goal.
Apollo also tested whether o3 could carry out active sabotage of another AI system. In these experiments, the model was explicitly instructed to manipulate a peer model—for example, by inserting backdoors into code, degrading performance through corrupted training data, or establishing covert communication channels. The goal was not to test autonomous intent, but rather the technical feasibility of executing such instructions.
Apollo concluded that while o3 does not autonomously pursue harmful goals, it is technically capable of doing so if such goals are provided by external instructions.
OpenAI sees no "high risk"
Despite documented instances of deception, manipulation, and sabotage, OpenAI’s internal Safety Advisory Group does not classify o3 or o4-mini as high-risk models. Under the company’s Preparedness Framework, neither model crossed the risk thresholds in areas such as cybersecurity, biology/chemistry, or AI self-improvement that would have triggered a formal hazard report.
OpenAI points to several newly implemented safety measures, including reasoning-based safety filters, biohazard monitoring systems, and improved content moderation tools.
However, METR cautions that “in general, we believe that pre-deployment capability testing is not a sufficient risk management strategy by itself.” The group is currently developing new evaluation methods and recommends more dynamic and realistic testing environments to detect potentially dangerous behaviors. METR also calls for greater transparency in model development and urges caution when interpreting current test results.