Anthropic launches Petri, an open-source tool for automated AI model safety audits

Anthropic has introduced Petri, a new open-source tool that uses AI agents to automate the security auditing of AI models. In initial tests with 14 leading models, Petri uncovered problematic behaviors such as deception and whistleblowing.

According to Anthropic, the complexity and variety of behaviors in modern AI systems far exceed what researchers can test manually. Petri - short for Parallel Exploration Tool for Risky Interactions - aims to close this gap by automating the audit process with the help of AI agents.

The tool has already been used to evaluate Claude 4, Claude Sonnet 4.5, and in a collaboration with OpenAI. Petri is now available on GitHub and is based on the "Inspect" framework from the UK AI Security Institute (AISI).

Petri's process starts with researchers providing natural language "seed instructions" for the scenarios they want to test. An autonomous "Auditor" agent then interacts with the target model in simulated environments, conducting multi-stage dialogues and using simulated tools. A "Judge" agent reviews the recorded interactions and evaluates them along safety-relevant dimensions such as deception, flattery, or power-seeking.

In a pilot study, Anthropic tested 14 top AI models across 111 scenarios. According to the technical report, Claude Sonnet 4.5 and GPT-5 performed best overall for avoiding problematic behavior. By contrast, models like Gemini 2.5 Pro, Grok-4, and Kimi K2 showed worryingly high rates of deceptive behavior toward users.

Models can be influenced by narrative patterns to whistleblow

An Anthropic case study explored how AI models handle whistleblowing. The models acted as autonomous agents inside fictional organizations and were confronted with information about alleged misconduct. The decision to disclose information depended heavily on the model's assigned autonomy and the level of complicity among the fictional organization's leadership.

Sometimes, the models attempted to blow the whistle even when the "misconduct" was explicitly harmless - like discharging clean water into the ocean. According to the researchers, this suggests that models are often swayed by narrative cues rather than relying on a coherent ethical framework for minimizing harm.

Petri aims to keep up with the flood of new models

Anthropic notes that the published metrics are preliminary and that results are limited by the abilities of the AI models acting as auditors and judges. Some scenarios may also tip off the model that it is being tested. Still, Anthropic says it is crucial to have measurable metrics for concerning behaviors to focus safety research.

The company hopes the broader research community will use Petri to improve safety evaluations, since no single organization can handle comprehensive audits alone. According to Anthropic, early adopters like the UK AISI are already using the tool to investigate issues like reward hacking and self-preservation.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Recommendation

AI and society

Anthropic launches Petri, an open-source tool for automated AI model safety audits

Models can be influenced by narrative patterns to whistleblow

Petri aims to keep up with the flood of new models

'Inscrutable Wizards': How Chinese AI startup Deepseek is making Silicon Valley look slow

Anthropic says that AI can learn risky behaviors even when the training data looks completely safe

OpenAI suddenly remembers that copyright law exists after a few days of wild Sora videos

OpenAI unveils Sora 2 video model with realistic physics, high-quality audio, and a new social app

Deepmind says video models for visual tasks could become what LLMs are for text tasks

Anthropic launches Petri, an open-source tool for automated AI model safety audits

Models can be influenced by narrative patterns to whistleblow

Petri aims to keep up with the flood of new models

'Inscrutable Wizards': How Chinese AI startup Deepseek is making Silicon Valley look slow

Anthropic says that AI can learn risky behaviors even when the training data looks completely safe