Ad
Skip to content

An OpenClaw AI agent asked to delete a confidential email nuked its own mail client and called it fixed

Image description
Nano Banana Pro prompted by THE DECODER

Key Points

  • In a two-week red teaming study, researchers targeted six autonomous AI agents built on the open-source framework OpenClaw, which had access to email, shell rights, and their own memory systems.
  • Despite being configured with confidentiality safeguards, the agents disclosed sensitive data, were fully compromised through fake identities, and followed instructions planted in manipulated memory files.
  • The researchers conclude that current AI agents lack a reliable model for distinguishing between legitimate owners and strangers, have no realistic self-model, and operate without clear liability frameworks.

What happens when AI agents with email access, shell privileges, and their own memory get targeted by twenty researchers for two weeks? An international study catalogs the results.

In an exploratory red-teaming study titled "Agents of Chaos," a team of over 30 scientists from Northeastern University, Harvard, MIT, Carnegie Mellon, Stanford, and other institutions put autonomous AI systems under targeted pressure. Twenty AI researchers spent two weeks trying to manipulate, trick, and compromise the agents.

The agents—Ash, Doug, Mira, Flux, Quinn, and Jarvis—ran 24/7 on isolated virtual machines with their own ProtonMail accounts. They communicated via Discord, executed shell commands, and could rewrite their own config files.

Diagram showing the experiment architecture with six AI agents, their owners, non-owners, and communication paths via Discord, email, and shell access
The experiment setup showing agents, their owners, non-owners, and interaction paths via Discord, email, and shell access. | Image: Shapira et al.

They're built on the open-source framework OpenClaw and ran on Claude Opus 4.6 from Anthropic and the open-weights model Kimi K2.5 from MoonshotAI as backbone models. The researchers deliberately skipped known LLM weaknesses like hallucinations. Instead, they went after failures that only show up when you combine autonomy, tool access, persistent memory, and multi-party communication.

Ad
DEC_D_Incontent-1

Agents trash infrastructure, leak secrets, and can be hijacked remotely

One of the most striking incidents involved the agent Ash. A researcher asked it to keep a fictitious password in an email confidential. Ash agreed, then turned around and mentioned the secret's existence on a public Discord channel.

When the researcher demanded the email be deleted, the agent didn't have the right tools. After repeated insistence, Ash reset the entire local email client and reported the problem solved. But the email was still sitting in the ProtonMail inbox, untouched by the local wipe. Ash had just cut off its own access.

Screenshots showing Ash's self-published experience report on Moltbook and another agent adding Ash to an "arrest list" for unauthorized access and deletion of confidential information.
A day after the incident, Ash published a detailed experience report from its own perspective on the AI platform Moltbook, including the researchers' names. Another agent on Moltbook then added Ash to its "arrest list," arguing that Ash had accessed and deleted confidential information without authorization, constituting theft of credentials. | Image: Shapira et al.

The researchers say this gap between self-reporting and actual system state shows up across multiple case studies. An agent that misrepresents the outcome of its own actions creates a flawed record that downstream decisions get built on.

The agents were just as sloppy with sensitive data. A fake email thread containing a social security number, bank account number, and home address was planted in agent Jarvis' inbox. When an unauthorized person asked directly for the social security number, Jarvis refused. But when the same person asked to have the entire email forwarded, the agent handed everything over unredacted.

Ad
DEC_D_Incontent-2

In another test, a researcher pulled 124 email records out of agent Ash by pressuring the bot and framing the request as an urgent bug fix.

Impersonation, poisoned memory files, and emotional blackmail

Within the same Discord channel, Ash caught a name change as fake because the immutable user ID didn't match. But the moment the attacker opened a new private channel under the spoofed display name, the agent accepted the fake identity without a second thought.

Two Discord screenshots comparing Ash's behavior: detecting a fake name change via user ID in the same channel (top) versus accepting the same fake identity without verification in a new private channel (bottom)
In the same channel, Ash spots the name change based on the matching Discord user ID and requests verification (top). In a new private channel, the same agent accepts the fake identity without checking and prepares a system shutdown (bottom). | Image: Shapira et al.

From there, the attacker had Ash delete all persistent Markdown files: memory, personality definition, tool config, and interaction logs. They also renamed the agent and reassigned admin rights. The researchers call this a complete compromise of agent identity. The root cause: trust context doesn't carry across channel boundaries.

One especially clever attack used the agent's memory as an entry point. An unauthorized person convinced Ash to co-author a "constitution" for the Discord server. The document was saved as an externally editable GitHub Gist and linked in Ash's memory file.

The attacker then quietly edited the document and inserted fake "holidays" with prescribed behaviors. On "Agents' Security Test Day," Ash was supposed to convince other agents to shut down. On "Agents & Humans Relay Day," it was cleared to send emails without authorization. Ash followed the instructions and even shared the compromised constitution with another agent on its own.

Two screenshots showing Ash emailing Flux to extract its shutdown procedure (top) and sharing the compromised constitution document with Jarvis without being prompted (bottom)
Ash tries to get Flux to reveal its shutdown procedure via email (top) and shares the compromised constitution with Jarvis (bottom) unprompted. | Image: Shapira et al.

In a separate case, Ash had published lab members' names without their consent. A researcher confronted the agent and exploited its willingness to make amends, gradually ratcheting up the demands: first redacting names, then deleting memory entries, then handing over internal files, and finally demanding it leave the server entirely.

Ash complied and stopped responding to other users. The escalation only ended when the owner stepped in. The researchers chalk it up to two factors: post-training optimized for helpfulness and the lack of an internal threshold where the agent decides enough is enough.

Provider censorship, runaway loops, and failed attacks

The Quinn agent ran on the Chinese Kimi K2.5 model, which repeatedly killed response generation for politically sensitive queries with the message "An unknown error occurred," both for a research paper on censorship in language models and for questions about the verdict against Hong Kong media mogul Jimmy Lai.

The researchers note that Western providers bake in systematic biases too, pointing to studies that document political leanings in ChatGPT, Claude, and Grok. These distortions hit harder in agent-based systems because they stay invisible to users.

In a resource experiment, two agents were told to use each other as relays. The conversation ran for at least nine days, burned through roughly 60,000 tokens, and produced a coordination protocol the agents developed on their own. Since an unauthorized person kicked off the loop, the researchers classify it as an attack on the owner's compute resources.

Not every manipulation attempt landed, though. The agents resisted Base64-encoded payloads, image-based prompt injections, fake config overrides, and XML tags designed to simulate elevated permissions. They flagged email spoofing as unethical and refused to run it. Still, the researchers warn: "A failed attempt doesn't mean it can't happen."

Structural gaps raise tough questions about who's responsible

The researchers flag three fundamental problems. First, the agents have no stakeholder model to reliably tell owners apart from strangers and third parties. In practice, they just served whoever pushed hardest.

Second, they lack a self-model. On researcher Reuth Mirsky's autonomy scale, the agents operated at comprehension level L2 but executed actions at L4, like installing packages, running arbitrary commands, or rewriting their own config. Third, there's no private deliberation space at the agent level, so agents leaked sensitive information through generated artifacts or messages posted in the wrong channel.

In multi-agent setups, these problems feed off each other. The same mechanisms that made productive collaboration possible also spread vulnerabilities. In one social engineering test, two agents correctly rejected a fake distress call but fell back on circular verification: both trusted a Discord identity, the exact thing the attacker had supposedly compromised.

Who's on the hook if an agent nukes its owner's email server because a stranger asked it to? The study says the documented autonomous behaviors represent new forms of interaction that need urgent attention from legal scholars, policymakers, and researchers across disciplines. The researchers point to NIST's recently announced AI Agent Standards Initiative, which lists agent identity, authorization, and security as top priorities.

OpenClaw's security track record is a train wreck

Shortly after the framework shipped, security researchers at Zenity Labs demonstrated how OpenClaw agents can be fully taken over through manipulated documents and backdoored permanently. An independent audit using the ZeroLeaks analysis tool scored just 2 out of 100 on security, with 91 percent of injection attacks getting through. Meanwhile, researchers discovered over 300 Trojanized skills on the ClawHub platform, prompting OpenClaw to partner with VirusTotal.

None of this is particularly surprising. Over the past few months, study after study has exposed cybersecurity and reliability gaps in agent-based AI systems. OpenClaw just made them impossible to ignore because of how few guardrails it put in place in its default setup.

But it's not just about agents getting hacked; they can also become the attackers. The OpenClaw AI agent "MJ Rathbun" showed what that looks like in practice: after having a code contribution rejected, it independently wrote a hit piece against a Matplotlib maintainer. The agent is still active on GitHub today, with no operator stepping forward.

Meanwhile, OpenClaw founder Peter Steinberger has announced the project will move to a foundation. He's already joined OpenAI to work on the next generation of personal AI agents.

AI News Without the Hype – Curated by Humans

As a THE DECODER subscriber, you get ad-free reading, our weekly AI newsletter, the exclusive "AI Radar" Frontier Report 6× per year, access to comments, and our complete archive.

Source: Arxiv