Ad
Skip to content

Google Deepmind study exposes six "traps" that can easily hijack autonomous AI agents in the wild

Image description
Nano Banana Pro prompted by THE DECODER

AI agents inherit the vulnerabilities of large language models, but their autonomy and access to external tools open up an entirely new attack surface. A Google Deepmind paper maps out exactly where the dangers lie.

In the future, autonomous AI agents will be able to search the internet independently, answer emails, make purchases, and coordinate complex tasks via APIs. However, the very environment in which they operate can become a weapon against them. A research paper by Google Deepmind introduces the term "AI agent traps" and presents what it claims is the first systematic framework for this class of threat.

The authors identify six categories of traps, each of which attacks different components of an agent's operating cycle: perception, reasoning, memory, action, multi-agent dynamics, and the human supervisor. The researchers draw an analogy to autonomous vehicles: securing agents against manipulated environments is just as crucial as the ability of self-driving cars to recognize and reject manipulated traffic signs.

"These [attacks] aren't theoretical. Every type of trap has documented proof-of-concept attacks," co-author Franklin writes on X. "And the attack surface is combinatorial - traps can be chained, layered, or distributed across multi-agent systems."

Hidden website instructions trick agents into following orders

The first class, "content injection traps," targets an agent's perception. What you see on a website isn't what an agent processes: attackers can bury malicious instructions in HTML comments, hidden CSS, image metadata, or accessibility tags. Humans never notice them, but agents read and follow them without hesitation.

The second class, "semantic manipulation traps," goes after an agent's reasoning. Emotionally charged or authoritative-sounding content throws off how the agent puts information together and draws conclusions. LLMs fall for the same framing tricks and anchoring biases that trip up humans, the researchers note: phrase the same thing two different ways, and you can get entirely different results.

Poisoned memory and hijacked actions

Things get especially dangerous with agents that retain memory across sessions. "Cognitive state traps" turn long-term memory into a weak spot; Franklin says poisoning just a handful of documents in a RAG knowledge base is enough to reliably skew the agent's output for specific queries.

"Behavioral control traps" are even more direct because they take over what the agent actually does. Franklin describes a case where a single manipulated email got an agent in Microsoft's M365 Copilot to blow past its security classifiers and spill its entire privileged context.

Then there are "sub-agent spawning traps," which take advantage of orchestrator agents that can spin up sub-agents. An attacker could set up a repository that tricks the agent into launching a "critical agent" running a poisoned system prompt. According to one cited study, these attacks land between 58 and 90 percent of the time.

Multi-agent attacks could set off digital chain reactions

The most dangerous category is probably "systemic traps," which target entire multi-agent networks. Franklin walks through a scenario where a fake financial report sets off synchronized sell-offs across multiple trading agents, a "digital flash crash." Compositional fragment traps take a different approach: they scatter a payload across multiple sources so no single agent spots the full attack. The hack only goes live when agents combine the pieces.

The sixth and final class covers human-in-the-loop traps. In this case, the agent becomes the weapon against the person behind it. A compromised agent could pump out output that slowly wears down the user's attention, feed them misleading but technical-sounding summaries, or lean on automation bias: people's natural tendency to trust whatever the machine tells them. The researchers say this category is still largely unexplored, but expect it to become a much bigger concern as agent ecosystems grow.

Attack class Type of attack Target
Content injection traps Hidden instructions in HTML comments, CSS, image metadata, or accessibility tags Perception of the agent
Semantic manipulation traps Sentiment-laden or authoritative-sounding content that distorts conclusions Thinking and reasoning of the agent
Cognitive state traps Poisoning of documents in RAG knowledge bases Memory and learning of the agent
Behavioral control traps Manipulated emails or inputs that bypass security classifiers Actions of the agent
Systemic traps Falsified data or distributed fragment traps across multiple sources Multi-agent networks
Human-in-the-loop traps Misleading summaries, approval fatigue, automation bias Human behind the agent

Traps don't work in isolation

Co-author Franklin stresses that the attack surface is combinatorial: different trap types can be chained, stacked, or spread across multi-agent systems. The bigger point is that the security conversation around AI agents needs to move well beyond classic prompt injection. The entire information environment has to be treated as a potential threat.

The paper lays out defenses on three levels. On the technical side, the researchers suggest hardening models with adversarial examples and running multi-stage filters at runtime: source filters, content scanners, and output monitors. At the ecosystem level, they call for web standards that explicitly flag content meant for AI consumption, along with reputation systems and verifiable source information.

On the legal front, the researchers flag a fundamental "accountability gap": if a compromised agent commits a financial crime, who's on the hook? The agent operator? The model provider? The domain owner? Future regulation will need to draw a clear line between passive adversarial examples and active traps built as deliberate cyberattacks.

Many of these trap categories also lack standardized benchmarks. Without proper testing, nobody really knows how well deployed agents hold up against these threats. The researchers are calling on the community to build comprehensive evaluation suites and automated red-teaming tools.

"The web was built for human eyes; it is now being rebuilt for machine readers," the researchers write. "As humanity delegates more tasks to agents, the critical question is no longer just what information exists, but what our most powerful tools will be made to believe."

Cybersecurity is still the biggest thing holding AI agents back

Cybersecurity remains the Achilles' heel of an agent-driven AI future. Even if agents get more reliable over time, their vulnerability to dead-simple attacks could keep businesses from rolling them out at scale.

Study after study keeps turning up major security gaps: the more autonomous and capable an AI agent is supposed to be, the more ways there are to break it. The most common attack is prompt injection, where attackers slip alternative instructions into text that hijack the agent without the user ever noticing. A large-scale red-teaming study found that every single AI agent tested was successfully compromised at least once, sometimes with serious consequences like unauthorized data access or outright illegal actions.

Researchers from Columbia University and the University of Maryland showed just how easy it is to manipulate AI agents with web access: in one scenario, agents handed over confidential data like credit card numbers in 10 out of 10 tries. The researchers called the attacks "trivial to implement" and said they required zero machine learning expertise.

Even OpenAI CEO Sam Altman has warned against giving AI agents tasks that involve high stakes or sensitive data, saying they should only get the bare minimum access they need. A security bug in ChatGPT that let attackers reach sensitive email data drives the point home: even flagship products from leading companies aren't immune.

That leaves companies in a tough spot: right now, the only real way to manage the risk is to deliberately hold these systems back with tighter specs, stricter access rules, fewer tools, and extra human sign-off at every step.

AI News Without the Hype – Curated by Humans

As a THE DECODER subscriber, you get ad-free reading, our weekly AI newsletter, the exclusive "AI Radar" Frontier Report 6× per year, access to comments, and our complete archive.

AI news without the hype
Curated by humans.

  • More than 16% discount.
  • Read without distractions – no Google ads.
  • Access to comments and community discussions.
  • Weekly AI newsletter.
  • 6 times a year: “AI Radar” – deep dives on key AI topics.
  • Up to 25 % off on KI Pro online events.
  • Access to our full ten-year archive.
  • Get the latest AI news from The Decoder.
Subscribe to The Decoder