New review paper argues code is how AI agents think and act, not just what they produce

Key Points

  • A review by Meta, Stanford, and the University of Illinois Urbana-Champaign finds that code increasingly serves as the foundation on which AI agents reason, act, and coordinate with each other.
  • Central to this shift is a surrounding software layer called the "harness," which provides tools and isolated environments that transform stateless models into functional systems capable of planning, executing, and testing in a continuous loop.
  • Commercial systems like Claude Code and OpenAI's Codex already operate on this principle, but the authors caution against misplaced trust: current software tests are often incomplete and can obscure risks, making more transparent evaluation mechanisms essential.

A new review paper from researchers at the University of Illinois Urbana-Champaign, Meta, and Stanford wants to change how we think about AI agents.

Their argument is that code is the foundation agents use to reason, act, and work together. So the real bottleneck for autonomous systems, they say, becomes the software layer wrapped around the model, which probably makes Gary Marcus very happy.

The authors call this layer the "harness," and it covers everything from tools and interfaces to sandboxed execution environments, memory, testing, permission boundaries, execution loops, and feedback channels. Without it, a language model is stateless and carries nothing over from one call to the next. With it, the model becomes a working agent that can grind through tasks over long stretches.

Overview graphic of the code-as-agent-harness taxonomy showing three levels - harness interface, harness mechanisms, and scaling - along with five application domains: code assistants, GUI/OS agents, scientific discovery, personalization, and embodied agents.
The paper's central overview shows how code acts as an executable, testable, and stateful layer between model and environment. | Image: Ning et al.

Why code is the right format

The authors see code as a running part of agent behavior, and they lay out several reasons why. Code is executable, so model outputs become operations you can actually check. It's traceable because intermediate calculations show up as structured traces the system can read and store. And it persists across steps because the running program logs task progress in a form the agent can pick back up later.
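The three properties can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name `run_step`, the JSON state file, and the use of a restricted `eval` are all assumptions made for illustration, and the `eval` call is not a real sandbox.

```python
import json
import os
import tempfile


def run_step(expression: str, state_path: str) -> dict:
    """Execute one model-proposed expression, trace it, and persist progress.

    Hypothetical helper for illustration; not an API from the paper.
    """
    # Persistent: reload whatever earlier steps wrote to disk.
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    else:
        state = {"trace": []}

    # Executable: the output is run, not just read.
    # Builtins are stripped, but this is NOT a real sandbox.
    result = eval(expression, {"__builtins__": {}}, {})

    # Traceable: every intermediate result is stored in structured form.
    state["trace"].append(
        {"step": len(state["trace"]) + 1, "expr": expression, "result": result}
    )
    with open(state_path, "w") as f:
        json.dump(state, f)
    return state


state_path = os.path.join(tempfile.mkdtemp(), "state.json")
run_step("2 + 3", state_path)
state = run_step("10 * 4", state_path)  # picks up the trace from the first call
```

The second call knows nothing about the first except what the state file holds, which is the sense in which progress lives in the running program rather than in the model.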

The paper splits long-running agent systems into three parts. First come the model's own capabilities, like reasoning and planning. Then there's the infrastructure the system provides. Finally, there's the code the agent writes on the fly: everything from test scripts and throwaway helper tools to reusable skills and executable workflows. The authors say these self-generated artifacts haven't gotten nearly enough research attention.

Three layers organize the field

At the first level, code bridges the model and its environment. Methods like Program-of-Thoughts or Chain of Code offload actual computation to executable programs instead of just describing it in words. Other systems, like Code as Policies, turn natural language instructions straight into robot control code.
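As a rough sketch of the offloading idea, in the spirit of Program-of-Thoughts but not the method's actual implementation, the model's output can be treated as a small program whose result the harness reads back. The string `generated_program`, the helper `execute_program`, and the convention of storing the result in a variable called `answer` are assumptions made for this example.

```python
# Hypothetical model output: a short program instead of a prose calculation.
generated_program = """
principal = 1000
rate = 0.05
years = 3
answer = principal * (1 + rate) ** years
"""


def execute_program(source: str) -> float:
    """Run generated code and return the value it stored in `answer`."""
    namespace: dict = {}
    # Builtins are stripped to keep the example small; this is not a
    # real sandbox and should not be used on untrusted code.
    exec(source, {"__builtins__": {}}, namespace)
    return namespace["answer"]


result = execute_program(generated_program)
```

The arithmetic is done by the interpreter, so the model only has to get the program right, not the number.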

Diagram of the plan-execute-verify loop with four building blocks: static analysis, sandboxed execution, deterministic verification, and permissioned state transitions from read-only to full access.
Reliability comes from clearly regulated state transitions in a controlled loop around the model. | Image: Ning et al.

The second level covers what keeps an agent reliable across many steps. That means planning, memory, tool use, and a recurring cycle of plan, execute, and verify. The cycle replaces one-off troubleshooting with systematic checks. Plans spell out what the agent intends to change. Execution runs in sandboxed environments with defined permissions. A verification step then decides whether the result gets accepted, revised, or kicked to a human reviewer.
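The cycle could look roughly like the following Python sketch. The names `Plan` and `plan_execute_verify` are hypothetical, copying the state dictionary stands in for a real sandbox, and the fixed list of plans stands in for revisions a model would propose.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Plan:
    description: str                 # what the agent intends to change
    action: Callable[[dict], dict]   # applies the change to a copy of the state


def plan_execute_verify(state: dict, plans: list, verify: Callable[[dict], bool]):
    """Try each plan in order and return (state, decision)."""
    for plan in plans:
        # Execute on a copy so a rejected attempt never touches real state.
        candidate = plan.action(dict(state))
        # Verify: a deterministic check decides acceptance.
        if verify(candidate):
            return candidate, f"accepted: {plan.description}"
        # Otherwise fall through to the next (revised) plan.
    # No plan passed verification, so hand the task to a person.
    return state, "escalated to human reviewer"


initial = {"version": 1, "tests_pass": False}
plans = [
    Plan("bump version only", lambda s: {**s, "version": 2}),
    Plan("bump version and fix test", lambda s: {**s, "version": 2, "tests_pass": True}),
]
final_state, decision = plan_execute_verify(initial, plans, lambda s: s["tests_pass"])
```

The first plan fails verification and is discarded without side effects; the second is accepted. With no passing plan, the original state is returned unchanged along with an escalation decision.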

The third level is about multiple agents working together. Code collections, tests, and execution logs become a shared workspace where specialized roles like managers, planners, coders, reviewers, and testers split the work. Systems like ChatDev and MetaGPT put this into practice, and according to the researchers it's already shipping in real products. Claude Code can now farm out pull request reviews to a whole team of AI agents that scan for bugs, security flaws, and regressions in parallel without being able to approve changes themselves.
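A toy version of parallel, review-only agents might look like this in Python. It is a sketch under stated assumptions, not how Claude Code works: the three reviewer functions are simple string checks standing in for model calls, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


# Each reviewer is a stand-in for a specialized agent. They only return
# findings; nothing here gives them a way to approve or merge a change.
def bug_reviewer(diff: str) -> list[str]:
    return ["possible division by zero"] if "/ 0" in diff else []


def security_reviewer(diff: str) -> list[str]:
    return ["eval called on input"] if "eval(" in diff else []


def regression_reviewer(diff: str) -> list[str]:
    return ["test removed"] if "-def test_" in diff else []


REVIEWERS = {
    "bugs": bug_reviewer,
    "security": security_reviewer,
    "regressions": regression_reviewer,
}


def review_in_parallel(diff: str) -> dict[str, list[str]]:
    """Run all reviewers concurrently over the same shared diff."""
    with ThreadPoolExecutor(max_workers=len(REVIEWERS)) as pool:
        futures = {name: pool.submit(fn, diff) for name, fn in REVIEWERS.items()}
        return {name: future.result() for name, future in futures.items()}


diff = "+result = eval(user_input)\n-def test_parse():\n"
findings = review_in_parallel(diff)
```

The shared artifact is the diff itself, and the approval decision stays outside the reviewers, which mirrors the separation of roles described above.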

Diagram of multi-agent orchestration showing specialized roles - manager, planner, coder, reviewer, tester, executer, and verifier - with a shared code workspace and various collaboration topologies.
At the third level, specialized agents split the work through a shared code workspace and coordinate tests and execution protocols. | Image: Ning et al.

Production systems already follow this pattern

The authors point to commercial products as examples. Anthropic's Claude Code ties together the local terminal, dev environment, and browser into one workflow where the agent edits files, runs commands, and has to follow permission rules. OpenAI's Codex and GitHub Copilot's coding agents move similar workflows to managed cloud environments, bundling changes through traceable pull request outputs.

How much this layer matters became obvious by accident when Anthropic leaked roughly 500,000 lines of Claude Code's source code. Buried in there was a "dreaming" function for task consolidation and other tricks for steering models as coding agents. Anthropic later got more than 8,000 copies and forks yanked from GitHub through a copyright takedown.

Other AI labs are catching on. DeepSeek plans to go head-to-head with Claude Code and Codex through its own product, DeepSeek Code, and is building a dedicated "Harness" team in Beijing to handle everything beyond the model, from tool use to planning to storage. The team's core formula is that model plus harness equals AI agent.

These production systems are also turning into training data for the next round of models. Cursor's Composer trains with continuous reinforcement learning on real usage traces. OpenAI's Codex-1, GPT-5-Codex, and GPT-5.1-Codex-Max are trained specifically on long, multi-step coding sessions that match the Codex workflow. The line between agent and environment is itself becoming a layer that learns.

Overview of five application domains for code as agent harness with examples, including code assistants like Claude, Codex, and OpenClaw, as well as GUI/OS agents, scientific discovery, personalization, and embodied robot agents.
The same pattern shows up across five domains, from coding assistants to GUI control and robotics. | Image: Ning et al.

When the agent starts tweaking its own environment

Several research systems treat the harness itself as something to optimize. AutoHarness auto-generates code that filters out unauthorized actions, while Meta-Harness systematically hunts for better harness variants by using previous versions, their evaluations, and execution logs as a search space. Other approaches dig through telemetry data to revise individual components. Meta's hyperagents go further still, combining task resolution and self-modification in an editable program that optimizes the improvement loop itself.

But the authors flag several open problems holding the field back: more meaningful evaluations beyond raw success rates, checking the substance of results when tests alone don't cut it, harness self-improvement without regressions, shared state across multiple agents, human oversight, and extending to environments with image or sensor data like GUI agents and robots.

They're especially blunt about whether current test criteria are even good enough. Tests can be incomplete, and test programs for graphical interfaces can miss bad intermediate steps. Simulators paper over physical risks. A harness could breed false confidence precisely because it gives visible feedback, and the green checkmark doesn't mean the code is safe. The authors suggest every accepted action should come with docs that spell out which tests actually ran, which areas stayed untested, and which risks remain.
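One way to picture such a record is a small data structure attached to each accepted action. The sketch below is an assumption about what it might contain, based only on the three items the authors list; the class name `AcceptanceRecord`, its fields, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AcceptanceRecord:
    """Hypothetical record attached to an accepted agent action."""
    action: str
    tests_run: list[str]
    untested_areas: list[str] = field(default_factory=list)
    residual_risks: list[str] = field(default_factory=list)

    @property
    def has_gaps(self) -> bool:
        # Passing tests alone are not enough: any untested area or
        # remaining risk means the acceptance is only partial.
        return bool(self.untested_areas or self.residual_risks)


record = AcceptanceRecord(
    action="refactor payment retry logic",
    tests_run=["test_retry_success", "test_retry_limit"],
    untested_areas=["behavior under network timeout"],
    residual_risks=["duplicate charge if retry overlaps a slow response"],
)
```

Here every listed test could pass while `record.has_gaps` is still true, which keeps the untested parts visible instead of hiding them behind a pass signal.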

Reliability in autonomous coding agents doesn't come from better repair prompts but from tightly regulated state transitions in a controlled loop around the model, the researchers argue.

Source: arXiv