
Researchers let Claude Code discover AI scaling algorithms that humans probably wouldn't have designed


Instead of writing rules for more efficient AI reasoning themselves, researchers let a coding agent hunt for better control algorithms in a simulated environment. The result beats established methods while burning far less compute.

Test-time scaling (TTS) is meant to make large language models perform better by letting them spend more compute on a response, say, by running several solution paths in parallel or extending chains of thought. Until now, human-written rules almost always dictated when a model kicks off a new solution path, doubles down on a promising one, or kills it.

A research team from UMD, UVA, WUSTL, UNC, Google, and Meta flips that with AutoTTS. Humans don't write the algorithm. Instead, they build the playground where an AI agent figures out algorithms on its own.

The paper argues that many known methods are really just special cases in a shared control space defined by width (how many solution paths run at once) and depth (how far each one goes). So why, the authors ask, do researchers keep plotting paths through this space by hand instead of letting a machine search it?
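To make the width/depth picture concrete, here is a minimal sketch that treats a strategy as a list of steps through that space. The `Step` type, the `total_tokens` helper, and all token counts are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    width: int  # number of solution paths alive during this step
    depth: int  # tokens each live path is extended by during this step


def total_tokens(schedule):
    """Compute spent: sum over steps of (live paths x tokens added per path)."""
    return sum(step.width * step.depth for step in schedule)


# Hypothetical schedules with a 1,000-token budget per path.
# Self-consistency style: 64 paths, all run to full depth.
wide = [Step(width=64, depth=1000)]
# Depth only: a single path extended in four chunks.
deep = [Step(width=1, depth=250)] * 4
# Start wide, then prune progressively.
prune = [Step(64, 250), Step(16, 250), Step(4, 250), Step(1, 250)]

for name, schedule in [("wide", wide), ("deep", deep), ("prune", prune)]:
    print(name, total_tokens(schedule))
```

Under these made-up numbers, the three schedules cost 64,000, 1,000, and 21,250 tokens, which is the kind of trade-off a controller in this space has to navigate.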

Simulating the search keeps costs down

At the core of AutoTTS sits an offline environment. For each task, the team pre-generates several solution paths from the language model and stores them. A new control algorithm decides how to spend compute based on data that's already there. That way, thousands of variants can run without firing up the actual language model each time.
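The replay idea can be sketched in a few lines. This is only an illustration of the principle described above, assuming each stored path is a list of `(chunk_tokens, interim_answer)` pairs; the class name `ReplayEnv` and its methods are hypothetical and not the authors' interface.

```python
class ReplayEnv:
    """Serves pre-generated solution paths; it never calls a language model."""

    def __init__(self, paths):
        # paths: one list per stored path, each a list of
        # (chunk_tokens, interim_answer) tuples (assumed storage format).
        self.paths = paths
        self.cursor = {}      # path index -> number of chunks revealed so far
        self.tokens_used = 0  # simulated compute charged to the controller

    def open_path(self):
        """Start a new path; return its index, or None if none are left."""
        idx = len(self.cursor)
        if idx >= len(self.paths):
            return None
        self.cursor[idx] = 0
        return idx

    def extend(self, idx):
        """Reveal the next stored chunk of a path and charge its token cost."""
        pos = self.cursor[idx]
        if pos >= len(self.paths[idx]):
            return None  # path is already fully revealed
        tokens, answer = self.paths[idx][pos]
        self.cursor[idx] = pos + 1
        self.tokens_used += tokens
        return answer


stored = [[(100, "A"), (100, "B")], [(50, "A")]]
env = ReplayEnv(stored)
first = env.open_path()
print(env.extend(first), env.tokens_used)
```

Because every call only reads stored data, a candidate controller can be scored against the same paths over and over at almost no cost.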

Diagram of the AutoTTS framework. Left: the human side, choosing between a hand-written TTS strategy and environment design. Center: the agent-side loop of proposing, evaluating in the offline replay environment, getting feedback from scaling curves and execution logs, and storing results in a history. Right: a chart comparing accuracy-cost curves of AutoTTS and three hand-written methods on Qwen3-1.7B/AIME25, plus the note about $39.90 search cost and 160 minutes runtime.
AutoTTS moves the human role from algorithm design to environment design: instead of defining branching, pruning, and stopping rules, the researchers define states, actions, and feedback. An agent then searches for a controller on its own inside that environment. | Image: Zheng et al.

Claude Code does the searching. Over several rounds, the agent reviews what came before, spots weaknesses in earlier proposals, and writes a new control algorithm directly in code. To stop the search from getting lost in thousands of tiny knobs, each proposal can only expose one high-level controller to the outside. That controller sets all the other thresholds on its own. Full logs from each run also show the agent where earlier attempts blew compute for nothing.
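One way to picture the single-controller constraint: one exposed knob, with every internal threshold derived from it. The knob name `effort`, the three derived settings, and their ranges are assumptions made up for this sketch; the article does not describe the actual parameters.

```python
def derive_thresholds(effort):
    """Map one exposed knob in [0, 1] to internal settings (all hypothetical)."""
    if not 0.0 <= effort <= 1.0:
        raise ValueError("effort must be in [0, 1]")
    return {
        # Maximum number of parallel paths: 4 at the lean end, 64 at the top.
        "max_width": 4 + round(60 * effort),
        # Share of agreeing answers required before stopping: 0.60 to 0.95.
        "stop_confidence": 0.6 + 0.35 * effort,
        # Rounds a diverging path is tolerated before it is dropped: 1 to 4.
        "prune_patience": 1 + round(3 * effort),
    }


print(derive_thresholds(0.0))
print(derive_thresholds(1.0))
```

With only one value to tune, a search has far fewer ways to overfit individual thresholds to the stored tasks.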

Diagram with width (number of parallel solution paths) on the X-axis and depth (length of each path) on the Y-axis. Five established TTS methods are plotted as different paths through this space: Self-Consistency with 64 paths at max depth, ASC and ESC along the width axis, Answer Consistency along the depth axis on a single path, ST-BoN switching from width to depth, and Parallel-Probe starting wide then progressively pruning.
Many established test-time scaling methods map to different paths through the same control space of width and depth. AutoTTS searches for new paths in exactly this space. | Image: Zheng et al.

Agent-written algorithm outperforms human-designed ones

On math benchmarks like AIME and HMMT, the algorithm the agent came up with gets better accuracy per unit of compute than established methods. The lean setting slashes token usage by about 70 percent compared to standard self-consistency, which just generates 64 answers in parallel and picks the winner by majority vote. Accuracy holds steady.
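For reference, the self-consistency baseline and the size of the saving can be sketched as follows. The sample answers and the 1,000-token path length are invented for illustration; only the 64 paths and the roughly 70 percent reduction come from the article.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer and its share of the votes."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# 64 hypothetical final answers: 40 agree on "42", 24 on "41".
answers = ["42"] * 40 + ["41"] * 24
winner, share = majority_vote(answers)

# Token arithmetic under an assumed 1,000 tokens per path.
baseline_tokens = 64 * 1000
lean_tokens = baseline_tokens * (100 - 70) // 100  # about 70 percent fewer

print(winner, share, baseline_tokens, lean_tokens)
```

In this toy case the baseline spends 64,000 tokens and a 70 percent cut leaves 19,200, roughly the budget of 19 full-length paths instead of 64.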

The algorithm also carries over to a different model (DeepSeek-R1-Distill-Llama-8B) and a non-math benchmark (GPQA-Diamond). The whole discovery run cost about $40 and took 160 minutes.

Four line charts with a logarithmic X-axis (token usage in thousands) and Y-axis (accuracy in percent). ASC, ESC, Parallel-Probe, and AutoTTS are compared on Qwen3-0.6B/AIME25, Qwen3-4B/HMMT25, Qwen3-1.7B/AIME25, and Qwen3-8B/HMMT25. In all four charts, the red AutoTTS curve with star markers runs above or level with the three comparison methods.
Across four model sizes and two math benchmarks, the algorithm AutoTTS found delivers better or comparable accuracy at lower token usage than hand-written methods. | Image: Zheng et al.

A logic humans probably wouldn't have come up with

More interesting than the raw numbers is how the discovered program actually works. It tracks how the model's confidence shifts over several rounds, whereas other methods stop the moment the answers tip into a majority.

If confidence barely budges, the algorithm opens more solution paths. If it climbs quickly, it skips new ones. Solution paths whose interim result lines up with the current majority get extra compute. The algorithm only drops paths that diverge if they keep heading the wrong way over multiple rounds.
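The rules in the paragraph above can be written down as a rough sketch. This is a reading of the article's description, not the program the agent discovered: the function name `plan_round`, the `flat_eps` and `patience` values, and the choice to also widen when confidence falls are all assumptions.

```python
def plan_round(confidence_history, paths, majority, flat_eps=0.02, patience=3):
    """Return (open_new, boost_ids, drop_ids) for the next round.

    paths maps a path id to {"answer": interim answer,
    "off_rounds": consecutive rounds spent off the majority}.
    The "off_rounds" counters are updated in place.
    """
    # Trend: change in majority confidence since the previous round.
    if len(confidence_history) >= 2:
        trend = confidence_history[-1] - confidence_history[-2]
    else:
        trend = 0.0
    # Flat (or, by assumption, falling) confidence: open more paths.
    # Quickly rising confidence: skip new ones.
    open_new = trend < flat_eps

    boost, drop = [], []
    for pid, state in paths.items():
        if state["answer"] == majority:
            state["off_rounds"] = 0
            boost.append(pid)  # agrees with the majority: extra compute
        else:
            state["off_rounds"] += 1
            if state["off_rounds"] >= patience:
                drop.append(pid)  # diverged for several rounds in a row
    return open_new, boost, drop


paths = {
    0: {"answer": "42", "off_rounds": 0},
    1: {"answer": "7", "off_rounds": 2},
    2: {"answer": "9", "off_rounds": 0},
}
print(plan_round([0.50, 0.51], paths, majority="42"))
```

In this example confidence is nearly flat, so the sketch asks for new paths, boosts path 0, and drops path 1 after its third round off the majority, while path 2 is left running for now.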

The authors say this kind of coordination would have been nearly impossible to design by hand. An ablation study shows how much depends on two design choices. Drop the single high-level controller, and the agent falls back on extreme shortcuts that save tons of compute in testing but tank accuracy on new tasks. Take away the detailed logs, and the discovered algorithm eats more compute at worse accuracy, so a bare final result just isn't enough to figure out what went wrong.

From writing algorithms to building search spaces

The authors put AutoTTS in a line with work like FunSearch, AlphaEvolve, and ADAS, all of which use language models as program searchers. What's new here is applying that idea to test-time scaling, which was mostly done by hand before.

The current version only covers the trade-off between width and depth. It can't handle more complex structures like tree searches. How good the discovery turns out also depends on the coding agent. The authors don't say whether open-source alternatives would work just as well.

The bigger takeaway is that the work shifts where humans come in: instead of inventing the rules themselves, researchers set up the search environment those rules live in. The actual strategy then emerges as code that a language model writes and refines.

As early as 2024, researchers from Hugging Face showed that small language models can match much larger ones through smart test-time compute scaling, though with search strategies designed by hand. Meta and partners recently introduced hyperagents, AI systems that optimize their own improvement process.
