Ad
Skip to content

Alibaba's latest AI model ran autonomously for 35 hours to optimize code for its own custom chip

Image description
Alibaba

Alibaba's Qwen team has released Qwen3.7-Max, a proprietary model designed for agent-based tasks. In a real-world test, the model ran a fully autonomous kernel optimization for 35 hours straight.

Like its predecessors Qwen3-Max and Qwen3.6-Plus, the new Max version is only available through the Alibaba Cloud Model Studio API. Alibaba used to release its Qwen models as open source, but that's changed. The last open flagship was Qwen3.5-397B-A17B from February 2026.

Qwen3.7-Max supports OpenAI- and Anthropic-compatible interfaces and plugs right into Claude Code, OpenClaw, or Qwen Code. The Qwen team says the model targets four use cases: working as a coding agent from front-end prototypes to complex multi-file software projects, automating office tasks with external tools, running autonomously for long stretches, and performing consistently across different agent frameworks.

A kernel experiment that ran for 35 hours

Qwen3.7-Max was tasked with optimizing a hardware-based attention kernel for the open-source inference software SGLang. The hardware was a cloud instance with T-Head-ZW-M890 accelerators, an AI chip platform from Alibaba's own semiconductor arm.

The Qwen team says the model had never seen this chip architecture during training. It started with no measurement data, no hardware docs, and no sample code. The only thing it had to work with was the existing reference implementation, written in the Triton programming language.

Over about 35 hours of nonstop autonomous work, the model ran 432 kernel tests with 1,158 total tool calls. It compiled, measured, and revised the code in loops, caught compilation errors, and tracked down performance bottlenecks on its own. The result, according to the Qwen researchers, is an average 10x speedup over the reference implementation.

Competitor models came up well short in the same setup. GLM 5.1 hit a 7.3x speedup, Kimi K2.6 got to 5x, DeepSeek V4 Pro managed 3.3x, and the predecessor Qwen3.6-Plus barely moved the needle at 1.1x. Models that quit early ended their sessions on their own after five straight rounds with no tool calls. On the standardized KernelBench L3 benchmark, Qwen3.7-Max claims to produce accelerated kernels 96 percent of the time, just behind Anthropic's Opus 4.6 at 98 percent.

Training splits task, tool environment, and validator

Qwen3.7-Max builds on a training approach the team first rolled out with Qwen3.5. Each training task breaks into three independent pieces: the actual task, the tool environment, and the validator that checks the result. These can be mixed and matched freely.

Two bar charts for the benchmarks QwenClawBench and CoWorkBench. Claude Opus 4.6, Qwen3.6-Plus and Qwen3.7-Max are compared. Qwen3.7-Max achieves values between 64.3 and 70.7 on QwenClawBench and 66.0 to 68.3 on CoWorkBench in three different agent environments (OpenClaw, Claude Code, Hermes), while Qwen3.6-Plus is significantly lower at 57.2 and 64.5.
Cross-harness test: Qwen3.6-Plus swings depending on which agent framework runs it, but Qwen3.7-Max posts nearly identical scores across OpenClaw, Claude Code (CC), and Hermes, according to the team - and beats Claude Opus 4.6 on QwenClawBench. | Image: Qwen

The same task gets practiced across different tool environments and checked with different test methods. That's meant to force the model to pick up strategies that work everywhere, not just shortcuts tied to one specific setup. On QwenClawBench and CoWorkBench, Qwen3.7-Max holds steady no matter which test environment it's dropped into, the team says.

The model polices its own training for reward hacking

The Qwen team also put Qwen3.7-Max to work as a watchdog during its own training. The model watched training runs for software engineering tasks for over 80 hours and ran more than 10,000 checks. It hunted for tricks the model being trained might pull to game its rewards, like grabbing correct answers straight off GitHub. Qwen3.7-Max wrote 13 new detection rules and flagged 1,618 cases.

Diagram of 86 hours of autonomous runtime with two curves. The green line shows the cumulative detected cases of reward hacking, which rises to around 1,600, while the blue line shows the RL performance. Green stars mark new detection rules up to rule 13, such as for copying external source files, the Gerrit API search or retrieving direct patch URLs. Headers indicate 86 hours of runtime, 13,952 trajectories, 11,196 test calls and 1,618 detected cases.
Over 86 hours of autonomous runtime, Qwen3.7-Max checked 13,952 trajectories and caught 1,618 attempts where the model being trained gamed its rewards, according to the team. Detected cases climbed with each new detection rule (green stars). | Image: Qwen

One year in simulation tests long-term planning

To gauge long-term planning, the team used YC-Bench, a benchmark that simulates a startup's full one-year life cycle. The model has to manage staff across hundreds of decision rounds, review contracts, spot bad-faith customers, and keep profit margins healthy against rising labor costs.

Qwen3.7-Max pulled in $2.08 million in total revenue and wrapped up 237 tasks. Its predecessor, Qwen3.6-Plus, hit $1.05 million. Qwen3.5-Plus managed just $352,000.

Across most benchmarks, Qwen3.7-Max trades blows with Claude Opus 4.6 Max, Kimi K2.6 Thinking, GLM-5.1 Thinking, and DeepSeek V4 Pro Max. On SWE-Verified, the model scored 80.4, nearly tied with Opus 4.6 Max (80.8) and DeepSeek V4 Pro Max (80.6). On the math and science benchmarks GPQA Diamond (92.4), HMMT 2026 February (97.1), and Apex (44.5), Qwen3.7-Max tops the provider's own comparison table.

Grid of twelve bar charts comparing Qwen3.7-Max with Qwen3.6-Plus, DeepSeek V4 Pro Max, GLM-5.1, Kimi K2.6 and Claude Opus 4.6 Max. Qwen3.7-Max achieves top scores in Terminal-Bench 2.0 (69.7), SWE-bench Pro (60.6), SWE-bench Multilingual (78.3), MCP-Atlas (76.4), HLE (41.4), Apex Math Reasoning (44.5) and IFBench (79.1), among others. Claude Opus 4.6 Max is ahead in NL2Repo (47.6), ClawEval (70.4) and CoWorkBench (68.2).
Qwen3.7-Max generally leads or ties with Claude Opus 4.6 Max, DeepSeek V4 Pro Max, GLM-5.1, Kimi K2.6, and its own predecessor Qwen3.6-Plus across twelve benchmarks, according to the provider. Claude Opus 4.6 still wins on NL2Repo, ClawEval, and CoWorkBench. | Image: Qwen
As the number of training environments grows, Qwen3.7-Max-Thinking climbs the rankings across eight benchmarks, passing DeepSeek V4 Pro Max, GLM-5.1, and Kimi K2.6 - but still sitting just below Claude 4.6 Opus Max, according to the Qwen team. | Image: Qwen

Some of those benchmarks are homegrown, though. QwenWebDev, QwenClawBench, CoWorkBench, and QwenWorldBench all come from the Qwen team itself. Every result here is self-reported. A closer look at scaling dynamics and methodology is coming in an upcoming technical report.

Beyond the usual use cases, the team also shows off Qwen3.7-Max steering a four-legged robot. Using its own robotics framework and a paired navigation model, the language model guides the robot through physical spaces.

AI News Without the Hype – Curated by Humans

Subscribe to THE DECODER for ad-free reading, a weekly AI newsletter, our exclusive "AI Radar" frontier report six times a year, full archive access, and access to our comment section.

Read on for the full picture.
Subscribe for hype-free coverage.

  • Access to all THE DECODER articles.
  • Read without distractions – no Google ads.
  • Access to comments and community discussions.
  • Weekly AI newsletter.
  • 6 times a year: “AI Radar” – deep dives on key AI topics.
  • Up to 25 % off on KI Pro online events.
  • Access to our full ten-year archive.
  • Get the latest AI news from The Decoder.
Subscribe to The Decoder