Japanese company Sakana AI built an AI agent that can tackle complex optimization problems used in industry. In a live competition, their AI went head-to-head with more than 1,000 human programmers.
Sakana AI's ALE agent landed in 21st place at the 47th AtCoder Heuristic Contest, proving AI systems can hold their own against human experts in demanding programming challenges. AtCoder runs Japanese programming competitions where participants solve complex mathematical problems through code. These "NP-hard" problems don't have known efficient solutions, making them particularly challenging.
The tasks mirror real industrial headaches: planning delivery routes, organizing work shifts, managing factory production, and balancing power grids. Human contestants typically spend weeks refining their solutions.
The work builds on ALE-Bench, which Sakana AI calls the first benchmark for score-based algorithmic programming. The benchmark draws on 40 tough optimization problems from past AtCoder contests. Unlike traditional tests that simply mark answers right or wrong, ALE-Bench demands continuous improvement over extended periods.

AI agent mixes expert knowledge with smart search
The ALE agent runs on Google's Gemini 2.5 Pro and combines two main strategies. First, it bakes expert knowledge about proven solution methods directly into its instructions. This includes techniques like simulated annealing, which tests random changes to a solution and occasionally accepts worse results so the search can escape local optima.
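To make that concrete, here is a minimal, generic simulated annealing loop in Python. It is an illustrative sketch of the technique, not Sakana's code; `score` and `random_neighbor` are hypothetical callbacks a caller would supply for a particular optimization problem.

```python
import math
import random

def simulated_annealing(initial, score, random_neighbor,
                        t_start=1.0, t_end=0.01, iterations=10_000):
    """Generic simulated annealing loop (illustrative sketch only).

    score(sol)           -> numeric quality, higher is better
    random_neighbor(sol) -> a slightly modified copy of sol
    """
    current, current_score = initial, score(initial)
    best, best_score = current, current_score

    for i in range(iterations):
        # Temperature decays geometrically from t_start to t_end.
        t = t_start * (t_end / t_start) ** (i / iterations)
        candidate = random_neighbor(current)
        candidate_score = score(candidate)
        delta = candidate_score - current_score

        # Always accept improvements; accept worse candidates with a
        # probability that shrinks as the temperature drops.
        if delta >= 0 or random.random() < math.exp(delta / t):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score

    return best, best_score
```

The occasional acceptance of a worse candidate is what lets the search climb out of a locally good but globally mediocre solution.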

Second, the system uses a systematic search algorithm called "best-first search" that always picks the most promising partial solution and develops it further. The agent expands this with a "beam search"-like approach, pursuing 30 different solution paths simultaneously. It also uses a "tabu search" mechanism that remembers previously tested solutions to avoid repeating them.
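A rough sketch of how such a beam-style, best-first search with tabu memory can be wired together is shown below. This is a generic illustration, not the ALE agent's actual implementation; `score` and `refine` are hypothetical stand-ins for evaluating a candidate program and generating modified variants of it, and the beam width of 30 simply mirrors the figure quoted above.

```python
import heapq
import itertools

def beam_style_search(initial_solutions, score, refine,
                      beam_width=30, rounds=10):
    """Illustrative best-first/beam-style search with a simple tabu memory.

    score(sol)  -> numeric quality, higher is better
    refine(sol) -> iterable of modified variants of sol
    Solutions are assumed hashable (e.g. source-code strings).
    """
    counter = itertools.count()   # tie-breaker so the heap never compares solutions
    seen = set()                  # "tabu" memory of already-tested solutions
    beam = []
    for sol in initial_solutions:
        seen.add(sol)
        heapq.heappush(beam, (-score(sol), next(counter), sol))

    for _ in range(rounds):
        # Keep only the beam_width most promising candidates.
        frontier = [heapq.heappop(beam)
                    for _ in range(min(beam_width, len(beam)))]
        beam = list(frontier)
        heapq.heapify(beam)
        # Expand each kept candidate, skipping anything tried before.
        for _, _, sol in frontier:
            for variant in refine(sol):
                if variant in seen:
                    continue
                seen.add(variant)
                heapq.heappush(beam, (-score(variant), next(counter), variant))

    neg_best, _, best = min(beam)
    return best, -neg_best
```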
In benchmark testing, the strongest standalone model, o4-mini-high, hit 1,411 points with sequential improvements. Under identical conditions, GPT-4.1 mini scored 1,016 points, DeepSeek-R1 reached 1,150 points, and Gemini 2.5 Pro achieved 1,198 points.
The full ALE agent beat these results with 1,879 points, landing it in the top 6.8 percent. On one specific problem, the agent scored 2,880 points, which would have earned 5th place in the original competition.

One major difference between the AI and human contestants shows up in their approach. While humans might test a dozen different solutions during a four-hour competition, Sakana's AI can cycle through around 100 versions in the same timeframe. All told, the ALE agent churned out hundreds to thousands of potential solutions, something no human could match.
Sakana AI released ALE-Bench as a Python library with a built-in "code sandbox" for safe testing. The framework works with C++, Python, and Rust, running on standard Amazon cloud infrastructure. The company developed the benchmark with AtCoder Inc. The data from 40 competition problems is available on Hugging Face, and the code is publicly accessible on GitHub.