With the Darwin-Gödel Machine (DGM), Sakana AI introduces an AI system that can iteratively improve itself through self-modification and open-ended exploration. Early results look promising, but the method is still expensive to run.
Japanese startup Sakana AI and researchers at the University of British Columbia have developed the Darwin-Gödel Machine (DGM), an AI framework designed to evolve on its own. Rather than optimizing for fixed objectives, DGM draws inspiration from biological evolution and scientific discovery, using open-ended search and continuous self-modification to generate new solutions.

At the heart of DGM is an iterative process. An AI agent rewrites its own Python code to produce new versions of itself—each with different tools, workflows, or strategies.
These variants are evaluated in multiple stages on benchmarks like SWE-bench and Polyglot, which test agents on real-world programming tasks. The best-performing agents are saved in an archive, forming the basis for future iterations.
This approach, known as "open-ended search," creates something like an evolutionary family tree. It also helps avoid local optima by allowing the system to explore less promising variants that could later turn out to be useful stepping stones.
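In pseudocode, the loop looks roughly like the sketch below. This is a minimal illustration of the idea rather than Sakana AI's implementation: the function names, the archive structure, and the selection weights are all assumptions.

```python
import random

# Minimal sketch of a DGM-style self-improvement loop (illustrative only).
# An archive keeps every viable agent variant; parents are sampled from it,
# mutated by rewriting their own code, scored on a benchmark, and archived.

def self_modify(agent_code: str) -> str:
    # Stand-in: in DGM, a foundation model rewrites the agent's own Python
    # source to change its tools, workflow, or strategy.
    return agent_code + f"\n# variant {random.randint(0, 1_000_000)}"

def evaluate(agent_code: str) -> float:
    # Stand-in: staged evaluation on coding benchmarks such as SWE-bench
    # or Polyglot, returning the fraction of tasks solved.
    return random.random()

def sample_parent(archive: list[dict]) -> dict:
    # Open-ended search: weaker variants keep a small selection weight, so
    # they can still serve as stepping stones out of local optima.
    weights = [0.1 + entry["score"] for entry in archive]
    return random.choices(archive, weights=weights, k=1)[0]

def run_dgm(initial_agent_code: str, iterations: int = 80) -> dict:
    archive = [{"code": initial_agent_code, "score": evaluate(initial_agent_code)}]
    for _ in range(iterations):
        parent = sample_parent(archive)
        child = self_modify(parent["code"])
        archive.append({"code": child, "score": evaluate(child)})
    # The whole archive is the real output; here we just report the best variant.
    return max(archive, key=lambda entry: entry["score"])

if __name__ == "__main__":
    best = run_dgm("# initial coding agent", iterations=20)
    print(f"best score: {best['score']:.2f}")
```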

Self-modification leads to major performance gains
In testing, DGM's performance on SWE-bench increased from 20 to 50 percent. SWE-bench evaluates how well AI systems can resolve real GitHub issues using Python.
On the multilingual Polyglot benchmark, which measures performance across different programming languages, DGM improved from 14.2 percent to 30.7 percent—surpassing open-source agents like Aider.
Even with those gains, DGM's top score of 50 percent on SWE-bench still falls just behind the best open-source agent, OpenHands + CodeAct v2.1, which reached 51 percent. Some proprietary systems performed even better.

Along the way, the system developed several key features on its own—new editing tools, a patch verification step, the ability to evaluate multiple solution proposals, and an error memory to avoid repeating past mistakes.
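To make the last of those concrete: an "error memory" can be as simple as a lookup of past failures that the agent consults before retrying an action. The sketch below is an illustrative assumption, not DGM's actual code.

```python
from collections import defaultdict

class ErrorMemory:
    """Hypothetical error memory: remembers why past actions failed so the
    agent can avoid repeating the same mistake (illustrative sketch only)."""

    def __init__(self) -> None:
        self._failures: dict[str, list[str]] = defaultdict(list)

    def record(self, action: str, error: str) -> None:
        # Store the error message under a signature of the failed action.
        self._failures[action].append(error)

    def known_failures(self, action: str) -> list[str]:
        # Return earlier error messages so the agent can adjust its plan
        # before attempting the same action again.
        return list(self._failures.get(action, []))
```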
These upgrades didn't just help the original Claude 3.5 Sonnet model. They also transferred to other foundation models like Claude 3.7 Sonnet and o3-mini. Similar performance boosts appeared when switching to other programming languages, including Rust, C++, and Go.
Letting agents rewrite their own code introduces new risks. Recursive modifications can make behavior unpredictable. To manage that, DGM uses sandboxing, strict modification limits, and full traceability for every change.
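What that can look like in practice, in a very reduced form: run each candidate agent in an isolated process with a hard timeout and keep an append-only log of every code change. The sketch below is a rough illustration; DGM's actual sandbox and audit trail are not detailed here.

```python
import hashlib
import json
import subprocess
import time

def run_in_sandbox(agent_path: str, timeout_s: int = 600) -> subprocess.CompletedProcess:
    # Very light "sandbox": a separate process with a hard runtime limit.
    # A serious setup would add containerization plus network and filesystem
    # isolation on top of this.
    return subprocess.run(
        ["python", agent_path],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )

def log_modification(log_path: str, old_code: str, new_code: str) -> None:
    # Traceability: append a hash of every before/after code state, so any
    # self-modification can later be reconstructed and audited.
    entry = {
        "time": time.time(),
        "old_sha256": hashlib.sha256(old_code.encode()).hexdigest(),
        "new_sha256": hashlib.sha256(new_code.encode()).hexdigest(),
    }
    with open(log_path, "a") as fh:
        fh.write(json.dumps(entry) + "\n")
```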
Sakana AI also sees this self-modification loop as a way to improve safety. In one test, DGM learned to detect hallucinations when using external tools and developed its own countermeasures, like flagging when an agent falsely claims to have run unit tests.
But there were also cases where the system deliberately removed those hallucination detection markers, an example of "objective hacking," where the system manipulates the evaluation without actually solving the problem.
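A toy version of the unit-test check mentioned above would simply compare what the agent claims against what the tool log actually records; the message patterns and log format below are assumptions for illustration.

```python
def claims_tests_ran(agent_message: str) -> bool:
    # Crude textual check for a claim like "I ran the tests and they pass".
    text = agent_message.lower()
    return "ran the tests" in text or "tests pass" in text

def tests_actually_ran(tool_log: list[str]) -> bool:
    # The tool log is assumed to hold the shell commands the agent executed.
    return any(cmd.startswith(("pytest", "python -m pytest")) for cmd in tool_log)

def flag_hallucination(agent_message: str, tool_log: list[str]) -> bool:
    # True means the agent claimed a test run that the log cannot confirm.
    return claims_tests_ran(agent_message) and not tests_actually_ran(tool_log)
```

Once a flag like this becomes part of the evaluation, a self-modifying system also has an incentive to remove or bypass the flag instead of fixing the behavior, which is exactly the failure mode described above.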
High costs, limited usability—for now
Running DGM doesn't come cheap. A single 80-iteration run on SWE-bench took two weeks and racked up around $22,000 in API costs. Most of that comes from the loop structure, staged evaluations, and the parallel generation of new agents each cycle. Until foundation models become far more efficient, DGM's practical applications will stay limited.
So far, the system's self-modifications have focused on tools and workflows. Deeper changes, such as to the training process or the underlying model itself, are left for future work. Over time, Sakana AI hopes DGM could serve as a blueprint for more general, self-improving AI. The code is available on GitHub.
Sakana AI has also explored other nature-inspired ideas. In a separate experiment, the company presented a concept where a model "thinks" in discrete time steps, similar to how the human brain processes information.