Google Research has introduced "nested learning," a new approach to designing AI models that aims to mitigate or even avoid "catastrophic forgetting" and support continual learning.
In their NeurIPS 2025 paper, Google researchers highlight a core problem: large language models can't form new long-term memories once training is over. They only retain what sits in their current context window or what was baked in during pretraining. Expanding the window or retraining just delays the problem, like treating amnesia with a bigger notepad.
Current models are mostly static after pretraining. They can perform the tasks they learned, but picking up new abilities beyond their context requires further training, and those updates tend to overwrite existing knowledge, the phenomenon known as catastrophic forgetting. More updates make it worse.
How nested learning borrows from the brain
Like many machine learning advances, nested learning is inspired by neuroscience. The brain runs at different speeds: fast circuits handle the present, slower ones consolidate important patterns into long-term memory.
Most experiences fade quickly; only a few become lasting memories, thanks to neuroplasticity—the brain’s ability to rewire itself while preserving essential information. The authors contrast this with current LLMs, whose knowledge remains limited to their context window or static pretraining.

Nested learning treats every component of an AI model, including the optimizer and the training algorithm itself, as a form of memory. Backpropagation stores associations between data and error signals, and the optimizer's state, such as momentum, acts as memory too. The Continuum Memory System (CMS) splits memory into modules that update at different rates, giving the model temporal depth.
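
The paper's actual CMS components are learned inside the architecture; the snippet below is only a minimal sketch of the multi-rate idea, assuming a hypothetical MultiTimescaleMemory class in PyTorch where each level is a simple linear associative memory updated on its own schedule. All names and update rules here are invented for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

class MultiTimescaleMemory(nn.Module):
    """Toy illustration of memory modules that update at different rates.

    Hypothetical sketch: each level is a linear associative map, and
    level i is only written every `periods[i]` steps, so fast levels
    track recent input while slow levels change rarely.
    """

    def __init__(self, dim: int, periods=(1, 4, 16)):
        super().__init__()
        self.periods = periods
        self.memories = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in periods])
        self.step = 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Read: combine the input with contributions from every memory level.
        return x + sum(mem(x) for mem in self.memories)

    @torch.no_grad()
    def write(self, x: torch.Tensor, lr: float = 0.01):
        # Write: only levels whose period divides the current step are
        # updated, mimicking fast and slow memory in one module.
        self.step += 1
        for period, mem in zip(self.periods, self.memories):
            if self.step % period == 0:
                # Outer-product (Hebbian-style) update that nudges the
                # level toward reproducing the current input pattern.
                mem.weight += lr * (x.T @ x) / x.shape[0]
```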

HOPE: Nested learning in practice
Google's HOPE architecture puts this into practice. HOPE builds on long-term memory modules called Titans, which store information based on how surprising it is to the model. It layers different types of memory and uses CMS blocks to handle larger context windows. Fast layers process live input, slower layers distill what's important for long-term storage, and the system can adapt its own update rules as it learns. That goes beyond the typical "pretrain and freeze" approach.
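
To make the "store it only if it's surprising" idea concrete, here is a minimal sketch, assuming a hypothetical SurpriseGatedMemory class in PyTorch. It is loosely inspired by the Titans-style memory described above, but the threshold gate and update rule are simplifications invented for illustration; in the real architecture these behaviors are learned, not hard-coded.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SurpriseGatedMemory(nn.Module):
    """Toy sketch: write to long-term memory only when the input is surprising.

    Hypothetical: the memory is a linear key->value map; an association is
    stored only if the memory's current prediction for it is poor (high
    "surprise"), so routine inputs are not written at all.
    """

    def __init__(self, dim: int, threshold: float = 1.0, lr: float = 0.1):
        super().__init__()
        self.table = nn.Linear(dim, dim, bias=False)  # associative key -> value map
        self.threshold = threshold
        self.lr = lr

    def read(self, key: torch.Tensor) -> torch.Tensor:
        return self.table(key)

    @torch.no_grad()
    def maybe_write(self, key: torch.Tensor, value: torch.Tensor) -> bool:
        # Surprise = how far the memory's prediction is from the new value.
        surprise = F.mse_loss(self.table(key), value)
        if surprise < self.threshold:
            return False  # unsurprising input, nothing is stored
        # Gradient-style correction that moves the memory toward
        # reproducing the surprising key -> value association.
        error = value - self.table(key)                      # (batch, dim)
        self.table.weight += self.lr * (error.T @ key) / key.shape[0]
        return True
```

In this toy version a fixed threshold decides what gets remembered; the appeal of the nested-learning framing is that such write rules can themselves be treated as learnable, slower-updating parts of the model.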

The team tested HOPE on language modeling and reasoning. At 1.3 billion parameters, trained on 100 billion tokens, HOPE outperformed Transformer++ as well as newer recurrent designs like RetNet and DeltaNet.

HOPE also performed better in long-context and needle-in-a-haystack tests, where the model has to find a specific piece of information in a large body of text. Model sizes in these tests ranged from 340 million to 1.3 billion parameters. HOPE's gains were consistent, and the authors say it can outperform both transformers and modern recurrent networks. An independent reproduction is available on GitHub.