
Google Research has introduced "nested learning," a new way to design AI models that aims to mitigate or even avoid "catastrophic forgetting" and to support continual learning.


In their NeurIPS 2025 paper, the Google researchers highlight a core problem: once training ends, large language models can't build new long-term memories. They retain only what sits in the current context window or what was absorbed during pretraining. Expanding the window or retraining merely delays the problem, like treating amnesia with a bigger notepad.

Current models are largely static after pretraining. They can perform the tasks they learned, but they can't pick up new abilities beyond their context, and updating them on new data tends to overwrite old knowledge, the failure known as catastrophic forgetting. The more updates, the worse it gets.

How nested learning borrows from the brain

Like many machine learning advances, nested learning is inspired by neuroscience. The brain runs at different speeds: fast circuits handle the present, slower ones consolidate important patterns into long-term memory.


Most experiences fade quickly; only a few become lasting memories, thanks to neuroplasticity—the brain’s ability to rewire itself while preserving essential information. The authors contrast this with current LLMs, whose knowledge remains limited to their context window or static pretraining.

Nested learning groups model components by how often they update, using brain waves (0.5–100 Hz) as a metaphor: four neural frequency levels are mapped onto the Q/K/V linear layers. This layered memory setup lets the model take in new information without overwriting what it already knows. | Image: Google

Nested learning treats every part of an AI model—including the optimizer and training algorithm—as memory. Backpropagation stores links between data and errors, and the optimizer's state, like momentum, acts as memory too. The Continuum Memory System (CMS) splits memory into modules that update at different rates, giving the model temporal depth.
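To make the multi-rate idea concrete, here is a minimal sketch in Python. The class names, frequency values, and moving-average update rule are illustrative assumptions, not code from the paper; the point is only that each memory level absorbs new information on its own schedule.

```python
import numpy as np

class MemoryModule:
    """One memory level that only updates every `update_every` steps."""
    def __init__(self, dim, update_every, lr=0.1):
        self.state = np.zeros(dim)
        self.update_every = update_every
        self.lr = lr

    def maybe_update(self, step, signal):
        # Absorb new information only on this level's own schedule.
        if step % self.update_every == 0:
            self.state += self.lr * (signal - self.state)

class ContinuumMemory:
    """Chain of levels, from fast (every step) to slow (rare updates)."""
    def __init__(self, dim, frequencies=(1, 4, 16, 64)):
        self.levels = [MemoryModule(dim, f) for f in frequencies]

    def step(self, step, x):
        signal = x
        for level in self.levels:
            level.maybe_update(step, signal)
            signal = level.state  # slower levels consolidate faster ones

cms = ContinuumMemory(dim=8)
rng = np.random.default_rng(0)
for t in range(256):
    cms.step(t, rng.normal(size=8))
```

The result is temporal depth: fresh input reshapes the fast level every step, while only its consolidated state trickles into the slower levels.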

Nested learning breaks training into levels, each with its own gradient flow, local memory, and objective for step-by-step parameter updates. This example shows a model with three levels. | Image: Google

HOPE: nested learning in practice

Google's HOPE architecture puts this to work. HOPE uses long-term memory modules called Titans, which store information based on how surprising it is to the model. It layers different types of memory and uses CMS blocks for larger context windows. Fast layers process live input, slower layers distill what's important for long-term storage, and the system can adapt its update rules as it learns. This goes beyond the typical “pretrain and freeze” model.
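The surprise-based write can be sketched in the same spirit. This toy uses a linear associative memory with prediction error as the surprise signal; the real Titans modules are learned neural memories, so treat everything here as an assumption about the mechanism, not Google's implementation.

```python
import numpy as np

class SurpriseGatedMemory:
    """Toy associative memory that writes harder when it is surprised."""
    def __init__(self, dim, lr=0.5):
        self.W = np.zeros((dim, dim))  # maps keys to stored values
        self.lr = lr

    def write(self, key, value):
        error = value - self.W @ key          # prediction error = "surprise"
        surprise = float(np.linalg.norm(error))
        gate = surprise / (1.0 + surprise)    # bounded gate in [0, 1)
        # Hebbian-style outer-product write, scaled by surprise: familiar
        # inputs leave the memory almost unchanged, novel ones are stored.
        self.W += self.lr * gate * np.outer(error, key) / (key @ key + 1e-8)
        return surprise

    def read(self, key):
        return self.W @ key

mem = SurpriseGatedMemory(dim=4)
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0, 0.0])
print(mem.write(k, v))  # ~1.0: first exposure, highly surprising
print(mem.write(k, v))  # ~0.75: association partly stored, less surprising
```

Repeated exposure to the same association drives the surprise, and with it the write strength, toward zero, which is how existing content survives a stream of new input.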

HOPE splits the model into four levels with different chunk lengths (16 up to 16M) and graded update frequencies, so it can handle new and old information at different speeds. Standard transformers, by contrast, work only with the current context window or data from pretraining. | Image: Google

The team tested HOPE on language modeling and reasoning. With models at 1.3 billion parameters trained on 100 billion tokens, HOPE outperformed Transformer++ and newer models like RetNet and DeltaNet.

At 1.3 billion parameters and 100 billion training tokens, HOPE posts the lowest loss (~13) and the highest benchmark score (~58), ahead of Titans (~14 / ~57), Samba (~15 / ~54), and a standard Transformer (~20 / ~52), though the margins are small. | Image: Google

HOPE also performed better in long-context and needle-in-a-haystack tests, where the model has to find a specific piece of information in a large amount of text. Model sizes in these tests ranged from 340 million to 1.3 billion parameters. HOPE's gains were consistent, and the authors say it can outperform both transformers and modern recurrent networks. An independent reproduction is available on GitHub.

Summary
  • Google Research has developed "Nested Learning," a concept inspired by how the human brain forms memories, to address catastrophic forgetting in AI models.
  • Their new HOPE architecture organizes the model into multiple levels with varying context sizes and update cycles, allowing new information to be added without erasing previous knowledge.
  • In testing, HOPE outperformed conventional transformer models on language and memory tasks, particularly when handling long-context challenges.
Jonathan writes for THE DECODER about how AI tools can improve both work and creative projects.