
Researchers at MIT have introduced a new framework called SEAL that lets large language models (LLMs) generate their own synthetic training data and improve themselves without outside help.

SEAL works in two stages. First, the model learns to generate effective "self-edits" via reinforcement learning. These self-edits are natural-language instructions that define new training data and set optimization parameters. In the second stage, the system applies those instructions and updates its own weights through supervised fine-tuning.

The model proposes its own self-edits (SE), updates its weights, and is evaluated on the task. Reinforcement learning (RL) rewards edits that improve performance, yielding an updated policy (θt+1) with each cycle. | Image: Zweiger et al.
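In code, one cycle of this loop might look like the Python sketch below. Every name in it (generate_self_edit, apply_self_edit, evaluate) is a hypothetical stub standing in for the paper's components; the "model" is a plain dict and the evaluator is random, so this shows only the two-stage control flow, not SEAL's actual API.

```python
import random

# Toy sketch of one SEAL cycle. All helpers are hypothetical stubs,
# so the control flow runs end to end without any ML dependencies.

def generate_self_edit(context: str) -> str:
    """Stage 1: the model writes a natural-language training directive."""
    return f"Finetune on implications of: {context} (lr=1e-4, epochs=3)"

def apply_self_edit(weights: dict, edit: str) -> dict:
    """Stage 2: turn the directive into a weight update (stands in for a LoRA fine-tune)."""
    return {**weights, "history": weights["history"] + [edit]}

def evaluate(weights: dict) -> float:
    """Downstream-task score (random stub here)."""
    return random.random()

weights = {"history": []}
edit = generate_self_edit("a passage about photosynthesis")
candidate = apply_self_edit(weights, edit)
reward = evaluate(candidate) - evaluate(weights)  # signal for the RL outer loop
```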

A key component of SEAL is the ReST^EM algorithm, which acts as a filter: it keeps and reinforces only those self-edits that actually improve performance. The algorithm samples a batch of candidate edits, tests which ones work, and then trains the model on the successful variants alone. SEAL also uses Low-Rank Adaptation (LoRA), a technique that enables quick, lightweight updates without retraining the entire model.
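A minimal sketch of that filtering step is shown below, with stub callables standing in for the real model, fine-tuner, and task evaluator; none of these names come from the SEAL codebase.

```python
import random

def restem_step(model, propose, apply_edit, score, n_samples=8):
    """One ReST^EM-style round: sample self-edits, keep only the winners."""
    baseline = score(model)
    winners = []
    for _ in range(n_samples):
        edit = propose(model)                    # sample a candidate self-edit
        if score(apply_edit(model, edit)) > baseline:
            winners.append(edit)                 # keep only edits that beat the baseline
    return winners                               # the policy is then fine-tuned on these

# Toy usage with stubs; a real run would plug in an LLM, LoRA fine-tuning,
# and a benchmark evaluator instead.
winners = restem_step(
    model={"acc": 0.3},
    propose=lambda m: f"edit-{random.randint(0, 99)}",
    apply_edit=lambda m, e: {"acc": m["acc"] + random.uniform(-0.1, 0.2)},
    score=lambda m: m["acc"],
)
```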

The researchers put SEAL to the test in two scenarios. In the first, they used Qwen2.5-7B on a text comprehension task. The model generated logical inferences from text and then trained on its own outputs.

Synthetic implications generated from a text passage serve as training data for LoRA fine-tuning. | Image: Zweiger et al.
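The generation step might look roughly like the following; the prompt wording is a paraphrase for illustration, not the paper's exact template.

```python
PASSAGE = "..."  # a text passage from the comprehension benchmark

# The model is asked to spell out what follows from the passage; each
# generated implication then becomes one training example for the
# lightweight LoRA fine-tune.
prompt = (
    "Read the following passage and list the implications that follow "
    f"from it, one per line:\n\n{PASSAGE}"
)
implications = ["..."]  # model output, split into lines
train_examples = [{"text": line} for line in implications]
```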

SEAL reached an accuracy of 47 percent, beating the comparison method's 33.5 percent. The quality of its self-generated data even surpassed that of OpenAI's GPT-4.1, despite the underlying model being much smaller.

Reinforcement learning produces longer, more detailed self-edits (compared for QA self-edits before and after three RL iterations, alongside average edit lengths for base, prompt, and RL), which in turn boost performance. | Image: Zweiger et al.

In a second test, the team looked at few-shot prompting with Llama 3.2-1B on a reasoning task. Here, the model picked data processing techniques and training parameters from a preset toolkit. With SEAL, the model achieved a 72.5 percent success rate, compared to just 20 percent for self-edits generated without prior reinforcement learning.
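In this setting a self-edit is not free-form text but a selection from that fixed menu. A hypothetical example of what such a selection could look like (the field names and values are illustrative, not the paper's exact schema):

```python
# Hypothetical self-edit for the few-shot setting: the model chooses
# which data transformations and training settings to use from a
# predefined toolkit. Field names here are illustrative only.
self_edit = {
    "augmentations": ["rotate_90", "flip_horizontal", "transpose"],
    "learning_rate": 1e-4,
    "train_epochs": 5,
}
```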

"Catastrophic forgetting" remains a challenge

Despite the strong results, the researchers identified several limitations. The main issue is "catastrophic forgetting": as the model takes on new tasks, its performance on earlier ones degrades. Training is also resource-intensive, since each evaluation of a self-edit takes 30 to 45 seconds.

Heat map of model performance across successive self-edit iterations: each round leads to declining accuracy on earlier-learned passages. | Image: Zweiger et al.
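One way to quantify this effect, sketched below, is to re-evaluate the model on every earlier passage after each new self-edit update; `update` and `score` are hypothetical stand-ins for applying a self-edit and measuring task accuracy, in the spirit of the stubs above.

```python
def forgetting_curve(weights, passages, update, score):
    """After each update, score the model on all passages seen so far.

    Row i holds the scores on passages 0..i after the i-th update;
    falling values in earlier columns indicate forgetting, matching
    the pattern in the heat map above.
    """
    rows = []
    for i, passage in enumerate(passages):
        weights = update(weights, passage)
        rows.append([score(weights, p) for p in passages[: i + 1]])
    return rows
```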

Tackling the data wall

The MIT team sees SEAL as a step toward overcoming the so-called "data wall"—the point where all available human-written training data has been used up. Separately, researchers have also warned about the risk of "model collapse," where models degrade in quality when trained too heavily on low-quality AI-generated data. SEAL could enable ongoing learning and autonomous AI systems that keep adapting to new goals and information.

If models can teach themselves by absorbing new material—like scientific papers—and generating their own explanations and inferences, they could keep improving on rare or underrepresented topics. This kind of self-driven learning loop may help push language models past current limits.

The source code for SEAL is available on GitHub.

Summary
  • MIT researchers have developed SEAL, a framework that allows large language models to generate their own synthetic training data and improve themselves without outside input, using a process of self-editing and reinforcement learning.
  • In tests, SEAL enabled models to outperform comparison methods on text comprehension and reasoning tasks, with self-generated data even surpassing that of larger models like GPT-4.1, despite using much smaller underlying models.
  • The approach faces challenges such as "catastrophic forgetting," where models lose performance on earlier tasks as they learn new ones, and high computational demands, but it offers a pathway for language models to keep learning and adapting beyond the limits of existing human-written data.