Google DeepMind researchers have developed a new method called SCoRe that helps large language models recognize and fix their own mistakes.
Current large language models (LLMs) struggle with self-correction and often rely on multiple models or external checks. SCoRe, short for "Self-Correction via Reinforcement Learning," instead trains a single model to correct itself using only self-generated data.
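To make the setup concrete, here is a minimal sketch of a two-attempt data collection loop. The `model.generate` helper, the correction prompt, and the `check_answer` verifier are illustrative assumptions, not DeepMind's actual interfaces:

```python
def collect_self_correction_data(model, problems, check_answer):
    """Illustrative two-attempt rollout loop (hypothetical interfaces).

    The same model produces both attempts: a first answer, then a
    revision after being asked to find and fix its own mistakes. Traces
    like these are the only training data a SCoRe-style setup needs.
    """
    traces = []
    for problem in problems:
        first = model.generate(problem)
        prompt = (problem + "\n" + first +
                  "\nThere may be an error in the solution above. Please correct it.")
        second = model.generate(prompt)
        traces.append({
            "problem": problem,
            "attempt_1": first,
            "attempt_2": second,
            "reward_1": check_answer(problem, first),   # e.g., 1.0 if correct
            "reward_2": check_answer(problem, second),
        })
    return traces
```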
SCoRe works in two phases. In the first, it optimizes a model initialization that produces effective corrections on the second attempt while keeping first responses close to those of the base model. A loss function balances both objectives: reward for the improved second attempt and a penalty for drifting from the base model on the first.
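As a rough illustration, a two-part objective of this kind could look like the sketch below. The function names, the REINFORCE-style surrogate, and the `kl_weight` value are assumptions for clarity, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def stage_one_loss(policy_logprobs_2nd, reward_2nd,
                   policy_logits_1st, base_logits_1st,
                   kl_weight=0.25):
    """Sketch of a phase-one style objective (illustrative only).

    - Maximizes the reward-weighted log-likelihood of the second attempt
      (a simple REINFORCE-style surrogate).
    - Penalizes KL divergence between the trained policy and the frozen
      base model on the *first* attempt, keeping initial responses close
      to the base model.
    """
    # Push up log-probs of second attempts in proportion to their reward
    # (e.g., 1.0 if the corrected answer is right, 0.0 otherwise).
    pg_loss = -(reward_2nd * policy_logprobs_2nd).mean()

    # Token-level KL(policy || base) over the first-attempt logits.
    policy_logp = F.log_softmax(policy_logits_1st, dim=-1)
    base_logp = F.log_softmax(base_logits_1st, dim=-1)
    kl = (policy_logp.exp() * (policy_logp - base_logp)).sum(-1).mean()

    return pg_loss + kl_weight * kl
```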
The second phase applies multi-turn reinforcement learning in which the model learns to improve both its first and second answers. A shaped reward function encourages genuine self-correction by granting a bonus for improvement between attempts, so fixing a wrong answer pays more than simply repeating a correct one. Unlike methods that need external verification, SCoRe relies only on self-generated training data: the model creates its own examples by solving problems and then attempting to improve its solutions.
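A hedged sketch of such a shaped reward, assuming a binary correctness score per attempt and a bonus weight `alpha` (both illustrative choices, not values from the paper):

```python
def shaped_reward(correct_1st: float, correct_2nd: float,
                  alpha: float = 0.5) -> float:
    """Illustrative reward shaping for two-attempt training.

    Rewards the second attempt's correctness, plus a bonus proportional
    to the improvement from the first attempt to the second. The bonus
    makes flipping a wrong answer to a right one worth more than simply
    repeating an already-correct answer, discouraging the model from
    collapsing into making no edits at all.
    """
    improvement = correct_2nd - correct_1st  # in {-1, 0, +1} for binary scores
    return correct_2nd + alpha * improvement


# Example: correcting a mistake (0 -> 1) earns 1.5,
# while merely repeating a correct answer (1 -> 1) earns 1.0.
assert shaped_reward(0.0, 1.0) == 1.5
assert shaped_reward(1.0, 1.0) == 1.0
```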
SCoRe achieves significant self-correction gains
Tests with Google's Gemini 1.0 Pro and 1.5 Flash models showed significant gains. On the MATH benchmark for mathematical reasoning, self-correction improved by 15.6 percentage points. For code generation on HumanEval, it rose 9.1 percentage points.
The researchers describe SCoRe as the first approach to achieve meaningfully positive intrinsic self-correction, allowing models to improve their answers without external feedback.
However, SCoRe currently trains for only one round of self-correction. Future work could explore multiple correction steps.
The team concludes that teaching meta-strategies like self-correction requires going beyond standard LLM training, and that multi-stage reinforcement learning may open up new possibilities in this area.