Training Language Models to Self-Correct via Reinforcement Learning
- id:
2409.12917
- Authors:
Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust
- Published:
2024-09-19
- arXiv:
https://arxiv.org/abs/2409.12917
- PDF:
https://arxiv.org/pdf/2409.12917
- DOI:
N/A
- Journal Reference:
N/A
- Primary Category:
cs.LG
- Categories:
cs.LG
- Comment:
N/A
- github_url:
_
abstract
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM’s self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model’s own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model’s own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models’ self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
premise
outline
quotes
notes
summary
1. Brief Overview
This paper introduces SCoRe, a multi-turn online reinforcement learning (RL) approach that substantially improves a large language model's (LLM's) ability to correct its own mistakes using entirely self-generated data. Existing methods for training self-correction typically rely on multiple models, a more capable model, or additional forms of supervision. SCoRe instead trains a single model to both generate responses and correct them, using a two-stage training process that mitigates distribution mismatch and behavior collapse. The method achieves state-of-the-art self-correction performance on the MATH and HumanEval benchmarks.
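As a rough formalization (my paraphrase of the paper's description, not its exact notation), Stage II can be read as jointly optimizing both attempts, with the second attempt's reward augmented by a bonus proportional to its improvement over the first, under a KL penalty that keeps the policy close to a reference model:

$$
\max_{\theta}\;
\mathbb{E}_{x,\; y_1 \sim \pi_\theta(\cdot \mid x),\; y_2 \sim \pi_\theta(\cdot \mid x, y_1)}
\Big[\hat{r}(y_1, y^{*}) + \hat{r}(y_2, y^{*}) + \alpha\big(\hat{r}(y_2, y^{*}) - \hat{r}(y_1, y^{*})\big)\Big]
\;-\; \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)
$$

where $\hat{r}(\cdot, y^{*})$ scores a response against the reference answer $y^{*}$, $\alpha$ weights the self-correction bonus, and $\beta$ controls the KL regularization. Stage I uses the same two-attempt setup but optimizes only the second attempt while constraining the first attempt to stay close to the base model.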
2. Key Points
SCoRe is a novel multi-turn online RL approach for training LLMs to self-correct.
It uses entirely self-generated data, avoiding the need for external supervision or multiple models.
A two-stage training process addresses the distribution mismatch and behavior collapse issues that plague supervised fine-tuning (SFT) and naive single-stage RL.
Stage I produces a policy initialization that is less susceptible to collapse, by optimizing second-attempt corrections while constraining the first attempt to stay close to the base model.
Stage II runs multi-turn RL on both attempts with a reward bonus for improving from the first attempt to the second, incentivizing genuine self-correction rather than simply repeating a strong first answer (see the sketch after this list).
SCoRe achieves state-of-the-art results on MATH and HumanEval benchmarks, significantly improving base model performance.
The paper demonstrates that SFT on self-generated data is insufficient for robust self-correction.
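To make the two-stage recipe concrete, here is a minimal, hypothetical sketch of the training loop in Python. All names (sample_attempt, verifier, shaped_rewards, ALPHA, BETA_KL) are my own illustrative stand-ins, not the paper's code; the RL update itself is stubbed out, and the bonus formula assumes Stage II rewards the improvement from the first attempt to the second.

```python
import random

# Hypothetical sketch of a SCoRe-style two-stage training loop.
# sample_attempt, verifier, and the RL update are stubs; only the
# reward-shaping logic is meant to be illustrative.

ALPHA = 1.0    # weight of the Stage II progress bonus (assumed hyperparameter)
BETA_KL = 0.1  # strength of the KL penalty to the reference policy (assumed)

def sample_attempt(policy, problem, prior_attempt=None):
    """Stub: sample a first attempt (prior_attempt=None) or a revision of it."""
    return f"answer-{random.randint(0, 9)}"

def verifier(answer, problem):
    """Stub: 1.0 if the final answer matches the reference, else 0.0."""
    return float(answer.endswith(problem["reference_answer"]))

def shaped_rewards(r1, r2, stage):
    """Per-turn rewards fed to the (stubbed) policy-gradient update."""
    if stage == 1:
        # Stage I: optimize only the second attempt; the first attempt is
        # instead held close to the base model by a separate KL constraint.
        return 0.0, r2
    # Stage II: reward both attempts, plus a bonus proportional to the
    # improvement from attempt 1 to attempt 2, discouraging collapse into
    # "commit to one answer and never really edit it".
    return r1, r2 + ALPHA * (r2 - r1)

def train(policy, problems, stage, steps=3):
    for _ in range(steps):
        problem = random.choice(problems)
        y1 = sample_attempt(policy, problem)                    # first attempt
        y2 = sample_attempt(policy, problem, prior_attempt=y1)  # self-correction turn
        r1, r2 = verifier(y1, problem), verifier(y2, problem)
        turn_rewards = shaped_rewards(r1, r2, stage)
        # rl_update(policy, problem, (y1, y2), turn_rewards, kl_coeff=BETA_KL)  # stubbed
        print(f"stage={stage} r1={r1} r2={r2} shaped={turn_rewards}")

if __name__ == "__main__":
    toy_problems = [{"question": "toy problem", "reference_answer": "7"}]
    train(policy=None, problems=toy_problems, stage=1)  # Stage I: initialization
    train(policy=None, problems=toy_problems, stage=2)  # Stage II: multi-turn RL with bonus
```

The point the sketch tries to capture is that, under this shaping, the second attempt only earns extra reward when it actually improves on the first, which is what pushes the policy toward editing rather than restating its initial answer.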
3. Notable Quotes
No standout quotes were captured; the abstract states the core argument: “Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision… SCoRe addresses these challenges by training under the model’s own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time…”
4. Primary Themes
Self-correction in LLMs: The central theme is improving the ability of LLMs to autonomously identify and correct their own mistakes.
Reinforcement Learning: The paper focuses on using RL as the primary training paradigm to address the limitations of supervised learning approaches.
Self-generated training data: The approach learns entirely from the model’s own correction traces, with no external supervision, stronger teacher model, or human-written corrections.
Overcoming Limitations of SFT and Standard RL: The authors extensively analyze the shortcomings of existing techniques, highlighting the need for a novel approach like SCoRe.
Benchmarking and Evaluation: The paper rigorously evaluates SCoRe’s performance against baselines and other state-of-the-art methods on established benchmarks.