Training Language Models to Self-Correct via Reinforcement Learning

id:

2409.12917

Authors:

Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust

Published:

2024-09-19

arXiv:

https://arxiv.org/abs/2409.12917

PDF:

https://arxiv.org/pdf/2409.12917

DOI:

N/A

Journal Reference:

N/A

Primary Category:

cs.LG

Categories:

cs.LG

Comment:

N/A

github_url:

_

abstract

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM’s self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model’s own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model’s own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models’ self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
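To make the multi-turn setup described in the abstract concrete, the sketch below shows what a two-turn self-correction rollout could look like: the model produces a first attempt, is prompted to revise its own answer, and a verifier scores both attempts. This is a minimal illustration rather than the paper's implementation; `model.generate`, `grade`, and the fixed revision instruction are assumed placeholders.

```python
# Minimal sketch of a two-turn self-correction rollout (illustrative only).
# `model.generate` and `grade` stand in for the LLM API and answer verifier;
# the revision prompt is a placeholder, not the paper's exact wording.

REVISION_PROMPT = (
    "There might be an error in the solution above. "
    "Please correct it and state your final answer."
)

def self_correction_rollout(model, problem, grade):
    """Collect one (first attempt, revised attempt) trace plus rewards."""
    attempt_1 = model.generate(problem)                        # turn 1: initial solution
    context = problem + "\n" + attempt_1 + "\n" + REVISION_PROMPT
    attempt_2 = model.generate(context)                        # turn 2: self-corrected solution
    r1 = grade(problem, attempt_1)                             # 0/1 correctness of turn 1
    r2 = grade(problem, attempt_2)                             # 0/1 correctness of turn 2
    return attempt_1, attempt_2, r1, r2
```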

premise

outline

quotes

notes

summary

1. Brief Overview

This paper introduces SCoRe, a novel multi-turn online reinforcement learning (RL) approach that significantly improves the self-correction ability of a large language model (LLM) using entirely self-generated data. Existing methods for training self-correction in LLMs typically rely on multiple models, a more advanced model, or external supervision. SCoRe addresses these limitations by training a single model to both generate responses and correct its own mistakes, using a two-stage training process to mitigate issues such as distribution shift and behavior collapse. The method achieves state-of-the-art self-correction performance on the MATH and HumanEval benchmarks.
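The “reward bonus to amplify self-correction” mentioned in the abstract is the main lever against the behavior collapse noted above: the second attempt earns extra credit only when it actually improves on the first. The snippet below is one plausible form of such a bonus, with `alpha` as an illustrative coefficient; the exact shaping used in the paper may differ.

```python
# Hypothetical form of the self-correction reward bonus (not the paper's exact
# formula). r1 and r2 are the correctness rewards of the first and revised
# attempts; the bonus is positive only when the revision improves on turn 1.
def shaped_second_turn_reward(r1: float, r2: float, alpha: float = 1.0) -> float:
    return r2 + alpha * (r2 - r1)
```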

2. Key Points

  • SCoRe is a novel multi-turn online RL approach for training LLMs to self-correct.

  • It uses entirely self-generated data, avoiding the need for external supervision or multiple models.

  • A two-stage training process addresses distribution shift and behavior collapse issues common in supervised fine-tuning (SFT) and standard RL approaches.

  • Stage I initializes the RL process with a policy less susceptible to collapse.

  • Stage II uses reward shaping to incentivize effective self-correction behavior (both stages are sketched in the skeleton after this list).

  • SCoRe achieves state-of-the-art results on MATH and HumanEval benchmarks, significantly improving base model performance.

  • The paper demonstrates that SFT on self-generated data is insufficient for robust self-correction.
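Putting the pieces together, the skeleton below sketches one way the two stages could be wired up, reusing the `self_correction_rollout` and `shaped_second_turn_reward` helpers sketched earlier. It assumes, as one plausible reading of the abstract, that Stage I discourages the first attempt from drifting away from the base model (here a KL penalty) while rewarding a good revision, and that Stage II optimizes both turns with the shaped reward; `kl_penalty` and `policy_gradient_step` are hypothetical helpers, not the authors' API.

```python
# Hedged skeleton of a SCoRe-style two-stage training loop (not the authors' code).
# Reuses self_correction_rollout and shaped_second_turn_reward from the sketches
# above; kl_penalty and policy_gradient_step are assumed helpers that estimate a
# KL term and apply a REINFORCE-style update for a scalar reward, respectively.

def train_two_stage(policy, base_model, problems, grade,
                    stage1_steps=1, stage2_steps=1, beta=0.1, alpha=1.0):
    # Stage I: reward a good revised attempt while penalizing drift of the first
    # attempt from the base model, aiming for an initialization that is less
    # prone to collapsing into "never really revise".
    for _ in range(stage1_steps):
        for problem in problems:
            a1, a2, r1, r2 = self_correction_rollout(policy, problem, grade)
            reward = r2 - beta * kl_penalty(policy, base_model, problem, a1)
            policy_gradient_step(policy, trace=(problem, a1, a2), reward=reward)

    # Stage II: optimize both turns jointly; the shaped bonus favors genuine
    # improvement from turn 1 to turn 2 over restating the first answer.
    for _ in range(stage2_steps):
        for problem in problems:
            a1, a2, r1, r2 = self_correction_rollout(policy, problem, grade)
            reward = r1 + shaped_second_turn_reward(r1, r2, alpha)
            policy_gradient_step(policy, trace=(problem, a1, a2), reward=reward)
```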

3. Notable Quotes

No standout quotes were captured from the paper body, but the abstract states the core argument: “Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision… SCoRe addresses these challenges by training under the model’s own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time…”

4. Primary Themes

  • Self-correction in LLMs: The central theme is improving the ability of LLMs to autonomously identify and correct their own mistakes.

  • Reinforcement Learning: The paper focuses on using RL as the primary training paradigm to address the limitations of supervised learning approaches.

  • Self-supervised Learning: The approach emphasizes learning from self-generated data, making it a form of self-supervised learning.

  • Overcoming Limitations of SFT and Standard RL: The authors extensively analyze the shortcomings of existing techniques, highlighting the need for a novel approach like SCoRe.

  • Benchmarking and Evaluation: The paper rigorously evaluates SCoRe’s performance against baselines and other state-of-the-art methods on established benchmarks.