Training Language Models to Self-Correct via Reinforcement Learning

id:

2409.12917

Authors:

Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust

Published:

2024-09-19

arXiv:

https://arxiv.org/abs/2409.12917

PDF:

https://arxiv.org/pdf/2409.12917

DOI:

N/A

Journal Reference:

N/A

Primary Category:

cs.LG

Categories:

cs.LG

Comment:

N/A

github_url:

_

abstract

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM’s self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model’s own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model’s own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models’ self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
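To make the multi-turn setup described in the abstract concrete, the sketch below shows what a two-turn self-correction rollout could look like: the model produces a first attempt, is prompted to revise its own answer, and a verifier scores both attempts. This is a minimal illustration rather than the paper's implementation; `model.generate`, `grade`, and the fixed revision instruction are assumed placeholders.

```python
# Minimal sketch of a two-turn self-correction rollout (illustrative only).
# `model.generate` and `grade` stand in for the LLM API and answer verifier;
# the revision prompt is a placeholder, not the paper's exact wording.

REVISION_PROMPT = (
    "There might be an error in the solution above. "
    "Please correct it and state your final answer."
)

def self_correction_rollout(model, problem, grade):
    """Collect one (first attempt, revised attempt) trace plus rewards."""
    attempt_1 = model.generate(problem)                        # turn 1: initial solution
    context = problem + "\n" + attempt_1 + "\n" + REVISION_PROMPT
    attempt_2 = model.generate(context)                        # turn 2: self-corrected solution
    r1 = grade(problem, attempt_1)                             # 0/1 correctness of turn 1
    r2 = grade(problem, attempt_2)                             # 0/1 correctness of turn 2
    return attempt_1, attempt_2, r1, r2
```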

premise

outline

quotes

notes

summary

1. Brief Overview

This paper introduces SCoRe, a novel multi-turn online reinforcement learning (RL) approach that significantly improves the self-correction ability of a large language model (LLM) using entirely self-generated data. Existing methods for training self-correction in LLMs typically rely on multiple models, a more advanced model, or external supervision. SCoRe addresses these limitations by training a single model to both generate responses and correct its own mistakes, using a two-stage training process to mitigate issues such as distribution shift and behavior collapse. The method achieves state-of-the-art self-correction performance on the MATH and HumanEval benchmarks.
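The “reward bonus to amplify self-correction” mentioned in the abstract is the main lever against the behavior collapse noted above: the second attempt earns extra credit only when it actually improves on the first. The snippet below is one plausible form of such a bonus, with `alpha` as an illustrative coefficient; the exact shaping used in the paper may differ.

```python
# Hypothetical form of the self-correction reward bonus (not the paper's exact
# formula). r1 and r2 are the correctness rewards of the first and revised
# attempts; the bonus is positive only when the revision improves on turn 1.
def shaped_second_turn_reward(r1: float, r2: float, alpha: float = 1.0) -> float:
    return r2 + alpha * (r2 - r1)
```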

2. Key Points

  • SCoRe is a novel multi-turn online RL approach for training LLMs to self-correct.

  • It uses entirely self-generated data, avoiding the need for external supervision or multiple models.

  • A two-stage training process addresses distribution shift and behavior collapse issues common in supervised fine-tuning (SFT) and standard RL approaches.

  • Stage I initializes the RL process with a policy less susceptible to collapse.

  • Stage II uses reward shaping to incentivize effective self-correction behavior (both stages are sketched in the skeleton after this list).

  • SCoRe achieves state-of-the-art results on MATH and HumanEval benchmarks, significantly improving base model performance.

  • The paper demonstrates that SFT on self-generated data is insufficient for robust self-correction.
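Putting the pieces together, the skeleton below sketches one way the two stages could be wired up, reusing the `self_correction_rollout` and `shaped_second_turn_reward` helpers sketched earlier. It assumes, as one plausible reading of the abstract, that Stage I discourages the first attempt from drifting away from the base model (here a KL penalty) while rewarding a good revision, and that Stage II optimizes both turns with the shaped reward; `kl_penalty` and `policy_gradient_step` are hypothetical helpers, not the authors' API.

```python
# Hedged skeleton of a SCoRe-style two-stage training loop (not the authors' code).
# Reuses self_correction_rollout and shaped_second_turn_reward from the sketches
# above; kl_penalty and policy_gradient_step are assumed helpers that estimate a
# KL term and apply a REINFORCE-style update for a scalar reward, respectively.

def train_two_stage(policy, base_model, problems, grade,
                    stage1_steps=1, stage2_steps=1, beta=0.1, alpha=1.0):
    # Stage I: reward a good revised attempt while penalizing drift of the first
    # attempt from the base model, aiming for an initialization that is less
    # prone to collapsing into "never really revise".
    for _ in range(stage1_steps):
        for problem in problems:
            a1, a2, r1, r2 = self_correction_rollout(policy, problem, grade)
            reward = r2 - beta * kl_penalty(policy, base_model, problem, a1)
            policy_gradient_step(policy, trace=(problem, a1, a2), reward=reward)

    # Stage II: optimize both turns jointly; the shaped bonus favors genuine
    # improvement from turn 1 to turn 2 over restating the first answer.
    for _ in range(stage2_steps):
        for problem in problems:
            a1, a2, r1, r2 = self_correction_rollout(policy, problem, grade)
            reward = r1 + shaped_second_turn_reward(r1, r2, alpha)
            policy_gradient_step(policy, trace=(problem, a1, a2), reward=reward)
```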

3. Notable Quotes

No standout quotes were captured from the paper body, but the abstract states the core argument: “Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision… SCoRe addresses these challenges by training under the model’s own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time…”

4. Primary Themes

  • Self-correction in LLMs: The central theme is improving the ability of LLMs to autonomously identify and correct their own mistakes.

  • Reinforcement Learning: The paper focuses on using RL as the primary training paradigm to address the limitations of supervised learning approaches.

  • Self-supervised Learning: The approach emphasizes learning from self-generated data, making it a form of self-supervised learning.

  • Overcoming Limitations of SFT and Standard RL: The authors extensively analyze the shortcomings of existing techniques, highlighting the need for a novel approach like SCoRe.

  • Benchmarking and Evaluation: The paper rigorously evaluates SCoRe’s performance against baselines and other state-of-the-art methods on established benchmarks.