When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1
- id:
2410.01792
- Authors:
R. Thomas McCoy, Shunyu Yao, Dan Friedman, Mathew D. Hardy, Thomas L. Griffiths
- Published:
2024-10-02
- arXiv:
- PDF:
- DOI:
N/A
- Journal Reference:
N/A
- Primary Category:
cs.CL
- Categories:
cs.CL, cs.AI
- Comment:
6 pages; updated to fix typo in Fig 4 caption
- github_url:
_
abstract
In “Embers of Autoregression” (McCoy et al., 2023), we showed that several large language models (LLMs) have some important limitations that are attributable to their origins in next-word prediction. Here we investigate whether these issues persist with o1, a new system from OpenAI that differs from previous LLMs in that it is optimized for reasoning. We find that o1 substantially outperforms previous LLMs in many cases, with particularly large improvements on rare variants of common tasks (e.g., forming acronyms from the second letter of each word in a list, rather than the first letter). Despite these quantitative improvements, however, o1 still displays the same qualitative trends that we observed in previous systems. Specifically, o1 – like previous LLMs – is sensitive to the probability of examples and tasks, performing better and requiring fewer “thinking tokens” in high-probability settings than in low-probability ones. These results show that optimizing a language model for reasoning can mitigate but might not fully overcome the language model’s probability sensitivity.
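The "rare variant of a common task" mentioned in the abstract (second-letter acronyms instead of first-letter ones) can be sketched in a few lines; the function name and example words below are illustrative, not from the paper:

```python
def acronym(words, letter_index=0):
    """Form an acronym from the letter at position `letter_index` of each word.

    letter_index=0 gives the common first-letter variant;
    letter_index=1 gives the rarer second-letter variant probed in the paper.
    """
    return "".join(w[letter_index].upper() for w in words)

words = ["global", "positioning", "system"]
print(acronym(words, 0))  # common variant -> "GPS"
print(acronym(words, 1))  # rare variant   -> "LOY"
```

Both variants are trivially the same algorithm, which is what makes the accuracy gap between them diagnostic: any difference reflects task frequency in training data rather than task difficulty.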
premise
outline
quotes
notes
summary
1. Brief overview
This paper analyzes OpenAI’s o1, asking whether a model optimized for reasoning still shows the “embers” of its autoregressive next-word-prediction origins. The authors find that while o1 substantially outperforms previous LLMs on reasoning tasks, particularly rare variants of common problems, it remains sensitive both to the probability of its outputs and to the frequency of the tasks it is given. This suggests that optimizing for reasoning can mitigate, but not fully eliminate, the influence of a model’s autoregressive origins.
2. Key points
o1 substantially outperforms previous LLMs on many reasoning tasks, especially rare variants.
Despite improvements, o1 still shows sensitivity to output probability (performing better on high-probability outputs).
o1 also displays sensitivity to task frequency (performing better on common task variants).
This sensitivity to probability and frequency is less pronounced in o1 than in previous LLMs.
The study uses “thinking tokens” as a measure of difficulty, corroborating accuracy-based findings.
More challenging task variations reveal stronger task frequency effects in o1.
3. Notable quotes
No quotes beyond the abstract stood out as worth preserving for future reference.
4. Primary themes
The persistence of autoregressive biases in LLMs optimized for reasoning: Even when a language model is trained to focus on reasoning, its underlying autoregressive nature still influences its performance.
The teleological perspective in AI analysis: Understanding the pressures that shape an AI system’s development is crucial for analyzing its strengths and limitations.
Quantitative and qualitative analysis of LLM performance: The study employs both accuracy and token usage metrics to evaluate model performance, providing a more comprehensive understanding of its capabilities and limitations.
The impact of task frequency and output probability on LLM performance: The study demonstrates the continued influence of data distribution on LLM performance, even in models specifically designed for reasoning tasks.