Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models

id:

2411.12580

Authors:

Laura Ruis, Maximilian Mozes, Juhan Bae, Siddhartha Rao Kamalakara, Dwarak Talupuru, Acyr Locatelli, Robert Kirk, Tim Rocktäschel, Edward Grefenstette, Max Bartolo

Published:

2024-11-19

arXiv:

https://arxiv.org/abs/2411.12580

PDF:

https://arxiv.org/pdf/2411.12580

DOI:

N/A

Journal Reference:

N/A

Primary Category:

cs.CL

Categories:

cs.CL, cs.LG

Comment:

N/A

github_url:

_

abstract

The capabilities and limitations of Large Language Models have been sketched out in great detail in recent years, providing an intriguing yet conflicting picture. On the one hand, LLMs demonstrate a general ability to solve problems. On the other hand, they show surprising reasoning gaps when compared to humans, casting doubt on the robustness of their generalisation strategies. The sheer volume of data used in the design of LLMs has precluded us from applying the method traditionally used to measure generalisation: train-test set separation. To overcome this, we study what kind of generalisation strategies LLMs employ when performing reasoning tasks by investigating the pretraining data they rely on. For two models of different sizes (7B and 35B) and 2.5B of their pretraining tokens, we identify what documents influence the model outputs for three simple mathematical reasoning tasks and contrast this to the data that are influential for answering factual questions. We find that, while the models rely on mostly distinct sets of data for each factual question, a document often has a similar influence across different reasoning questions within the same task, indicating the presence of procedural knowledge. We further find that the answers to factual questions often show up in the most influential data. However, for reasoning questions the answers usually do not show up as highly influential, nor do the answers to the intermediate reasoning steps. When we characterise the top ranked documents for the reasoning questions qualitatively, we confirm that the influential documents often contain procedural knowledge, like demonstrating how to obtain a solution using formulae or code. Our findings indicate that the approach to reasoning the models use is unlike retrieval, and more like a generalisable strategy that synthesises procedural knowledge from documents doing a similar form of reasoning.

premise

outline

quotes

notes

summary

1. Brief overview

This paper investigates how large language models (LLMs) generalize when performing reasoning tasks. Because train-test set separation is impractical at pretraining scale, the authors instead analyze which pretraining documents influence model outputs for reasoning and factual questions. They focus on three simple mathematical reasoning tasks and contrast them with factual questions, using influence functions to identify the most influential documents. The analysis covers two Cohere Command R models (7B and 35B parameters) and a 2.5B-token subset of their pretraining data.
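
As a rough intuition for the attribution method, the sketch below is a minimal, simplified illustration of an influence-style score: a training document is scored by the dot product between its loss gradient and the query's loss gradient. It assumes a toy PyTorch model, and all names (model, doc_x, query_x, ...) are hypothetical stand-ins; the paper's actual estimator additionally preconditions these gradients with an approximated Hessian (EK-FAC influence functions), which is dropped here for brevity.

    # Minimal sketch of an influence-style score (identity-preconditioned), assuming
    # a toy PyTorch model in place of a 7B/35B LLM. Not the paper's EK-FAC estimator.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    model = nn.Linear(8, 4)          # toy stand-in for the language model
    loss_fn = nn.CrossEntropyLoss()

    def loss_grad(x, y):
        """Flattened gradient of the loss on (x, y) w.r.t. all model parameters."""
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, list(model.parameters()))
        return torch.cat([g.reshape(-1) for g in grads])

    # Hypothetical stand-ins for one pretraining document and one query completion.
    doc_x, doc_y = torch.randn(5, 8), torch.randint(0, 4, (5,))
    query_x, query_y = torch.randn(3, 8), torch.randint(0, 4, (3,))

    # Score: how strongly would up-weighting this document during training move the
    # query loss? Documents are ranked by this score for each query.
    score = torch.dot(loss_grad(query_x, query_y), loss_grad(doc_x, doc_y))
    print(f"influence-style score of document on query: {score.item():+.4f}")

Ranking every document in the pretraining subset by such a score, separately for reasoning and factual queries, is what lets the authors compare which data each type of question relies on.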

2. Key points

  • Procedural knowledge drives reasoning: Documents influencing reasoning traces often contain procedural knowledge, e.g., formulas or code demonstrating solution methods (a hypothetical illustration of such a document follows this list). This suggests LLMs are generalizing strategies rather than simply retrieving answers.

  • Models rely less on individual documents for reasoning: The influence of any single document on a reasoning trace is weaker than on a factual answer, indicating that reasoning draws on a broader, more general set of documents.

  • Factual answers often appear in influential data; reasoning answers rarely do: Answers to factual questions frequently appear in the most influential pretraining documents. The same does not hold for the answers to reasoning questions, nor for the answers to their intermediate steps.

  • Code plays a significant role in mathematical reasoning: Code data is overrepresented among influential documents for reasoning tasks. This highlights the importance of high-quality procedural data in model pretraining.

  • Larger models show stronger effects: The 35B parameter model exhibited more pronounced differences in influence between reasoning and factual questions compared to the 7B parameter model.
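
To make the procedural-knowledge point above concrete, here is a hypothetical example of the kind of influential document the paper describes: a snippet that demonstrates a general solution method in code (here, illustratively, computing a slope and solving a linear equation) without containing the answer to any particular test question. The task choices and function names are illustrative, not drawn from the paper's data.

    # Hypothetical example of "procedural" content: code demonstrating *how* to reach
    # a solution via a general formula, rather than stating any specific answer.
    def slope(x1: float, y1: float, x2: float, y2: float) -> float:
        """Slope of the line through (x1, y1) and (x2, y2)."""
        return (y2 - y1) / (x2 - x1)

    def solve_linear(a: float, b: float, c: float) -> float:
        """Solve a*x + b = c for x."""
        return (c - b) / a

    print(slope(1, 2, 3, 8))      # 3.0
    print(solve_linear(2, 1, 9))  # 4.0

According to the paper's qualitative analysis, documents of roughly this shape (formulae, worked code, step-by-step procedures) dominate the top-ranked influences for reasoning queries, whereas the top-ranked documents for factual queries tend to contain the answer itself.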

3. Notable quotes

No direct quotes were extracted as important for future reference; the findings are conveyed through the results and analysis rather than through individual standout statements.

4. Primary themes

  • LLM Generalization: The core theme is understanding how LLMs generalize to new reasoning tasks, moving beyond simple retrieval-based explanations.

  • Interpretability: The study uses influence functions to gain insights into the internal mechanisms of LLMs, exploring what parts of the training data are most influential for specific types of questions.

  • Pretraining Data Impact: A crucial theme is the role of pretraining data quality and composition in driving the success of LLMs on reasoning tasks. The study suggests that focusing on high-quality procedural knowledge rather than comprehensive coverage of all possible cases might be a more effective approach for pretraining.