empirical study · Chapter 5 · arXiv · 2023
Process Reward Models to Align Reasoning
Lightman et al. (OpenAI)
Abstract
We investigate the effectiveness of process reward models (PRMs) for improving mathematical reasoning. PRMs provide step-level feedback rather than outcome-level feedback, enabling more precise alignment of reasoning chains.
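The contrast between outcome-level and step-level feedback can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names, the per-step probabilities, and the 0.5 threshold are all hypothetical, and multiplying step probabilities is just one plausible way to combine step scores into a solution score.

```python
from math import prod
from typing import List, Optional


def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome-level feedback: one signal for the whole solution."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def solution_score(step_probs: List[float]) -> float:
    """Step-level feedback, aggregated as the product of per-step
    correctness probabilities (an assumed aggregation rule)."""
    return prod(step_probs)


def first_suspect_step(step_probs: List[float],
                       threshold: float = 0.5) -> Optional[int]:
    """Index of the first step scored below the threshold, or None."""
    for index, prob in enumerate(step_probs):
        if prob < threshold:
            return index
    return None


# Hypothetical per-step scores for a four-step solution.
step_probs = [0.98, 0.95, 0.20, 0.90]

print(outcome_reward("41", "42"))      # only says the answer is wrong
print(solution_score(step_probs))      # low overall confidence
print(first_suspect_step(step_probs))  # points at the third step (index 2)
```

The outcome signal says only that the final answer is wrong, while the step-level scores also indicate where the chain most likely went off track.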
Key Contributions
- Process reward model methodology
- Step-level vs outcome-level comparison
- PRM800K dataset
Eigenvector Commentary
PRMs are the right abstraction for evaluating long-horizon agent workflows. Outcome-only evaluation misses the critical intermediate steps where errors compound. For Zone III governance, step-level quality assessment is the foundation of reliable autonomous operation.
Topics
process reward models · reasoning alignment · step-level feedback · RLHF
Relevance Scores
Long-Horizon Score: 85
Enterprise Score: 76
Completeness: 84
Paper Info
Year: 2023
Venue: arXiv
Type: empirical study
Chapter: Ch. 5
Authors: 1
Related Papers
ReAct: Synergizing Reasoning and Acting in Language Mod…
2023 · Ch.1
Reflexion: Language Agents with Verbal Reinforcement Le…
2023 · Ch.1
Tree of Thoughts: Deliberate Problem Solving with Large…
2023 · Ch.1
Toolformer: Language Models Can Teach Themselves to Use…
2023 · Ch.1