Empirical study · Chapter 5 · arXiv · 2023

Process Reward Models to Align Reasoning

Lightman et al. (OpenAI)

Abstract

We investigate the effectiveness of process reward models (PRMs) for improving mathematical reasoning. PRMs provide step-level feedback rather than outcome-level feedback, enabling more precise alignment of reasoning chains.
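
The contrast between step-level and outcome-level feedback can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the per-step correctness probabilities are hypothetical stand-ins for what a trained PRM would output, the function names are invented for this example, and multiplying step probabilities is just one common way to aggregate them into a single solution score.

```python
from math import prod
from typing import Dict, Optional, Sequence


def outcome_score(final_answer: str, reference: str) -> float:
    """Outcome-level feedback: a single signal for the whole solution."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def process_score(step_probs: Sequence[float]) -> float:
    """Step-level feedback aggregated into one score.

    Assumption: the score is the product of per-step correctness
    probabilities, i.e. the chance that every step is correct.
    """
    return prod(step_probs)


def first_suspect_step(step_probs: Sequence[float],
                       threshold: float = 0.5) -> Optional[int]:
    """Index of the first step scored below `threshold`, or None."""
    for index, prob in enumerate(step_probs):
        if prob < threshold:
            return index
    return None


def best_of_n(candidates: Dict[str, Sequence[float]]) -> str:
    """Pick the candidate solution with the highest process score."""
    return max(candidates, key=lambda name: process_score(candidates[name]))


# Hypothetical PRM outputs for two solutions that reach the same final answer.
candidates = {
    "solution_a": [0.90, 0.80, 0.95],
    "solution_b": [0.90, 0.20, 0.95],  # second step looks wrong
}
chosen = best_of_n(candidates)
suspect = first_suspect_step(candidates["solution_b"])
```

An outcome-level check would score both solutions identically, since they end at the same answer. The step-level scores separate them and also point at the step in `solution_b` that deserves inspection.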

Key Contributions

  • Process reward model methodology
  • Step-level vs outcome-level comparison
  • PRM800K dataset

Eigenvector Commentary

PRMs are the right abstraction for evaluating long-horizon agent workflows. Outcome-only evaluation misses the critical intermediate steps where errors compound. For Zone III governance, step-level quality assessment is the foundation of reliable autonomous operation.
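
To make the compounding claim concrete, here is a rough back-of-the-envelope sketch. It rests on simplifying assumptions that are ours, not the paper's: every step has the same accuracy, step errors are independent, and a single uncorrected error fails the whole workflow. The function name and the 0.98 figure are illustrative only.

```python
def chain_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps
    with identical accuracy (a deliberate simplification)."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


# Illustrative accuracy of 0.98 per step across workflows of growing length.
for steps in (1, 10, 50):
    print(steps, round(chain_success_probability(0.98, steps), 3))
```

Under these assumptions, a workflow that looks reliable step by step degrades quickly as it grows longer, which is why a check on the final outcome alone says little about where a long run went wrong.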

Topics

  • process reward models
  • reasoning alignment
  • step-level feedback
  • RLHF