Theoretical framework · Chapter 5 · NeurIPS 2023 · 2023
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov (Stanford), Archit Sharma (Stanford)
Abstract
We introduce Direct Preference Optimization (DPO), a stable, performant, and computationally lightweight alternative to RLHF. DPO directly optimizes for human preferences without explicit reward modeling.
Key Contributions
- DPO algorithm for preference optimization
- Elimination of explicit reward model
- Stable and efficient alignment training
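To make "without explicit reward modeling" concrete, here is a minimal sketch of the per-example DPO loss as it is commonly stated: the negative log-sigmoid of a scaled difference between the policy-versus-reference log-probability ratios of the preferred and dispreferred responses. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not taken from this page; the inputs are assumed to be summed token log-probabilities of each full response.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss (illustrative sketch, hypothetical names).

    Each argument is the log-probability of a full response under either
    the trainable policy or the frozen reference model.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is zero.
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)

# Policy has shifted probability toward the preferred response.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In this sketch no separate reward network appears: the reward signal is implied by how far the policy's log-probabilities have moved from the reference, so `improved` comes out lower than `baseline`. A real training loop would compute this over batches with an autodiff framework rather than plain floats.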
Topics
preference optimization, RLHF, alignment, fine-tuning
Relevance Scores
Long-Horizon Score: 72
Enterprise Score: 78
Completeness: 78
Paper Info
Year: 2023
Venue: NeurIPS 2023
Type: theoretical framework
Chapter: Ch. 5
Authors: 2
Related Papers
ReAct: Synergizing Reasoning and Acting in Language Mod…
2023 · Ch.1
Reflexion: Language Agents with Verbal Reinforcement Le…
2023 · Ch.1
Tree of Thoughts: Deliberate Problem Solving with Large…
2023 · Ch.1
Toolformer: Language Models Can Teach Themselves to Use…
2023 · Ch.1