Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov (Stanford), Archit Sharma (Stanford)

Abstract

We introduce Direct Preference Optimization (DPO), a stable, performant, and computationally lightweight alternative to reinforcement learning from human feedback (RLHF). DPO optimizes the language model directly on human preference data, without fitting an explicit reward model.
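As a rough illustration of what optimizing directly on preferences can look like, the sketch below computes a DPO-style loss for a single preference pair in plain Python. It assumes the summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses are already available under both the trained policy and a frozen reference model. The function names, argument names, and the default `beta=0.1` are illustrative assumptions, not taken from this summary.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift from the reference
) -> float:
    """DPO-style loss for one (chosen, rejected) pair of responses.

    Each argument is the total log-probability of a full response given the
    prompt, under either the policy being trained or the frozen reference.
    """
    # Log-ratio of policy to reference acts as an implicit reward,
    # so no separate reward model is needed.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Binary classification loss on the scaled difference of implicit rewards.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Policy identical to the reference: the margin is zero.
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy has shifted probability toward the chosen response.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

When the policy matches the reference, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen response's probability relative to the rejected one, measured against the reference. In practice the log-probabilities would come from batched model forward passes and the loss would be averaged over a dataset of preference pairs.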

Key Contributions

→ DPO algorithm for preference optimization
→ Elimination of explicit reward model
→ Stable and efficient alignment training

Topics

preference optimization, RLHF, alignment, fine-tuning

Relevance Scores

Long-Horizon Score: 72

Enterprise Score: 78

Completeness: 78

Paper Info

Year: 2023

Venue: NeurIPS 2023

Type: theoretical framework

Chapter: Ch. 5

Authors: 2

Frameworks

PASF AEGIS