Chapter 97: Direct Preference Optimization (DPO)
Learning objectives

- Derive the DPO loss from the Bradley-Terry preference model and the optimal policy under a KL constraint to the reference policy (the closed-form mapping from reward to policy in the BT model).
- Implement DPO: train the language model directly on preference data (τ^w preferred over τ^l) using the DPO loss, without training a separate reward model.
- Compare with the RLHF pipeline (reward model + PPO fine-tuning) in terms of preference accuracy, reward-model score, and implementation complexity.
- Explain the main advantage of DPO: no reward model and no PPO loop, just a supervised loss on preferences.
- Relate DPO to dialogue and RLHF, where it serves as an alternative to reward model + PPO.

Concept and real-world RL ...
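The derivation named in the first objective can be sketched in a few lines. Starting from the KL-constrained reward maximization against the reference policy, the closed-form optimal policy lets us express the reward in terms of the policy; substituting into the Bradley-Terry model cancels the partition function and yields the DPO loss:

```latex
% KL-constrained objective and its closed-form solution
\max_{\pi}\; \mathbb{E}_{x,\,\tau \sim \pi}\big[r(x,\tau)\big]
  - \beta\, \mathrm{KL}\!\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)
\;\Rightarrow\;
\pi_r(\tau\mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(\tau\mid x)\,
  \exp\!\big(r(x,\tau)/\beta\big)

% Invert to express the reward through the policy
r(x,\tau) = \beta \log \frac{\pi_r(\tau\mid x)}{\pi_{\mathrm{ref}}(\tau\mid x)}
  + \beta \log Z(x)

% Bradley-Terry preference probability; Z(x) cancels in the difference
p(\tau^w \succ \tau^l \mid x) = \sigma\!\big(r(x,\tau^w) - r(x,\tau^l)\big)

% Resulting DPO loss: a supervised loss directly on the policy
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\tau^w,\tau^l)\sim\mathcal{D}}
  \Big[\log \sigma\Big(
    \beta \log \frac{\pi_\theta(\tau^w\mid x)}{\pi_{\mathrm{ref}}(\tau^w\mid x)}
    - \beta \log \frac{\pi_\theta(\tau^l\mid x)}{\pi_{\mathrm{ref}}(\tau^l\mid x)}
  \Big)\Big]
```

Because Z(x) cancels, no reward model and no normalization constant ever need to be learned: the policy's own log-ratios to the reference play the role of rewards.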
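The second objective (implement DPO) reduces to a few lines once sequence log-probabilities under the trained and reference policies are available. A minimal sketch, assuming those four scalars are given per preference pair (the function name `dpo_loss` and the default β are illustrative, not fixed by the chapter):

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a single preference pair.

    logp_w / logp_l: log pi_theta(tau | x) summed over the tokens of the
    preferred (w) and rejected (l) responses under the trained policy.
    ref_logp_w / ref_logp_l: the same quantities under the frozen
    reference policy pi_ref. beta controls the implicit KL penalty.
    """
    # Implicit rewards: beta times the log-ratio to the reference policy
    reward_w = beta * (logp_w - ref_logp_w)
    reward_l = beta * (logp_l - ref_logp_l)
    # Bradley-Terry negative log-likelihood of preferring tau^w:
    # -log sigma(reward_w - reward_l)
    margin = reward_w - reward_l
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference on both responses the margin is zero and the loss is log 2; raising the preferred response's log-probability relative to the reference lowers the loss, which is exactly the supervised training signal DPO applies, with no PPO loop.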