Chapter 80: RL from Human Feedback (RLHF) Basics

Learning objectives:

- Implement a Bradley-Terry model to learn a reward function from pairwise comparisons of two trajectories (or segments): given (τ^w, τ^l) meaning "τ^w is preferred over τ^l," fit r so that E[r(τ^w)] > E[r(τ^l)].
- Use the learned reward to train a policy with PPO (or another policy gradient method): maximize expected return under r.
- Explain the RLHF pipeline: collect preferences → train reward model → train policy on the reward model.
- Test on a simple environment with simulated preferences (e.g., prefer longer/higher-return trajectories) and verify that the policy improves.
- Relate RLHF to dialogue (prefer helpful/harmless responses) and recommendation (prefer engaging content).

Concept and real-world RL ...
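The Bradley-Terry fitting step can be sketched with a linear reward over trajectory features. This is a minimal numpy sketch, not the chapter's implementation: the feature representation, learning rate, and the simulated "prefer higher feature 0" rule are all illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_bradley_terry(phi_w, phi_l, lr=0.1, steps=500):
    """Fit a linear reward r(tau) = w . phi(tau) from preference pairs.

    phi_w, phi_l: (N, d) feature arrays for the preferred / dispreferred
    trajectories. Minimizes the Bradley-Terry loss -log sigma(r_w - r_l)
    by plain gradient descent.
    """
    w = np.zeros(phi_w.shape[1])
    for _ in range(steps):
        diff = phi_w - phi_l                    # (N, d)
        p = sigmoid(diff @ w)                   # P(tau_w preferred | w)
        grad = -(1.0 - p) @ diff / len(diff)    # gradient of the mean BT loss
        w -= lr * grad
    return w

# Simulated preferences: the hidden "human" prefers trajectories with a
# higher value of feature 0 (a stand-in for length / return).
rng = np.random.default_rng(0)
phi_a = rng.normal(size=(200, 2))
phi_b = rng.normal(size=(200, 2))
swap = phi_a[:, 0] < phi_b[:, 0]
phi_w = np.where(swap[:, None], phi_b, phi_a)   # preferred trajectory features
phi_l = np.where(swap[:, None], phi_a, phi_b)   # dispreferred trajectory features

w = fit_bradley_terry(phi_w, phi_l)
acc = np.mean((phi_w - phi_l) @ w > 0)          # preference accuracy of learned r
```

The learned weight vector should put positive weight on feature 0, recovering the hidden preference; the resulting r can then be plugged into any policy-gradient trainer as the environment reward.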

March 10, 2026 · 4 min · 708 words · codefrydev

Chapter 96: Implementing RLHF in NLP

Learning objectives:

- Collect (or simulate) human preference data: pairs of model responses to the same prompt, with a label indicating which response is preferred.
- Train a reward model using the Bradley-Terry loss: P(τ^w preferred over τ^l) = σ(r(τ^w) - r(τ^l)), where r is the reward model (e.g., an LM that outputs a scalar, or a separate head).
- Fine-tune the language model with PPO using the learned reward model as the reward (plus a KL penalty to the initial LM).
- Evaluate on held-out prompts: generate with the fine-tuned LM and score with the reward model; optionally compare with the initial LM.
- Relate to the dialogue anchor and real RLHF pipelines (InstructGPT, etc.).

Concept and real-world RL ...
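The two losses involved can be written down compactly. Below is a hedged numpy sketch: `bt_loss` is the Bradley-Terry objective for the reward model, and `kl_shaped_reward` shows one common way to combine the reward-model score with the per-token KL penalty during PPO fine-tuning (the function names, the choice to add the RM score only at the final token, and `beta=0.1` are illustrative assumptions, not a fixed recipe).

```python
import numpy as np

def bt_loss(r_w, r_l):
    """Bradley-Terry loss on a batch of preference pairs:
    mean of -log sigma(r(chosen) - r(rejected))."""
    return np.mean(np.log1p(np.exp(-(r_w - r_l))))

def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-token reward for PPO fine-tuning: a KL penalty toward the
    initial LM at every token, plus the reward-model score on the full
    response credited at the final token (one common convention)."""
    kl = logp_policy - logp_ref        # per-token log-ratio vs. initial LM
    reward = -beta * kl
    reward[-1] += rm_score             # RM scores the whole response
    return reward
```

As a sanity check, `bt_loss` should be small when the chosen response already outscores the rejected one and large in the reverse case, which is what drives the reward model toward the human labels.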

March 10, 2026 · 4 min · 705 words · codefrydev

Chapter 97: Direct Preference Optimization (DPO)

Learning objectives:

- Derive the DPO loss from the Bradley-Terry preference model and the optimal policy under a KL constraint to the reference policy (the closed-form mapping from reward to policy in the BT model).
- Implement DPO: train the language model directly on preference data (prefer τ^w over τ^l) using the DPO loss, without training a separate reward model.
- Compare with PPO (reward model + PPO fine-tuning) in terms of preference accuracy, reward-model score, and implementation complexity.
- Explain the advantage of DPO: no reward model, no PPO loop; just a supervised loss on preferences.
- Relate DPO to dialogue and RLHF (an alternative to reward model + PPO).

Concept and real-world RL ...
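The DPO loss itself is a single line once the four log-probabilities are in hand. A minimal numpy sketch, assuming each response's log-probability has already been summed over its tokens under both the trained policy and the frozen reference policy (argument names and `beta=0.1` are illustrative):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss on a batch of preference pairs.

    logp_w / logp_l: summed log-probs of the chosen / rejected responses
    under the policy being trained; ref_logp_* are the same quantities
    under the frozen reference policy. The implicit reward is
    beta * log(pi / pi_ref), so the BT loss on reward differences becomes
    a supervised loss on these log-ratios.
    """
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return np.mean(np.log1p(np.exp(-logits)))  # = mean of -log sigmoid(logits)
```

The loss falls as the policy raises the chosen response's probability relative to the rejected one (both measured against the reference), which is exactly the "no reward model, no PPO loop" property the objectives describe.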

March 10, 2026 · 4 min · 670 words · codefrydev