Bradley-Terry
Fit a Bradley-Terry reward model from pairwise comparisons; train the policy with PPO.
Simulate preference data; fit a Bradley-Terry reward model; fine-tune with PPO.
Derive the DPO loss from the Bradley-Terry model and the KL-optimal policy; compare with PPO.
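As a rough illustration of the "simulated preference data, Bradley-Terry reward model" step, the sketch below samples pairwise comparisons from hidden per-item rewards using P(i beats j) = sigmoid(r_i - r_j), then recovers the rewards by gradient ascent on the log-likelihood. Everything here is an assumption made for illustration: the function names `simulate_preferences` and `fit_bradley_terry`, the four example reward values, the learning rate, and the use of one scalar reward per item rather than a neural reward model. It does not cover the PPO or DPO stages.

```python
import numpy as np


def simulate_preferences(true_rewards, n_pairs, rng):
    """Sample (winner, loser) index pairs under the Bradley-Terry model."""
    n_items = len(true_rewards)
    i = rng.integers(0, n_items, size=n_pairs)
    j = rng.integers(0, n_items, size=n_pairs)
    keep = i != j  # drop self-comparisons
    i, j = i[keep], j[keep]
    # Bradley-Terry: P(i beats j) = sigmoid(r_i - r_j)
    p_i_wins = 1.0 / (1.0 + np.exp(-(true_rewards[i] - true_rewards[j])))
    i_wins = rng.random(len(i)) < p_i_wins
    winners = np.where(i_wins, i, j)
    losers = np.where(i_wins, j, i)
    return winners, losers


def fit_bradley_terry(winners, losers, n_items, lr=0.5, n_steps=2000):
    """Maximize the mean of log sigmoid(r_winner - r_loser) by gradient ascent."""
    rewards = np.zeros(n_items)
    for _ in range(n_steps):
        diff = rewards[winners] - rewards[losers]
        # d/d(diff) log sigmoid(diff) = sigmoid(-diff)
        g = 1.0 / (1.0 + np.exp(diff))
        grad = np.zeros(n_items)
        np.add.at(grad, winners, g)   # winners' rewards are pushed up
        np.add.at(grad, losers, -g)   # losers' rewards are pushed down
        rewards += lr * grad / len(winners)
        # Rewards are only identified up to an additive shift, so center them.
        rewards -= rewards.mean()
    return rewards


# Hypothetical ground-truth rewards, chosen only for this example.
rng = np.random.default_rng(0)
true_rewards = np.array([-1.0, 0.0, 0.5, 1.5])
winners, losers = simulate_preferences(true_rewards, 20000, rng)
estimated = fit_bradley_terry(winners, losers, len(true_rewards))
print("true (centered):", true_rewards - true_rewards.mean())
print("estimated:      ", estimated)
```

Because the Bradley-Terry likelihood depends only on reward differences, the estimates are compared against the centered true rewards rather than the raw values.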