Bradley-Terry

Fit a Bradley-Terry reward model from pairwise comparisons, then train a policy against it with PPO.

Simulate preference data, fit a Bradley-Terry reward model to it, and fine-tune a policy with PPO.
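As a rough illustration of the first two steps, here is a minimal NumPy sketch. It assumes a toy setting that is not from the course material: a small fixed set of items, each with one scalar "true" reward, instead of prompts and model responses. The function names `simulate_preferences` and `fit_bradley_terry` are hypothetical, and plain gradient ascent stands in for whatever optimizer the course actually uses. The PPO step is not shown.

```python
import numpy as np


def simulate_preferences(true_rewards, n_pairs, rng):
    """Sample (winner, loser) index pairs under the Bradley-Terry model.

    P(i beats j) = sigmoid(r_i - r_j). Pairs with i == j are dropped,
    so fewer than n_pairs comparisons may be returned.
    """
    n_items = len(true_rewards)
    i = rng.integers(0, n_items, size=n_pairs)
    j = rng.integers(0, n_items, size=n_pairs)
    keep = i != j
    i, j = i[keep], j[keep]
    p_i_wins = 1.0 / (1.0 + np.exp(-(true_rewards[i] - true_rewards[j])))
    i_wins = rng.random(i.shape[0]) < p_i_wins
    winners = np.where(i_wins, i, j)
    losers = np.where(i_wins, j, i)
    return winners, losers


def fit_bradley_terry(winners, losers, n_items, lr=0.5, n_steps=500):
    """Maximize the mean log-likelihood log sigmoid(r_w - r_l) by gradient ascent."""
    rewards = np.zeros(n_items)
    for _ in range(n_steps):
        diff = rewards[winners] - rewards[losers]
        # d/d(diff) of log sigmoid(diff) is sigmoid(-diff).
        g = 1.0 / (1.0 + np.exp(diff))
        grad = (np.bincount(winners, weights=g, minlength=n_items)
                - np.bincount(losers, weights=g, minlength=n_items))
        rewards += lr * grad / len(winners)
        # Rewards are identified only up to an additive constant; center them.
        rewards -= rewards.mean()
    return rewards


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_rewards = np.array([-1.5, -0.5, 0.5, 1.5])  # illustrative values
    winners, losers = simulate_preferences(true_rewards, 20000, rng)
    print(fit_bradley_terry(winners, losers, n_items=len(true_rewards)))
```

Centering the estimates after each step reflects that the Bradley-Terry likelihood depends only on reward differences, so the fitted values are comparable to the true ones only after both are shifted to the same mean.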

Derive the DPO loss from the Bradley-Terry model and the KL-optimal policy, then compare DPO with PPO.
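For orientation, the derivation can be sketched in a few lines. The notation here is an assumption, since the outline does not fix any: $r$ is the reward, $\pi_{\mathrm{ref}}$ the reference policy, $\beta > 0$ the KL coefficient, $\sigma$ the logistic function, and $(x, y_w, y_l)$ a prompt with its preferred and rejected responses.

```latex
% KL-regularized objective and its closed-form optimum
\max_{\pi}\; \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\quad\Longrightarrow\quad
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\big( r(x, y) / \beta \big)

% Solve for the reward in terms of the optimal policy
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  + \beta \log Z(x)

% Bradley-Terry depends only on reward differences, so Z(x) cancels
P(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)
  = \sigma\!\left( \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right)

% DPO loss: negative log-likelihood of the preferences under a trainable policy
\mathcal{L}_{\mathrm{DPO}}(\pi_{\theta})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[ \log \sigma\!\left(
  \beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```

Because the partition function $Z(x)$ cancels in the reward difference, the policy can be trained directly on preference pairs without a separate reward model or sampling loop, which is the main structural difference from the PPO pipeline.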