Direct Preference Optimization

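Derive the DPO loss from the Bradley-Terry preference model and the closed-form optimal policy of the KL-regularized reward-maximization objective, then compare DPO with PPO.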
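
As a rough illustration of where that derivation ends up, the sketch below computes the per-pair DPO loss, -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x))]), from four sequence log-probabilities. The function names, argument names, and the default beta = 0.1 are assumptions made for this example, not taken from the source; a real training loop would compute the log-probabilities from a policy model and a frozen reference model and average the loss over a batch.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed default; controls the strength of the KL penalty
) -> float:
    """Per-pair DPO loss from summed sequence log-probabilities (illustrative)."""
    # Implicit rewards: beta * log(pi_theta(y|x) / pi_ref(y|x)).
    # The partition function Z(x) cancels in the difference below.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)

    # Bradley-Terry: P(chosen beats rejected) = sigmoid(reward margin).
    margin = chosen_reward - rejected_reward

    # Negative log-likelihood: -log sigmoid(margin) = softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, so the loss is log 2.
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy raises the chosen response's log-probability: the loss drops.
    print(dpo_loss(-4.0, -7.0, -5.0, -7.0))
```

Unlike PPO, this objective needs no separately trained reward model and no sampling from the policy during optimization; it is a supervised loss over a fixed dataset of preference pairs.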