Chapter 80: RL from Human Feedback (RLHF) Basics
Learning objectives

- Implement a Bradley-Terry model to learn a reward function from pairwise comparisons of trajectories (or segments): given a pair (τ^w, τ^l) meaning "τ^w is preferred over τ^l," fit r so that E[r(τ^w)] > E[r(τ^l)].
- Use the learned reward to train a policy with PPO (or another policy-gradient method): maximize expected return under r.
- Explain the RLHF pipeline: collect preferences → train reward model → train policy on the reward model.
- Test on a simple environment with simulated preferences (e.g., prefer longer or higher-return trajectories) and verify that the policy improves.
- Relate RLHF to dialogue (prefer helpful, harmless responses) and recommendation (prefer engaging content).

Concept and real-world RL ...
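The first two pipeline stages can be sketched concretely. Below is a minimal, self-contained example of the reward-modeling stage: trajectories are short lists of scalar states, the learned reward is linear (r_θ(τ) = θ · Σ_t s_t), and simulated preferences always pick the segment with the higher sum, mimicking a "prefer higher-return trajectories" labeler. The feature choice and the plain gradient-ascent fit are illustrative assumptions, not a prescribed implementation; a real reward model would be a neural network trained with the same Bradley-Terry log-likelihood, log σ(r(τ^w) − r(τ^l)).

```python
import math
import random

random.seed(0)

def traj_reward(theta, traj):
    """Learned reward of a segment: linear in the states (an assumed toy form)."""
    return theta * sum(traj)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_pair():
    """Simulated labeler: returns (winner, loser), preferring the higher-sum segment."""
    a = [random.uniform(-1, 1) for _ in range(5)]
    b = [random.uniform(-1, 1) for _ in range(5)]
    return (a, b) if sum(a) > sum(b) else (b, a)

pairs = [sample_pair() for _ in range(500)]

# Fit theta by maximizing the Bradley-Terry log-likelihood
#   sum over pairs of log sigma(r(tau_w) - r(tau_l))
# with plain batch gradient ascent.
theta, lr = 0.0, 0.1
for _ in range(200):
    grad = 0.0
    for w, l in pairs:
        diff = traj_reward(theta, w) - traj_reward(theta, l)
        # d/dtheta log sigma(diff) = (1 - sigma(diff)) * (sum(w) - sum(l))
        grad += (1.0 - sigmoid(diff)) * (sum(w) - sum(l))
    theta += lr * grad / len(pairs)

# The learned reward should now rank held-out pairs the way the labeler does.
test_pairs = [sample_pair() for _ in range(100)]
acc = sum(traj_reward(theta, w) > traj_reward(theta, l) for w, l in test_pairs) / 100
print(theta > 0, acc)
```

Because the simulated preferences are a deterministic function of the segment sum, any positive θ ranks held-out pairs perfectly; with noisy (e.g., Bradley-Terry-sampled) labels, accuracy would plateau below 1.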
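The policy-training stage can be sketched the same way. For brevity this uses vanilla REINFORCE rather than PPO, on an assumed one-step environment with two actions where action 1 produces higher states; the "learned" reward r_hat(s) = s stands in for the output of a fitted Bradley-Terry model with positive weight. All names and the environment itself are illustrative assumptions.

```python
import math
import random

random.seed(1)

def r_hat(s):
    """Stand-in for a learned reward model: larger state = better."""
    return s

def step(action):
    """One-step toy environment: action 1 tends to yield a higher state."""
    return random.gauss(1.0 if action == 1 else -1.0, 0.5)

# Policy: Bernoulli over {0, 1} parameterized by a single logit.
logit, lr = 0.0, 0.2
for _ in range(300):
    p1 = 1.0 / (1.0 + math.exp(-logit))
    a = 1 if random.random() < p1 else 0
    reward = r_hat(step(a))
    # REINFORCE update: grad log pi(a) * reward (no baseline, for brevity).
    # For a Bernoulli policy, d log pi(a) / d logit = (1 - p1) if a == 1 else -p1.
    grad_logp = (1.0 - p1) if a == 1 else -p1
    logit += lr * grad_logp * reward

p1 = 1.0 / (1.0 + math.exp(-logit))
print(p1)  # the policy should now strongly prefer action 1
```

This completes the pipeline end to end: simulated preferences → fitted reward model → policy improved against that model. PPO would replace the raw REINFORCE update with a clipped surrogate objective and, in practice, add a KL penalty to a reference policy.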