Chapter 80: RL from Human Feedback (RLHF) Basics

Learning objectives:

- Implement a Bradley-Terry model to learn a reward function from pairwise comparisons of two trajectories (or segments): given (τ^w, τ^l) meaning "τ^w is preferred over τ^l," fit r so that E[r(τ^w)] > E[r(τ^l)].
- Use the learned reward to train a policy with PPO (or another policy gradient method): maximize expected return under r.
- Explain the RLHF pipeline: collect preferences → train reward model → train policy on the reward model.
- Test on a simple environment with simulated preferences (e.g., prefer longer/higher-return trajectories) and verify that the policy improves.
- Relate RLHF to dialogue (prefer helpful/harmless responses) and recommendation (prefer engaging content).

Concept and real-world RL ...
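The Bradley-Terry fitting step can be sketched with a linear reward over trajectory features. This is a minimal numpy sketch, not the chapter's implementation: the feature representation, learning rate, and the simulated "prefer higher feature 0" rule are all illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_bradley_terry(phi_w, phi_l, lr=0.1, steps=500):
    """Fit a linear reward r(tau) = w . phi(tau) from preference pairs.

    phi_w, phi_l: (N, d) feature arrays for the preferred / dispreferred
    trajectories. Minimizes the Bradley-Terry loss -log sigma(r_w - r_l)
    by plain gradient descent.
    """
    w = np.zeros(phi_w.shape[1])
    for _ in range(steps):
        diff = phi_w - phi_l                    # (N, d)
        p = sigmoid(diff @ w)                   # P(tau_w preferred | w)
        grad = -(1.0 - p) @ diff / len(diff)    # gradient of the mean BT loss
        w -= lr * grad
    return w

# Simulated preferences: the hidden "human" prefers trajectories with a
# higher value of feature 0 (a stand-in for length / return).
rng = np.random.default_rng(0)
phi_a = rng.normal(size=(200, 2))
phi_b = rng.normal(size=(200, 2))
swap = phi_a[:, 0] < phi_b[:, 0]
phi_w = np.where(swap[:, None], phi_b, phi_a)   # preferred trajectory features
phi_l = np.where(swap[:, None], phi_a, phi_b)   # dispreferred trajectory features

w = fit_bradley_terry(phi_w, phi_l)
acc = np.mean((phi_w - phi_l) @ w > 0)          # preference accuracy of learned r
```

The learned weight vector should put positive weight on feature 0, recovering the hidden preference; the resulting r can then be plugged into any policy-gradient trainer as the environment reward.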

March 10, 2026 · 4 min · 708 words · codefrydev

Chapter 96: Implementing RLHF in NLP

Learning objectives:

- Collect (or simulate) human preference data: pairs of model responses to the same prompt, with a label indicating which response is preferred.
- Train a reward model using the Bradley-Terry loss: P(τ^w preferred over τ^l) = σ(r(τ^w) - r(τ^l)), where r is the reward model (e.g., an LM that outputs a scalar, or a separate head).
- Fine-tune the language model with PPO using the learned reward model as the reward (plus a KL penalty to the initial LM).
- Evaluate on held-out prompts: generate with the fine-tuned LM and score with the reward model; optionally compare with the initial LM.
- Relate to the dialogue anchor and real RLHF pipelines (InstructGPT, etc.).

Concept and real-world RL ...
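The two losses involved can be written down compactly. Below is a hedged numpy sketch: `bt_loss` is the Bradley-Terry objective for the reward model, and `kl_shaped_reward` shows one common way to combine the reward-model score with the per-token KL penalty during PPO fine-tuning (the function names, the choice to add the RM score only at the final token, and `beta=0.1` are illustrative assumptions, not a fixed recipe).

```python
import numpy as np

def bt_loss(r_w, r_l):
    """Bradley-Terry loss on a batch of preference pairs:
    mean of -log sigma(r(chosen) - r(rejected))."""
    return np.mean(np.log1p(np.exp(-(r_w - r_l))))

def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-token reward for PPO fine-tuning: a KL penalty toward the
    initial LM at every token, plus the reward-model score on the full
    response credited at the final token (one common convention)."""
    kl = logp_policy - logp_ref        # per-token log-ratio vs. initial LM
    reward = -beta * kl
    reward[-1] += rm_score             # RM scores the whole response
    return reward
```

As a sanity check, `bt_loss` should be small when the chosen response already outscores the rejected one and large in the reverse case, which is what drives the reward model toward the human labels.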

March 10, 2026 · 4 min · 705 words · codefrydev

Chapter 97: Direct Preference Optimization (DPO)

Learning objectives:

- Derive the DPO loss from the Bradley-Terry preference model and the optimal policy under a KL constraint to the reference policy (the closed-form mapping from reward to policy in the BT model).
- Implement DPO: train the language model directly on preference data (prefer τ^w over τ^l) using the DPO loss, without training a separate reward model.
- Compare with PPO (reward model + PPO fine-tuning) in terms of preference accuracy, reward-model score, and implementation complexity.
- Explain the advantage of DPO: no reward model, no PPO loop; just a supervised loss on preferences.
- Relate DPO to dialogue and RLHF (an alternative to reward model + PPO).

Concept and real-world RL ...
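The DPO loss itself is a single line once the four log-probabilities are in hand. A minimal numpy sketch, assuming each response's log-probability has already been summed over its tokens under both the trained policy and the frozen reference policy (argument names and `beta=0.1` are illustrative):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss on a batch of preference pairs.

    logp_w / logp_l: summed log-probs of the chosen / rejected responses
    under the policy being trained; ref_logp_* are the same quantities
    under the frozen reference policy. The implicit reward is
    beta * log(pi / pi_ref), so the BT loss on reward differences becomes
    a supervised loss on these log-ratios.
    """
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return np.mean(np.log1p(np.exp(-logits)))  # = mean of -log sigmoid(logits)
```

The loss falls as the policy raises the chosen response's probability relative to the rejected one (both measured against the reference), which is exactly the "no reward model, no PPO loop" property the objectives describe.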

March 10, 2026 · 4 min · 670 words · codefrydev