Chapter 96: Implementing RLHF in NLP
Learning objectives

- Collect (or simulate) human preference data: pairs of model responses to the same prompt, with a label indicating which response is preferred.
- Train a reward model under the Bradley-Terry preference model, P(τ^w ≻ τ^l) = σ(r(τ^w) - r(τ^l)), by minimizing the negative log-likelihood -log σ(r(τ^w) - r(τ^l)), where r is the reward model (e.g. an LM with a scalar output head).
- Fine-tune the language model with PPO, using the learned reward model as the reward signal together with a KL penalty toward the initial LM.
- Evaluate on held-out prompts: generate with the fine-tuned LM and score with the reward model; optionally compare against the initial LM.
- Relate the exercise to the dialogue anchor and to real RLHF pipelines (InstructGPT, etc.).

Concept and real-world RL ...
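The two formulas in the objectives above can be sketched in a few lines of plain Python. This is a minimal illustration, not a full training loop: `bt_loss` is the Bradley-Terry negative log-likelihood for one preference pair given scalar rewards, and `kl_shaped_reward` is one common way to fold the KL penalty into the per-sequence PPO reward (the coefficient `beta` and the use of summed log-probabilities are assumptions; real pipelines often apply the penalty per token).

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bt_loss(r_winner: float, r_loser: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    P(winner preferred) = sigma(r_winner - r_loser), so the loss is
    -log sigma(r_winner - r_loser). Minimizing it pushes the reward
    model to score preferred responses higher than rejected ones.
    """
    return -math.log(sigmoid(r_winner - r_loser))

def kl_shaped_reward(reward: float,
                     logprob_policy: float,
                     logprob_ref: float,
                     beta: float = 0.1) -> float:
    """PPO reward with a KL penalty toward the initial (reference) LM.

    R = r_theta(x, y) - beta * (log pi(y|x) - log pi_ref(y|x)).
    The penalty keeps the fine-tuned policy from drifting far from
    the reference model while chasing reward. `beta` is a tunable
    hyperparameter (0.1 here is an arbitrary illustrative choice).
    """
    return reward - beta * (logprob_policy - logprob_ref)

# A reward model that correctly ranks the pair incurs a small loss;
# the same pair mis-ranked incurs a larger one.
good = bt_loss(r_winner=2.0, r_loser=1.0)
bad = bt_loss(r_winner=1.0, r_loser=2.0)
print(good, bad)

# If the policy's log-prob exceeds the reference's, the reward is reduced.
print(kl_shaped_reward(1.0, logprob_policy=-2.0, logprob_ref=-2.5))
```

In a real pipeline `r_winner` and `r_loser` would come from a forward pass of the reward model over the two responses, and the loss would be averaged over a batch and backpropagated; the arithmetic, however, is exactly what is shown here.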