Chapter 95: Training Large Language Models with PPO
Learning objectives
- Implement a PPO loop to fine-tune a small language model (e.g. GPT-2 small or DistilGPT-2) for text generation with a simple reward (e.g. positive sentiment, or output length).
- Include a KL penalty (or KL constraint) so that the updated policy does not deviate too far from the initial (reference) policy, preventing mode collapse and maintaining fluency.
- Generate sequences with the current policy, compute a reward for each sequence, and update the policy with PPO (clipped surrogate + KL penalty).
- Observe that without the KL penalty the policy may collapse (e.g. always emitting the same high-reward token), whereas with the KL penalty it stays diverse.
- Relate this to dialogue and RLHF: the same PPO+KL setup is used to align LMs with human preferences.
Concept and real-world RL ...
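The loop described in the objectives can be sketched at toy scale before bringing in a real LM. The following is a minimal illustration, not the chapter's reference code: the "language model" is a single-token categorical policy over a small vocabulary, so the clipped surrogate, the KL penalty toward a frozen reference policy, and the collapse-versus-diversity effect can all be demonstrated without downloading GPT-2. The vocabulary size, reward definition, and every hyperparameter (`kl_coef`, `clip_eps`, learning rate) are illustrative assumptions.

```python
# Toy PPO with a KL penalty: a categorical policy over a tiny vocabulary.
# This is a didactic sketch, not a production RLHF implementation.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8                    # toy vocabulary size (an assumption)
reward = np.zeros(VOCAB)
reward[3] = 1.0              # token 3 is the "high-reward" output

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

def kl_and_grad(p, q):
    # KL(p || q) and its gradient w.r.t. the logits of p (p = softmax(logits)):
    # d/d logit_j KL = p_j * (log(p_j/q_j) - KL)
    s = np.log(p / q)
    kl = np.sum(p * s)
    return kl, p * (s - kl)

def train(kl_coef, steps=200, epochs=4, lr=0.5, clip_eps=0.2, batch=64):
    logits = np.zeros(VOCAB)              # current policy parameters
    ref = softmax(np.zeros(VOCAB))        # frozen reference policy (uniform)
    for _ in range(steps):
        old = softmax(logits)             # snapshot policy, then sample a batch
        tokens = rng.choice(VOCAB, size=batch, p=old)
        # Crude advantage: reward centered by the batch mean (no learned value net)
        adv = reward[tokens] - reward[tokens].mean()
        for _ in range(epochs):           # a few PPO epochs on the same batch
            p = softmax(logits)
            grad = np.zeros(VOCAB)
            for t, A in zip(tokens, adv):
                r = p[t] / old[t]         # importance ratio
                # Gradient of min(r*A, clip(r, 1-eps, 1+eps)*A): flows only
                # when the unclipped branch is the active (smaller) one.
                if (A >= 0 and r <= 1 + clip_eps) or (A < 0 and r >= 1 - clip_eps):
                    onehot = np.zeros(VOCAB)
                    onehot[t] = 1.0
                    grad += A * r * (onehot - p)
            grad /= batch
            kl, kl_grad = kl_and_grad(p, ref)
            # Ascend the clipped surrogate, descend the KL penalty
            logits += lr * (grad - kl_coef * kl_grad)
    return softmax(logits)

p_no_kl = train(kl_coef=0.0)    # collapses onto the high-reward token
p_with_kl = train(kl_coef=1.0)  # still prefers token 3, but stays diverse
```

With `kl_coef=0.0` the learned distribution concentrates almost all mass on token 3 (near-zero entropy, the toy analogue of mode collapse); with a KL penalty it settles at an interior trade-off where token 3 is the mode but the other tokens keep substantial probability. In a real RLHF pipeline the same objective is applied per token over generated sequences with a learned value baseline, and the KL is measured against the frozen pre-trained LM rather than a uniform reference.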