KL Penalty

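Fine-tune a small language model (e.g. GPT-2) with PPO to generate text with a target sentiment, using a KL penalty to keep the fine-tuned policy close to the original model.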
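
One common way to apply a KL penalty in PPO fine-tuning is to subtract a scaled per-token log-probability difference between the policy and a frozen reference model from the reward. The sketch below is a minimal illustration under that assumption, not necessarily the method the chapter uses. The function name `kl_penalized_rewards`, the coefficient `beta`, and the choice to add the sentiment score only at the final token are all hypothetical.

```python
from typing import List


def kl_penalized_rewards(
    policy_logprobs: List[float],
    ref_logprobs: List[float],
    score: float,
    beta: float = 0.1,
) -> List[float]:
    """Per-token rewards for one generated sequence (illustrative sketch).

    policy_logprobs[t] and ref_logprobs[t] are the log-probabilities that
    the fine-tuned policy and the frozen reference model assign to the
    token actually sampled at step t. `score` is a scalar sentiment reward
    for the whole sequence. `beta` scales the KL penalty (assumed value).
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must have the same length")
    if not policy_logprobs:
        raise ValueError("sequence must contain at least one token")

    # Sample-based KL estimate per token: log pi(a|s) - log pi_ref(a|s).
    # A positive value means the policy has drifted toward this token.
    rewards = [
        -beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)
    ]

    # The sentiment score is credited at the last generated token.
    rewards[-1] += score
    return rewards


if __name__ == "__main__":
    example = kl_penalized_rewards(
        policy_logprobs=[-1.0, -2.0],
        ref_logprobs=[-1.5, -2.0],
        score=1.0,
        beta=0.2,
    )
    print(example)
```

A larger `beta` keeps generations closer to the reference model at the cost of slower sentiment gains; a smaller `beta` allows more drift.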

Go to Chapter 95: Training Large Language Models with PPO →

© 2026 Reinforcement Learning Curriculum