Chapter 41: The Problem with Standard Policy Gradients

Learning objectives

- Demonstrate how a too-large step size in policy gradient updates can cause policy collapse (e.g. one action reaching probability near 1 too quickly) and loss of exploration.
- Visualize policy probabilities over time in a simple bandit problem under different learning rates.
- Relate this to the motivation for trust region and clipped methods (e.g. PPO, TRPO).

Concept and real-world RL

The standard policy gradient update \(\theta \leftarrow \theta + \alpha \nabla_\theta J\) can be unstable: a single bad batch or a large step can make the policy assign near-zero probability to previously good actions (policy collapse). In a multi-armed bandit (or a simple MDP) this is easy to see: with a large \(\alpha\), the policy can become deterministic too fast and get stuck on a suboptimal action. In robot control and game AI we want to avoid catastrophic updates; PPO (clipped objective) and TRPO (KL constraint) limit how much the policy can change per update. This chapter illustrates the problem in a minimal setting.
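The collapse is easy to reproduce. Below is a minimal sketch (the arm rewards, number of steps, and learning rates are illustrative assumptions, not the chapter's exact setup): a 3-armed bandit with a softmax policy over logits \(\theta\), updated with the REINFORCE gradient \(\nabla_\theta \log \pi(a) = \mathbf{1}_a - \pi\). Running it with a small and a large \(\alpha\) shows the large-step policy saturating almost immediately.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])  # arm 2 has the highest mean reward

def softmax(theta):
    z = theta - theta.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def run(alpha, steps=500):
    """REINFORCE on a 3-armed bandit; returns the policy probabilities over time."""
    theta = np.zeros(3)
    history = []
    for _ in range(steps):
        p = softmax(theta)
        a = rng.choice(3, p=p)
        r = rng.normal(true_means[a], 1.0)   # noisy reward
        grad = -p                            # grad log pi(a) = onehot(a) - p
        grad[a] += 1.0
        theta += alpha * r * grad            # theta <- theta + alpha * r * grad log pi
        history.append(p)
    return np.array(history)

small = run(alpha=0.1)   # stays stochastic for many steps: exploration survives
large = run(alpha=5.0)   # logits blow up within a few updates: policy collapse
print("final probs, alpha=0.1:", small[-1].round(3))
print("final probs, alpha=5.0:", large[-1].round(3))
```

Plotting each column of `history` against the step index gives the probability-over-time visualization described in the objectives; with the large learning rate the max probability pins near 1 early, and which arm wins depends on the first few noisy rewards rather than on the true means.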
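To see how PPO's clipped objective tames such updates, here is a sketch of the surrogate \(\min\!\big(r A,\ \mathrm{clip}(r, 1-\epsilon, 1+\epsilon)\, A\big)\), where \(r = \pi_{\text{new}}(a)/\pi_{\text{old}}(a)\) is the probability ratio and \(A\) the advantage (the helper name and the example ratios are my own, not from the chapter):

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: min of unclipped and clipped terms.

    Taking the minimum removes any incentive to push the ratio
    pi_new/pi_old outside [1 - eps, 1 + eps], so one update cannot
    move the policy arbitrarily far.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# For a positive advantage, pushing the ratio beyond 1 + eps earns nothing extra:
print(clipped_surrogate(np.array([0.5, 1.0, 1.5, 3.0]), advantage=1.0))
```

Note the asymmetry: for a positive advantage the objective plateaus above \(1+\epsilon\), while for a negative advantage the `min` keeps the more pessimistic (clipped) penalty, so the update is bounded in both directions. TRPO achieves a similar effect by constraining the KL divergence between old and new policies instead of clipping.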

March 10, 2026 · 3 min · 563 words · codefrydev