Learning objectives
- Demonstrate how a too-large step size in policy gradient updates can cause policy collapse (e.g. one action gets probability near 1 too quickly) and loss of exploration.
- Visualize policy probabilities over time in a simple bandit problem under different learning rates.
- Relate this to the motivation for trust region and clipped methods (e.g. PPO, TRPO).
Concept and real-world RL
Standard policy gradient \(\theta \leftarrow \theta + \alpha \nabla_\theta J\) can be unstable: a single bad batch or a large step can make the policy assign near-zero probability to previously good actions (policy collapse). In a multi-armed bandit (or a simple MDP), this is easy to see: with a large \(\alpha\), the policy can become deterministic too fast and get stuck. In robot control and game AI, we want to avoid catastrophic updates; PPO (clipped objective) and TRPO (KL constraint) limit how much the policy can change per update. This chapter illustrates the problem in a minimal setting.
Where you see this in practice: PPO and TRPO are widely used in robotics and RL benchmarks precisely because they avoid the instability of vanilla policy gradients.
Illustration (policy collapse): With too large a step size, policy probabilities can collapse (e.g. one action gets probability 1). The chart below shows a healthy policy vs a collapsed one over 3 actions.
Exercise: In a simple bandit problem, implement a policy gradient update with too large a step size. Show how it can lead to a collapse in performance. Visualize the policy probabilities over time.
Professor’s hints
- Bandit: K arms, policy \(\pi(a) = \mathrm{softmax}(\theta)\). Sample action, get reward, update \(\theta\) with REINFORCE: \(\theta_a \leftarrow \theta_a + \alpha , (r - b) , (1 - \pi(a))\) for the chosen action (and similar for others in the gradient). Use a large \(\alpha\) (e.g. 0.5 or 1.0) and watch \(\pi\) become near-one-hot quickly; then try a smaller \(\alpha\).
- Visualize: plot \(\pi(a)\) for each arm over iterations. With large \(\alpha\), one arm dominates fast; with small \(\alpha\), probabilities change smoothly.
- Performance collapse: if the best arm has high variance, a few bad samples can push the policy away from it; with large steps, recovery is slow or impossible.
Common pitfalls
- Baseline: Use a baseline (e.g. running average of rewards) so the update is not purely driven by raw reward magnitude; the collapse effect is still visible with large \(\alpha\).
- Initialization: Start with uniform or near-uniform \(\theta\) so you can see the policy move; if you start already peaked, the effect is less clear.
Worked solution (warm-up: why large step size collapses policy)
Warm-up: A very large step size can push the softmax probabilities so far toward one action that the policy becomes nearly deterministic; once one action dominates, its gradient gets most of the updates and the others get little signal, so the policy “collapses” to that action. Smaller \(\alpha\) or a baseline/KL penalty keeps updates moderate and avoids this. This is why PPO and TRPO use constrained or clipped updates.
Extra practice
- Warm-up: In one sentence, why does a very large policy gradient step size risk “collapsing” the policy to one action?
- Coding: Implement the bandit policy gradient with \(\alpha \in \{0.01, 0.1, 0.5\}\). For each, plot policy probabilities over 500 steps. Which one collapses fastest?
- Challenge: Add a KL penalty to the update (e.g. penalize \(D_{KL}(\pi_{old} \| \pi_{new})\)). Does a moderate penalty prevent collapse even with larger \(\alpha\)?