Chapter 34: Reducing Variance in Policy Gradients
Learning objectives

- Add a state-value baseline \(V(s)\) to REINFORCE and explain why it reduces variance without introducing bias (when the baseline does not depend on the action).
- Train the baseline network (e.g. by MSE regression to the returns \(G_t\)) alongside the policy.
- Compare the variance of the gradient estimates (e.g. the magnitude of parameter updates, or the variance of \(G_t - b(s_t)\)) with and without a baseline.

Concept and real-world RL

The policy gradient with a baseline is \(\mathbb{E}[\nabla \log \pi(a|s)\,(G_t - b(s))]\). If \(b(s)\) does not depend on the action \(a\), this is still an unbiased estimate of \(\nabla J\); the baseline only changes the variance. A natural choice is \(b(s) = V^\pi(s)\), the expected return from state \(s\); the term \(G_t - V(s_t)\) is then an estimate of the advantage (how much better this trajectory was than average). In game AI or robot control, lower-variance gradients mean faster and more stable learning; baselines are standard in actor-critic methods and PPO. ...
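The unbiasedness claim can be made precise with a one-line derivation: the baseline term has zero expectation under the policy, because the policy's probabilities sum to one.

```latex
\mathbb{E}_{a \sim \pi}\!\left[\nabla_\theta \log \pi(a|s)\, b(s)\right]
  = b(s) \sum_a \pi(a|s)\, \nabla_\theta \log \pi(a|s)
  = b(s) \sum_a \nabla_\theta \pi(a|s)
  = b(s)\, \nabla_\theta \sum_a \pi(a|s)
  = b(s)\, \nabla_\theta 1 = 0 .
```

Subtracting \(b(s)\) therefore shifts each sample of the gradient estimator without shifting its mean, which is exactly why only the variance changes.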
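The variance reduction itself is easy to check numerically. The sketch below uses a toy, hypothetical setup: 1-D states with a known value function, noisy returns standing in for Monte Carlo \(G_t\), and a linear baseline fitted by least squares in place of a value network trained with MSE. It then compares \(\mathrm{Var}(G_t)\) against \(\mathrm{Var}(G_t - b(s_t))\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 1-D states, true value V(s) = 2s, and returns
# G_t = V(s) + noise. In practice G_t would come from policy rollouts.
states = rng.uniform(-1.0, 1.0, size=(500, 1))
true_values = 2.0 * states[:, 0]
returns = true_values + rng.normal(0.0, 0.3, size=500)

# Fit a linear baseline b(s) = w*s + c by minimising MSE to the returns,
# standing in for the value-network regression described above.
X = np.hstack([states, np.ones((500, 1))])
w, *_ = np.linalg.lstsq(X, returns, rcond=None)
baseline = X @ w

# The advantage estimates G_t - b(s_t) are far less spread out than the
# raw returns: the baseline absorbs the state-dependent part of G_t.
var_raw = np.var(returns)
var_adv = np.var(returns - baseline)
print(var_adv < var_raw)  # True: baseline reduces variance
```

Because the returns here are (state-dependent value) + (zero-mean noise), the fitted baseline removes almost all of the state-dependent spread, and only the noise variance remains, mirroring the effect a learned \(V(s_t)\) baseline has on REINFORCE updates.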