Chapter 33: The REINFORCE Algorithm

Learning objectives

- Implement REINFORCE (Monte Carlo policy gradient): estimate \(\nabla_\theta J\) using the return \(G_t\) from full episodes.
- Use a neural-network policy with a softmax output for discrete actions (e.g. CartPole).
- Observe and explain the high variance of the gradient estimates when raw returns \(G_t\) are used (no baseline).

Concept and real-world RL

REINFORCE is the simplest policy gradient algorithm: run an episode under \(\pi_\theta\), compute the return \(G_t\) at each step, and update \(\theta\) with

\[
\theta \leftarrow \theta + \alpha \sum_t G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t).
\]

It is on-policy and Monte Carlo: it needs complete episodes before it can update. The variance of \(G_t\) can be large, especially in long episodes, which makes learning slow or unstable. In game AI, REINFORCE serves as a baseline for more advanced methods (actor-critic, PPO); in robot control it is rarely used on its own because of its poor sample efficiency and high variance. Adding a baseline (e.g. a learned state-value function) reduces variance without introducing bias. ...
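The update rule above translates almost line for line into code. Below is a minimal sketch of REINFORCE on CartPole, assuming gymnasium and PyTorch are available; the network sizes, learning rate, and episode count are illustrative choices, not values from this chapter.

```python
# Minimal REINFORCE sketch for CartPole (assumes gymnasium + PyTorch).
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(                      # softmax policy pi_theta(a|s):
    nn.Linear(4, 64), nn.ReLU(),             # 4-dim observation in,
    nn.Linear(64, 2),                        # logits for the 2 discrete actions out
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(500):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:                          # roll out one full episode (Monte Carlo)
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Returns G_t = sum_{k >= t} gamma^(k-t) r_k, computed by a backward scan.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    returns = torch.as_tensor(returns)

    # Gradient ascent on J via the surrogate loss
    # -sum_t G_t * log pi_theta(a_t | s_t).
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the whole return \(G_t\) multiplies each log-probability gradient, two otherwise similar episodes with different total rewards can pull \(\theta\) in very different directions; this is the high variance the learning objectives ask you to observe.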

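Why subtracting a baseline is "free": since \(\sum_a \pi_\theta(a \mid s) = 1\), we have \(\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)] = 0\), so any baseline \(b(s_t)\) that does not depend on \(a_t\) shifts the coefficients \(G_t \to G_t - b(s_t)\) without changing the expected gradient. A minimal sketch of the simplest version, an episode-mean baseline, as a drop-in change to the loss in the sketch above (the returns here are made-up numbers):

```python
import torch

# Hypothetical per-step returns from one episode.
returns = torch.tensor([23.0, 22.0, 21.0, 20.0])

# Simplest constant baseline: the empirical mean of the episode's returns.
# Subtracting it leaves the expected gradient unbiased but shrinks its
# variance; a learned state-value function V(s_t) generalizes this idea.
advantages = returns - returns.mean()

# In the training loop above, the policy loss would then become:
# loss = -(torch.stack(log_probs) * advantages).sum()
```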