Chapter 44: PPO: Implementation Details
Learning objectives Implement Generalized Advantage Estimation (GAE): compute advantage estimates \(\hat{A}_t\) from a trajectory of rewards and value estimates using \(\gamma\) and \(\lambda\). Write the recurrence: \(\hat{A}t = \delta_t + (\gamma\lambda) \delta{t+1} + (\gamma\lambda)^2 \delta_{t+2} + \cdots\) where \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\). Use GAE in a PPO (or actor-critic) pipeline so advantages are fed into the policy loss. Concept and real-world RL GAE (Generalized Advantage Estimation) provides a bias–variance trade-off for the advantage: \(\hat{A}t^{GAE} = \sum{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}\). When \(\lambda=0\), \(\hat{A}_t = \delta_t\) (1-step TD, low variance, high bias). When \(\lambda=1\), \(\hat{A}_t = G_t - V(s_t)\) (Monte Carlo, high variance, low bias). Tuning \(\lambda\) (e.g. 0.95–0.99) balances the two. In robot control and game AI, GAE is the standard way to compute advantages for PPO and actor-critic; it is implemented with a backward loop over the trajectory. ...