Chapter 44: PPO: Implementation Details

Learning objectives

Implement Generalized Advantage Estimation (GAE): compute advantage estimates \(\hat{A}_t\) from a trajectory of rewards and value estimates using \(\gamma\) and \(\lambda\).
Write the recurrence: \(\hat{A}t = \delta_t + (\gamma\lambda) \delta{t+1} + (\gamma\lambda)^2 \delta_{t+2} + \cdots\) where \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\).
Use GAE in a PPO (or actor-critic) pipeline so advantages are fed into the policy loss.

Concept and real-world RL

GAE (Generalized Advantage Estimation) provides a bias–variance trade-off for the advantage: \(\hat{A}t^{GAE} = \sum{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}\). When \(\lambda=0\), \(\hat{A}_t = \delta_t\) (1-step TD, low variance, high bias). When \(\lambda=1\), \(\hat{A}_t = G_t - V(s_t)\) (Monte Carlo, high variance, low bias). Tuning \(\lambda\) (e.g. 0.95–0.99) balances the two. In robot control and game AI, GAE is the standard way to compute advantages for PPO and actor-critic; it is implemented with a backward loop over the trajectory.

Where you see this in practice: GAE is used in almost every PPO and A2C implementation (OpenAI Baselines, Stable-Baselines3, CleanRL).

Illustration (GAE advantages): GAE(\(\lambda\)) blends TD errors across time. The chart below shows advantage estimates along a 5-step trajectory (\(\lambda=0.95\)).

Exercise: Implement Generalized Advantage Estimation (GAE) for a trajectory. Write a function that takes rewards and value estimates and returns GAE advantages for each timestep using \(\lambda\) and \(\gamma\).

Professor’s hints

Inputs: rewards \(r_0, \ldots, r_{T-1}\), values \(V(s_0), \ldots, V(s_T)\) (or one less for last state), \(\gamma\), \(\lambda\). Don’t forget the bootstrap \(V(s_T)\) for the last step (or use 0 if terminal).
Recursion: \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\). Then \(\hat{A}t = \delta_t + \gamma\lambda \hat{A}{t+1}\). Implement by looping backward from \(t = T-1\) to 0.
Output: array of advantages of length T (one per step). Optionally normalize advantages to zero mean and unit variance before feeding to PPO.

Common pitfalls

Off-by-one in value indices: Ensure \(V(s_{t+1})\) is used in \(\delta_t\); for the last step, use \(V(s_T)\) if non-terminal or 0 if done.
Done flag: If the episode ends at step \(T\), set \(V(s_T)=0\) (or mask the bootstrap). Otherwise you will use the value of a state that starts the next episode.

Worked solution (warm-up: GAE)

Key idea: GAE is \(\hat{A}t = \sum{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}\) with \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\). \(\lambda=0\) gives TD(0); \(\lambda=1\) gives the full return minus baseline. \(\lambda \in (0,1)\) balances bias and variance. We compute it backward in one pass: \(\hat{A}t = \delta_t + \gamma \lambda \hat{A}{t+1}\), with \(\hat{A}_T = 0\) at episode end.

Extra practice

Warm-up: For \(\lambda=0\), what is \(\hat{A}_t\) in terms of \(\delta_t\)? For \(\lambda=1\), what is \(\hat{A}_t\) in terms of \(G_t\) and \(V(s_t)\)?
Coding: Implement gae(rewards, values, gamma=0.99, lambda_=0.95) where values[i] is \(V(s_i)\) and length is len(rewards)+1. Test on a short trajectory of length 5; check that for \(\lambda=0\) you get \(\hat{A}_t = \delta_t\).
Challenge: Vectorize the backward loop (e.g. with NumPy or PyTorch) so you can compute GAE for a batch of trajectories at once.
Variant: Compute GAE for the same 5-step trajectory with \(\lambda \in \{0, 0.5, 0.95, 1.0\}\). Plot the advantage estimates \(\hat{A}_t\) for each. How does the variance of advantages change with \(\lambda\)?

Try it — edit and run (Shift+Enter)

Debug: The GAE backward loop below has an off-by-one error — it uses values[t] as the next-state value instead of values[t+1]. Fix it.

1
2
3
4
5
6
7
8
def gae_buggy(rewards, values, gamma=0.99, lam=0.95):
    advantages, gae_val = [], 0
    for t in reversed(range(len(rewards))):
        # BUG: should be values[t+1], not values[t]
        delta = rewards[t] + gamma * values[t] - values[t]
        gae_val = delta + gamma * lam * gae_val
        advantages.insert(0, gae_val)
    return advantages

Conceptual: How does GAE interpolate between pure TD(0) (\(\lambda=0\)) and full Monte Carlo (\(\lambda=1\))? Explain the bias-variance trade-off as \(\lambda\) increases from 0 to 1.
Recall: Write the GAE recurrence \(\hat{A}t = \delta_t + \gamma\lambda \hat{A}{t+1}\) and the closed-form sum it equals from memory.