Chapter 45: Coding PPO from Scratch

Learning objectives

Implement a full PPO agent for LunarLanderContinuous-v2: policy (actor) and value (critic) networks, rollout buffer, GAE for advantages, and multiple epochs of minibatch updates per rollout.
Tune key hyperparameters (learning rate, clip \(\epsilon\), GAE \(\lambda\), batch size, number of epochs) to achieve successful landings.
Relate each component (clip, GAE, value loss, entropy bonus) to stability and sample efficiency.

Concept and real-world RL

PPO in practice: collect a rollout of transitions (e.g. 2048 steps), compute GAE advantages, then perform several epochs of minibatch updates on the same data (clipped policy loss + value loss + entropy bonus). The rollout buffer stores states, actions, rewards, done flags, log-probs, and values; after each rollout we compute advantages and then iterate over minibatches. LunarLanderContinuous is a 2D landing task with continuous thrust; it is a standard testbed for PPO. In robot control and game AI, this “collect rollout → multiple PPO epochs” loop is the core of PPO-style on-policy training.
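The rollout buffer described above can be sketched as a small container. This is a minimal illustration in plain Python, not a reference implementation: the class name `RolloutBuffer` and its methods are hypothetical, and a real agent would typically store arrays or tensors rather than lists.

```python
import random
from dataclasses import dataclass, field


@dataclass
class RolloutBuffer:
    """Hypothetical container for one on-policy rollout (names are illustrative)."""
    states: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    rewards: list = field(default_factory=list)
    log_probs: list = field(default_factory=list)  # log pi_old(a|s), recorded at collection time
    values: list = field(default_factory=list)     # V(s) from the critic at collection time
    dones: list = field(default_factory=list)

    def add(self, state, action, reward, log_prob, value, done):
        """Append one transition."""
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
        self.log_probs.append(log_prob)
        self.values.append(value)
        self.dones.append(done)

    def __len__(self):
        return len(self.rewards)

    def minibatch_indices(self, batch_size, rng=random):
        """Yield shuffled index lists that together cover the rollout once (one epoch)."""
        indices = list(range(len(self)))
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            yield indices[start:start + batch_size]

    def clear(self):
        """Drop all stored transitions before the next rollout (on-policy data is not reused)."""
        for name in ("states", "actions", "rewards", "log_probs", "values", "dones"):
            getattr(self, name).clear()
```

Calling `minibatch_indices` once per epoch gives a fresh shuffle each time, so K epochs over the same rollout see the data in K different orders.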

Where you see this in practice: LunarLander and similar envs are used in tutorials and benchmarks; the same PPO structure scales to MuJoCo and Atari.

Illustration (PPO on LunarLander): Episode return typically improves over training, with some variance. The chart below shows a typical learning curve (mean return per 10 episodes).

Exercise: Implement a full PPO agent for the LunarLanderContinuous-v2 environment. Use a rollout buffer, compute advantages via GAE, and perform multiple epochs of minibatch updates. Tune hyperparameters to achieve successful landing.

Professor’s hints

Rollout: run the policy for N steps (e.g. 2048), store (s, a, r, log_prob, V(s), done). To bootstrap, also record V(s) for the state reached after the final step (or 0 if that step ended the episode). Then compute returns and GAE advantages from the rewards and V(s).
Update: for K epochs (e.g. 4–10), shuffle and split the rollout into minibatches. For each minibatch, compute ratio = π(a|s) / π_old(a|s), clipped loss, value loss (MSE to returns), entropy; total loss = -L_CLIP + c1 * value_loss - c2 * entropy. Backward and step.
LunarLanderContinuous: state dim 8, action dim 2 (main engine, side boosters). Reward is positive for landing and negative for crashing and for fuel use. Success: land without crashing and get a positive total return.
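The GAE step from the rollout hint can be sketched as a single backward pass over the rollout. This is a minimal plain-Python version under stated assumptions: the function name `compute_gae` is hypothetical, the defaults `gamma=0.99` and `lam=0.95` are common choices rather than values given in this chapter, and returns are taken as advantage plus V(s), which is one common convention.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout.

    rewards, values, dones: equal-length lists; dones[t] is True if the
    episode ended at step t.
    last_value: V(s) for the state after the final step (pass 0.0 if that
    step ended the episode).
    gamma, lam: assumed defaults, not prescribed by the text.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # A done flag stops both the bootstrap and the advantage accumulation
        # from leaking across an episode boundary.
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Value targets: advantage plus the baseline that was subtracted.
    returns = [adv + v for adv, v in zip(advantages, values)]
    return advantages, returns
```

With `gamma=1`, `lam=1`, and a zero value function, the advantages reduce to the plain reward-to-go, which is a handy sanity check when debugging.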

Common pitfalls

Reusing old log_probs: You must store \(\log \pi_{old}(a|s)\) during rollout and use it for the ratio \(r_t = \pi(a|s) / \pi_{old}(a|s)\). Do not recompute the old log-probs with the network after it has been updated; the ratio would then always be 1.
Advantage normalization: Normalize advantages (zero mean, unit variance) per rollout so their scale does not depend on the return magnitude; this makes the learning rate easier to tune.
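Both pitfalls come down to a few lines of arithmetic. The sketch below uses plain Python lists for illustration; the helper names `probability_ratio` and `normalize_advantages` are hypothetical, and the small `eps` guard against division by zero is an assumption, not something specified in this chapter.

```python
import math


def probability_ratio(new_log_prob, old_log_prob):
    """r_t = pi(a|s) / pi_old(a|s), computed in log space.

    old_log_prob must be the value stored during the rollout, not one
    recomputed from the current network.
    """
    return math.exp(new_log_prob - old_log_prob)


def normalize_advantages(advantages, eps=1e-8):
    """Shift and scale one rollout's advantages to zero mean, unit variance."""
    mean = sum(advantages) / len(advantages)
    variance = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    std = math.sqrt(variance)
    return [(a - mean) / (std + eps) for a in advantages]
```

Before the first gradient step of an epoch the new and old log-probs coincide, so the ratio is exactly 1; it only moves away from 1 as the policy is updated on the same data.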

Worked solution (warm-up: PPO objective)

Key idea: PPO maximizes \(\mathbb{E}[ \min(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t) ]\) where \(r_t(\theta) = \pi_\theta(a_t|s_t)/\pi_{old}(a_t|s_t)\). We also add an entropy bonus so the policy stays exploratory. We run multiple epochs on the same batch (with clipping) instead of one pass like A2C; that improves sample efficiency while keeping updates safe.
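To make the objective concrete, here is a minimal plain-Python sketch of the per-sample clipped term and the combined loss from the hints (-L_CLIP + c1 * value_loss - c2 * entropy). The function names are hypothetical, and the defaults `clip_eps=0.2`, `c1=0.5`, and `c2=0.01` are commonly used values assumed for illustration; an actual agent would compute the same quantities on tensors so gradients can flow.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single sample."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_loss(ratios, advantages, values, returns, entropies,
             clip_eps=0.2, c1=0.5, c2=0.01):
    """Total loss to minimize: -L_CLIP + c1 * value_loss - c2 * entropy.

    All arguments are equal-length lists over one minibatch; the
    coefficient defaults are assumptions, not values from the text.
    """
    n = len(ratios)
    l_clip = sum(clipped_surrogate(r, a, clip_eps)
                 for r, a in zip(ratios, advantages)) / n
    value_loss = sum((v - g) ** 2 for v, g in zip(values, returns)) / n
    entropy = sum(entropies) / n
    return -l_clip + c1 * value_loss - c2 * entropy
```

For a ratio of 1.5 and advantage 2.0, the unclipped term is 3.0 but the clipped term is 1.2 × 2.0 = 2.4, so the objective stops rewarding further increases in the ratio. With advantage -2.0 the minimum picks the unclipped -3.0, so moves that make a bad action more likely are still penalized in full.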

Extra practice

Warm-up: Why do we do multiple epochs of updates on the same rollout data? What is the risk if we do too many epochs?
Coding: Implement PPO for LunarLanderContinuous. Plot episode return every 10 episodes. How many episodes until you first get a successful landing (positive return)?
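Challenge: Ablate: (a) remove the entropy bonus; (b) remove the clip by optimizing the unclipped surrogate \(r_t \hat{A}_t\) (note that \(\epsilon = 0\) does not disable clipping; it pins the clipped ratio to 1). How do learning stability and final performance change?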
Variant: Change rollout length from 2048 to 512 or 4096 steps. Does a shorter rollout hurt GAE accuracy? Does a longer one slow learning per update? Compare learning curves.
Debug: The code below computes the ratio using log-probs taken from the updated network; it should use the old log-probs stored during rollout. Fix it.


Conceptual: Why does PPO apply multiple gradient steps on the same rollout batch while REINFORCE only does one? Why does the clipping make it safe to do multiple steps?
Recall: List the three terms of the full PPO loss (policy loss, value loss, entropy bonus), with their signs and coefficients, from memory.