Chapter 43: Proximal Policy Optimization (PPO): Intuition

Learning objectives

- Explain in your own words how the clipped surrogate objective in PPO prevents overly large policy updates without solving a constrained optimization problem (unlike TRPO).
- Write the clipped loss \(L^{CLIP}(\theta)\) and the unclipped (ratio-based) objective; contrast when they differ.
- Relate the clip range \(\epsilon\) (e.g. 0.2) to how much the policy can change in one update.

Concept and real-world RL

PPO (Proximal Policy Optimization) keeps the policy update conservative by clipping the probability ratio \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\). The objective is \(L^{CLIP}(\theta) = \mathbb{E}_t[\min(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t)]\): if the advantage is positive, the ratio is not allowed to exceed \(1+\epsilon\); if it is negative, the ratio is not allowed to fall below \(1-\epsilon\). So we never encourage a huge increase in probability for a good action (which could overshoot) or a huge decrease for a bad one. In robot control, game AI, and dialogue, PPO is the default policy-gradient choice because it is simple, stable, and effective. ...
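The clipping behavior described above can be sketched in a few lines (a minimal NumPy sketch; the names `ppo_clip_objective`, `ratio`, `adv` are illustrative, not from the chapter):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped)

# Positive advantage: pushing the ratio above 1+eps earns no extra credit.
good = ppo_clip_objective(np.array([1.5]), np.array([2.0]))   # capped at 1.2 * 2.0 = 2.4
# Negative advantage: pushing the ratio below 1-eps earns no extra credit either.
bad = ppo_clip_objective(np.array([0.5]), np.array([-1.0]))   # floored at 0.8 * -1.0 = -0.8
```

Note that the `min` makes the objective pessimistic: once the ratio leaves the clip band in the direction the advantage rewards, the gradient through that sample vanishes, which is exactly how PPO discourages a single update from moving the policy too far.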

March 10, 2026 · 3 min · 540 words · codefrydev

Chapter 44: PPO: Implementation Details

Learning objectives

- Implement Generalized Advantage Estimation (GAE): compute advantage estimates \(\hat{A}_t\) from a trajectory of rewards and value estimates using \(\gamma\) and \(\lambda\).
- Write the recurrence: \(\hat{A}_t = \delta_t + (\gamma\lambda)\,\delta_{t+1} + (\gamma\lambda)^2\,\delta_{t+2} + \cdots\), where \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\).
- Use GAE in a PPO (or actor-critic) pipeline so advantages are fed into the policy loss.

Concept and real-world RL

GAE (Generalized Advantage Estimation) provides a bias–variance trade-off for the advantage: \(\hat{A}_t^{GAE} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}\). When \(\lambda=0\), \(\hat{A}_t = \delta_t\) (1-step TD: low variance, high bias). When \(\lambda=1\), \(\hat{A}_t = G_t - V(s_t)\) (Monte Carlo: high variance, low bias). Tuning \(\lambda\) (e.g. 0.95–0.99) balances the two. In robot control and game AI, GAE is the standard way to compute advantages for PPO and actor-critic; it is implemented with a backward loop over the trajectory. ...
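The backward loop mentioned at the end can be sketched as follows (a minimal sketch assuming a single non-terminating rollout segment, so no done flags; the name `compute_gae` is illustrative):

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Backward recursion A_t = delta_t + gamma * lam * A_{t+1},
    with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    T = len(rewards)
    adv = np.zeros(T)
    next_value = last_value   # bootstrap value for the state after the rollout
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv

adv = compute_gae([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], last_value=0.0)
```

With `lam=0.0` the same function returns the 1-step TD errors \(\delta_t\), matching the low-variance/high-bias end of the trade-off described above.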

March 10, 2026 · 3 min · 482 words · codefrydev

Chapter 45: Coding PPO from Scratch

Learning objectives

- Implement a full PPO agent for LunarLanderContinuous-v2: policy (actor) and value (critic) networks, rollout buffer, GAE for advantages, and multiple epochs of minibatch updates per rollout.
- Tune key hyperparameters (learning rate, clip \(\epsilon\), GAE \(\lambda\), batch size, number of epochs) to achieve successful landings.
- Relate each component (clip, GAE, value loss, entropy bonus) to stability and sample efficiency.

Concept and real-world RL

PPO in practice: collect a rollout of transitions (e.g. 2048 steps), compute GAE advantages, then perform several epochs of minibatch updates on the same data (policy loss with clip + value loss + entropy bonus). The rollout buffer stores states, actions, rewards, log-probs, and values; after each rollout we compute advantages and then iterate over minibatches. LunarLanderContinuous is a 2D landing task with continuous thrust; it is a standard testbed for PPO. In robot control and game AI, this “collect rollout → multiple PPO epochs” loop is the core of most on-policy algorithms. ...
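The “several epochs of minibatch updates on the same rollout” pattern can be sketched as a generator over shuffled minibatches (a sketch only; `ppo_epochs` and the buffer field names are illustrative, and the actual loss computation is omitted):

```python
import numpy as np

def ppo_epochs(data, n_epochs=4, minibatch_size=64, rng=None):
    """Yield shuffled minibatches: n_epochs full passes over one rollout."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(data["obs"])
    for _ in range(n_epochs):
        idx = rng.permutation(n)             # reshuffle each epoch
        for start in range(0, n, minibatch_size):
            mb = idx[start:start + minibatch_size]
            yield {k: v[mb] for k, v in data.items()}

# A 2048-step rollout buffer (8-dim obs, 2-dim action, as in LunarLanderContinuous).
rollout = {"obs": np.zeros((2048, 8)), "actions": np.zeros((2048, 2)),
           "logp": np.zeros(2048), "adv": np.zeros(2048), "returns": np.zeros(2048)}
n_batches = sum(1 for _ in ppo_epochs(rollout))   # 4 epochs * 32 minibatches = 128
```

Each yielded minibatch would feed the clipped policy loss, value loss, and entropy bonus in the full agent.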

March 10, 2026 · 3 min · 532 words · codefrydev

Chapter 48: SAC vs. PPO

Learning objectives

- Run SAC and PPO on the same continuous control tasks (e.g. Hopper, Walker2d).
- Compare final performance, sample efficiency (return vs. env steps), and wall-clock time.
- Discuss when to choose one over the other (sample efficiency, stability, tuning effort, off-policy vs. on-policy).

Concept and real-world RL

SAC is off-policy (replay buffer) and maximizes entropy; PPO is on-policy (rollouts) and uses a clipped objective. SAC often achieves higher sample efficiency (fewer env steps to reach good performance) but can be sensitive to hyperparameters and replay buffer size; PPO is more robust and easier to tune in many settings. In robot control benchmarks (Hopper, Walker2d, HalfCheetah), both are standard; in game AI and RLHF, PPO is more common. The choice depends on data cost (can we afford many env steps?), the need for off-policy learning (e.g. using logged data), and engineering preference. ...

March 10, 2026 · 3 min · 481 words · codefrydev

Chapter 51: Model-Free vs. Model-Based RL

Learning objectives

- Compare model-free (e.g. PPO) and model-based (e.g. Dreamer) RL in terms of sample efficiency on a continuous control task like Walker.
- Explain why model-based methods can achieve more reward per real environment step (use of imagined rollouts).
- Identify trade-offs: model bias, computation, and implementation complexity.

Concept and real-world RL

Model-free methods learn a policy or value function directly from experience; model-based methods learn a dynamics model and use it for planning or imagined rollouts. Model-based RL can be more sample-efficient because each real transition can be reused many times in the model (short rollouts, planning). In robot navigation and trading, where real data is expensive, sample efficiency matters; in game AI, model-based methods (e.g. MuZero) combine learning and planning. The downside is model error (compounding over long rollouts) and extra computation. ...

March 10, 2026 · 3 min · 446 words · codefrydev

Chapter 74: Introduction to Imitation Learning

Learning objectives

- Collect expert demonstrations (state-action pairs or trajectories) from a trained PPO agent on LunarLander.
- Train a behavioral cloning (BC) agent: supervised learning to predict the expert’s action given the state.
- Evaluate the BC policy in the environment and compare its return and behavior to the expert.
- Explain the assumptions of behavioral cloning (i.i.d. states from the expert distribution) and when it works well.
- Relate imitation learning to robot navigation (learning from human demos) and dialogue (learning from human responses).

Concept and real-world RL ...
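The supervised step at the heart of behavioral cloning can be sketched with a linear softmax policy trained by cross-entropy on (state, expert action) pairs (a toy sketch with hypothetical data; `bc_train` and the one-feature "expert" are illustrative, not the LunarLander setup):

```python
import numpy as np

def bc_train(states, expert_actions, n_actions, lr=0.1, epochs=200, seed=0):
    """Behavioral cloning: minimize cross-entropy between a linear softmax
    policy pi(a|s) and the expert's chosen actions."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(states.shape[1], n_actions))
    n = len(states)
    for _ in range(epochs):
        logits = states @ W
        logits -= logits.max(axis=1, keepdims=True)       # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = probs.copy()
        grad[np.arange(n), expert_actions] -= 1.0          # d(CE)/d(logits)
        W -= lr * (states.T @ grad) / n
    return W

# Toy "expert": action 1 when the single feature is positive, else action 0.
states = np.array([[1.0], [2.0], [-1.0], [-2.0]])
actions = np.array([1, 1, 0, 0])
W = bc_train(states, actions, n_actions=2)
pred = np.argmax(states @ W, axis=1)   # matches the expert on these states
```

This only works as well as the i.i.d. assumption in the objectives: the cloned policy is trained on states the expert visits, not the states its own mistakes lead to.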

March 10, 2026 · 3 min · 626 words · codefrydev

Chapter 80: RL from Human Feedback (RLHF) Basics

Learning objectives

- Implement a Bradley-Terry model to learn a reward function from pairwise comparisons of two trajectories (or segments): given (τ^w, τ^l) meaning “τ^w is preferred over τ^l,” fit r so that E[r(τ^w)] > E[r(τ^l)].
- Use the learned reward to train a policy with PPO (or another policy gradient method): maximize expected return under r.
- Explain the RLHF pipeline: collect preferences → train reward model → train policy on reward model.
- Test on a simple environment with simulated preferences (e.g. prefer longer/higher-return trajectories) and verify the policy improves.
- Relate RLHF to dialogue (prefer helpful/harmless responses) and recommendation (prefer engaging content).

Concept and real-world RL ...
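The Bradley-Terry fitting step can be sketched in its simplest form, learning one scalar reward per trajectory by gradient descent on simulated preference pairs (a toy sketch; `bt_loss_and_grad` and the three-trajectory setup are illustrative):

```python
import numpy as np

def bt_loss_and_grad(rw, rl):
    """Bradley-Terry: P(w preferred over l) = sigmoid(r_w - r_l).
    Returns -log sigmoid(r_w - r_l) and its gradient wrt r_w."""
    p = 1.0 / (1.0 + np.exp(-(rw - rl)))
    return -np.log(p), -(1.0 - p)   # gradient wrt r_w (wrt r_l it is +(1-p))

rewards = np.zeros(3)               # one scalar reward per trajectory 0, 1, 2
prefs = [(0, 1), (1, 2), (0, 2)]    # (winner, loser) pairs: 0 > 1 > 2
for _ in range(500):
    for w, l in prefs:
        _, g = bt_loss_and_grad(rewards[w], rewards[l])
        rewards[w] -= 0.1 * g       # g is negative, so the winner's reward rises
        rewards[l] += 0.1 * g       # and the loser's reward falls
# rewards now satisfy rewards[0] > rewards[1] > rewards[2]
```

In the full pipeline the scalar table would be replaced by a parametric reward model, and the learned r would then drive PPO.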

March 10, 2026 · 4 min · 708 words · codefrydev

Chapter 95: Training Large Language Models with PPO

Learning objectives

- Implement a PPO loop to fine-tune a small language model (e.g. GPT-2 small or DistilGPT-2) for text generation with a simple reward (e.g. positive sentiment, or length).
- Include a KL penalty (or KL constraint) so that the updated policy does not deviate too far from the initial (reference) policy, preventing mode collapse and maintaining fluency.
- Generate sequences with the current policy, compute reward for each sequence, and update the policy with PPO (clip + KL).
- Observe that without KL penalty the policy may collapse (e.g. always output the same high-reward token); with KL it stays diverse.
- Relate to dialogue and RLHF: same PPO+KL setup is used for aligning LMs with human preferences.

Concept and real-world RL ...
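One common way to wire in the KL penalty is to fold it into the per-token reward before PPO sees it (a sketch under that assumption; `kl_shaped_reward` and the per-token KL estimate \(\log\pi(a_t) - \log\pi_{ref}(a_t)\) are illustrative choices, not the only ones):

```python
import numpy as np

def kl_shaped_reward(task_reward, logp_policy, logp_ref, beta=0.1):
    """Per-token shaped reward: -beta * KL-penalty per token, plus the
    sequence-level task reward added on the final token."""
    kl_est = logp_policy - logp_ref   # per-token KL estimate for sampled tokens
    shaped = -beta * kl_est
    shaped[-1] += task_reward         # e.g. sentiment score of the full sequence
    return shaped

logp_policy = np.array([-0.5, -0.2, -0.1])  # policy log-probs of sampled tokens
logp_ref    = np.array([-0.6, -0.6, -0.6])  # reference (initial LM) log-probs
r = kl_shaped_reward(task_reward=1.0, logp_policy=logp_policy, logp_ref=logp_ref)
```

Tokens where the policy has drifted far above the reference probability get penalized, which is what keeps the model from collapsing onto a single high-reward output.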

March 10, 2026 · 4 min · 730 words · codefrydev

Chapter 96: Implementing RLHF in NLP

Learning objectives

- Collect (or simulate) human preference data: pairs of model responses to the same prompt, with a label indicating which response is preferred.
- Train a reward model using the Bradley-Terry loss: P(τ^w preferred over τ^l) = σ(r(τ^w) - r(τ^l)), where r is the reward model (e.g. an LM that outputs a scalar, or a separate head).
- Fine-tune the language model with PPO using the learned reward model as the reward (and a KL penalty to the initial LM).
- Evaluate on held-out prompts: generate with the fine-tuned LM and score with the reward model; optionally compare with the initial LM.
- Relate to the dialogue anchor and real RLHF pipelines (InstructGPT, etc.).

Concept and real-world RL ...
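The reward-model training step can be sketched with a linear model over response features standing in for the LM head (a toy sketch; `train_reward_model` and the two-feature responses are hypothetical stand-ins for real embeddings):

```python
import numpy as np

def train_reward_model(feat_w, feat_l, lr=0.5, epochs=300, seed=0):
    """Linear reward model r(x) = w @ phi(x), trained with the Bradley-Terry
    loss -log sigmoid(r(winner) - r(loser)) over preference pairs."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=feat_w.shape[1])
    for _ in range(epochs):
        d = feat_w @ w - feat_l @ w                    # reward margin per pair
        p = 1.0 / (1.0 + np.exp(-d))
        grad = ((p - 1.0)[:, None] * (feat_w - feat_l)).mean(axis=0)
        w -= lr * grad
    return w

# Toy features for (winner, loser) response pairs; feature 0 drives preference.
feat_w = np.array([[1.0, 0.2], [0.8, 0.1]])
feat_l = np.array([[0.1, 0.3], [0.0, 0.2]])
w = train_reward_model(feat_w, feat_l)
# Held-out check: the model assigns higher reward when feature 0 is larger.
assert np.array([1.0, 0.0]) @ w > np.array([0.0, 0.0]) @ w
```

In a real pipeline, phi(x) would be the LM's representation of a full response and w the scalar head; the held-out check above mirrors the evaluation step in the objectives.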

March 10, 2026 · 4 min · 705 words · codefrydev

Chapter 97: Direct Preference Optimization (DPO)

Learning objectives

- Derive the DPO loss from the Bradley-Terry preference model and the optimal policy under a KL constraint to the reference policy (the closed-form mapping from reward to policy in the BT model).
- Implement DPO: train the language model directly on preference data (prefer τ^w over τ^l) using the DPO loss, without training a separate reward model.
- Compare with PPO (reward model + PPO fine-tuning) in terms of preference accuracy, reward model score, and implementation complexity.
- Explain the advantage of DPO: no reward model, no PPO loop; just a supervised loss on preferences.
- Relate DPO to dialogue and RLHF (alternative to reward model + PPO).

Concept and real-world RL ...
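The DPO loss itself is short enough to write out directly, given the policy and reference log-probabilities of the chosen and rejected responses (a minimal sketch; the function name and the example log-prob values are illustrative):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]).
    The log-ratio to the reference policy plays the role of an implicit reward."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# Policy already favors the chosen response relative to the reference -> low loss.
low = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-7.0, ref_logp_l=-7.0)
# Policy favors the rejected response -> higher loss.
high = dpo_loss(logp_w=-9.0, logp_l=-5.0, ref_logp_w=-7.0, ref_logp_l=-7.0)
```

Note there is no reward model and no sampling loop here: the loss is a supervised function of log-probs, which is exactly the implementation-complexity advantage listed above.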

March 10, 2026 · 4 min · 670 words · codefrydev