PPO
A practical guide to reading reinforcement learning research papers: structure, notation, and three annotated examples (DQN, PPO, SAC).
10–12 questions on DQN, policy gradients, PPO, experience replay, and target networks. Solutions included.
5 quick questions after Chapters 41–45 of Volume 5. Check you're ready to continue.
Clipped surrogate objective; contrast with unclipped.
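For reference, a minimal sketch of both objectives in PyTorch, assuming per-sample log-probs and advantages as 1-D tensors; all names are illustrative.

```python
import torch

def surrogate_losses(logp_new, logp_old, adv, clip_eps=0.2):
    """PPO's clipped surrogate next to the plain importance-weighted one."""
    ratio = torch.exp(logp_new - logp_old)  # pi_theta(a|s) / pi_theta_old(a|s)
    unclipped = ratio * adv                 # exploitable by very large ratios
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # PPO maximizes the elementwise minimum; negate to get a loss to minimize.
    return -torch.min(unclipped, clipped).mean(), -unclipped.mean()
```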
Generalized Advantage Estimation (GAE) function.
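A sketch of GAE as a backward recursion over one rollout, assuming NumPy arrays, where `values` carries one extra bootstrap entry for the final state.

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Schulman et al., 2016).

    rewards, dones: length T; values: length T + 1 (bootstrap value included);
    dones[t] = 1.0 if the episode terminated at step t.
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    last_adv = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last_adv = delta + gamma * lam * nonterminal * last_adv
        advantages[t] = last_adv
    returns = advantages + values[:-1]  # regression targets for the value net
    return advantages, returns
```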
Full PPO for LunarLanderContinuous with GAE and rollout buffer.
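The rollout buffer is the main plumbing here; one possible layout for continuous actions, with illustrative names (the full exercise wires this to the policy/value networks and the GAE function above).

```python
import numpy as np

class RolloutBuffer:
    """On-policy storage for one PPO batch (continuous actions)."""

    def __init__(self, size, obs_dim, act_dim):
        self.obs = np.zeros((size, obs_dim), dtype=np.float32)
        self.acts = np.zeros((size, act_dim), dtype=np.float32)
        self.rews = np.zeros(size, dtype=np.float32)
        self.dones = np.zeros(size, dtype=np.float32)
        self.logps = np.zeros(size, dtype=np.float32)
        self.vals = np.zeros(size + 1, dtype=np.float32)  # slot [size] holds the bootstrap value
        self.ptr = 0

    def store(self, obs, act, rew, done, logp, val):
        i = self.ptr
        self.obs[i], self.acts[i] = obs, act
        self.rews[i], self.dones[i] = rew, done
        self.logps[i], self.vals[i] = logp, val
        self.ptr += 1

    def is_full(self):
        return self.ptr == len(self.rews)
```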
Compare SAC and PPO on Hopper, Walker2d; when to choose which.
Compare Dreamer and PPO sample efficiency on Walker.
Expert demos from PPO on LunarLander; behavioral cloning.
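A sketch of the cloning step, assuming discrete LunarLander actions, demos stored as (obs, action) tensors, and a policy net that returns logits; names are illustrative.

```python
import torch
import torch.nn.functional as F

def bc_step(policy, optimizer, obs, expert_actions):
    """One behavioral-cloning update: maximize log-likelihood of expert actions.

    expert_actions: LongTensor of action indices from the PPO expert.
    """
    logits = policy(obs)                            # (batch, n_actions)
    loss = F.cross_entropy(logits, expert_actions)  # NLL of the demonstrations
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```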
Bradley-Terry reward model from pairwise comparisons; train policy with PPO.
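Under Bradley-Terry, P(a ≻ b) = σ(r(a) − r(b)), so the reward model is fit by logistic regression on preference pairs; a sketch with an illustrative reward-model interface.

```python
import torch.nn.functional as F

def bradley_terry_loss(reward_model, preferred, rejected):
    """Negative log-likelihood of pairwise preferences:
    P(preferred > rejected) = sigmoid(r(preferred) - r(rejected))."""
    r_pos = reward_model(preferred)  # scalar score per segment/trajectory
    r_neg = reward_model(rejected)
    return -F.logsigmoid(r_pos - r_neg).mean()
```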
PPO fine-tune small LM (e.g. GPT-2) for sentiment; KL penalty.
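A sketch of the usual RLHF-style reward shaping, assuming per-token log-probs of the sampled tokens under the policy and under a frozen reference copy of the LM; names are illustrative.

```python
def kl_shaped_rewards(sentiment_score, logp_policy, logp_ref, beta=0.1):
    """Per-token rewards for PPO: task reward minus a KL penalty to the reference LM.

    sentiment_score: scalar reward for the whole completion.
    logp_policy, logp_ref: 1-D tensors of per-token log-probs of the sampled tokens.
    """
    kl_est = logp_policy - logp_ref  # sample-based per-token KL estimate
    rewards = -beta * kl_est         # dense penalty keeps the LM near the reference
    rewards[-1] = rewards[-1] + sentiment_score  # task reward lands on the final token
    return rewards
```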
Simulated preference data; Bradley-Terry reward model; PPO fine-tune.
DPO loss from Bradley-Terry and KL-optimal policy; compare with PPO.
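The KL-optimal policy satisfies π*(y|x) ∝ π_ref(y|x) exp(r(x,y)/β), so r(x,y) = β log(π*(y|x)/π_ref(y|x)) up to a constant; substituting this implicit reward into Bradley-Terry gives the DPO loss. A sketch, assuming summed response log-probs.

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: Bradley-Terry on the implicit reward beta * log(pi / pi_ref).

    Inputs are summed log-probs of the chosen (w) and rejected (l) responses
    under the trained policy and the frozen reference model.
    """
    implicit_margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * implicit_margin).mean()
```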
PPO on 10 seeds; mean, std; rliable confidence intervals.
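A sketch of the analysis step, following rliable's documented pattern (stratified bootstrap around the interquartile mean); the score array here is a labeled placeholder standing in for one final return per seed.

```python
import numpy as np
from rliable import library as rly, metrics

# Placeholder data: one final evaluation return per seed (10 seeds, 1 task).
final_returns = np.random.default_rng(0).normal(200.0, 30.0, size=(10, 1))

print("mean:", final_returns.mean(), "std:", final_returns.std(ddof=1))

# Interquartile mean with stratified-bootstrap 95% confidence intervals.
iqm = lambda scores: np.array([metrics.aggregate_iqm(scores)])
point, ci = rly.get_interval_estimates({"PPO": final_returns}, iqm, reps=2000)
print("IQM:", point["PPO"], "95% CI:", ci["PPO"])
```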
Review Volume 4 (Policy Gradients, Actor-Critic, DDPG, TD3) and preview Volume 5 (PPO, TRPO, SAC — stable, scalable policy optimization).
Review Volume 5 (PPO, TRPO, SAC) and preview Volume 6 (Model-Based RL — learning world models and planning).