Problems with standard policy gradients, TRPO, PPO (intuition and implementation), GAE, maximum entropy RL, Soft Actor-Critic (SAC), SAC vs PPO, custom continuous environments, and advanced tuning. Chapters 41–50.
Volume 5: Advanced Policy Optimization
Chapters 41–50 — TRPO, PPO, GAE, PPO implementation, max entropy, SAC, SAC vs PPO, custom envs, hyperparameter tuning.
Large step sizes and policy collapse in a bandit task; visualize action probabilities.
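The collapse can be seen in a few lines of numpy. This is a minimal sketch under stated assumptions (a 3-armed Gaussian bandit, a softmax policy over logits, vanilla REINFORCE updates; arm means and step sizes are illustrative, not from the text):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def run(step_size, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    true_means = np.array([0.2, 0.5, 0.8])   # arm 2 is best in expectation
    logits = np.zeros(3)
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choice(3, p=probs)
        r = rng.normal(true_means[a], 1.0)
        # REINFORCE gradient of log pi(a) for a softmax policy: onehot(a) - probs
        grad = -probs
        grad[a] += 1.0
        logits += step_size * r * grad
    return softmax(logits)

small = run(step_size=0.05)
large = run(step_size=5.0)
```

With the large step size the policy typically saturates to a near-deterministic distribution, often on whichever arm got lucky early; this is the failure mode that motivates the trust-region and clipping ideas in the following chapters.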
TRPO as constrained optimization via the natural gradient; the KL trust-region constraint.
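In its standard form, the constrained problem maximizes the importance-weighted advantage subject to an average-KL trust region of radius δ around the old policy:

```latex
\max_\theta \;
\mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \,
A^{\pi_{\theta_{\text{old}}}}(s, a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s}\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s)
\,\big\|\, \pi_\theta(\cdot \mid s) \right) \right] \le \delta
```

Solving a second-order approximation of this problem yields the natural gradient direction; PPO (next chapter) replaces the explicit constraint with a clipped objective.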
Clipped surrogate objective; contrast with unclipped.
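The contrast with the unclipped objective fits in a few lines of numpy. A minimal sketch (ε = 0.2 and the toy ratios are illustrative assumptions):

```python
import numpy as np

def ppo_surrogates(ratio, advantage, eps=0.2):
    """Return (unclipped, clipped-min) per-sample objectives to be maximized."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # PPO takes the elementwise min, so clipping only ever removes incentive
    return unclipped, np.minimum(unclipped, clipped)

ratio = np.array([0.5, 1.0, 1.5, 3.0])   # pi_new / pi_old per sample
adv = np.array([1.0, 1.0, 1.0, 1.0])     # positive advantages for illustration
unclipped, surrogate = ppo_surrogates(ratio, adv)
# For positive advantage the incentive is capped at ratio = 1 + eps:
# the sample with ratio 3.0 contributes 1.2 instead of 3.0
```

The min with the clipped term is what keeps a single minibatch from pushing the policy far from the data-collecting policy, the collapse mode from the bandit chapter.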
Generalized Advantage Estimation (GAE) function.
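A self-contained numpy version of the GAE recursion (the signature is one reasonable layout, not a prescribed API; `last_value` bootstraps the state after the rollout ends):

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95, dones=None):
    """Generalized Advantage Estimation over one rollout, computed backwards."""
    T = len(rewards)
    dones = np.zeros(T) if dones is None else np.asarray(dones, dtype=float)
    values_ext = np.append(values, last_value)   # V(s_0..s_T)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]             # zero the bootstrap at episode ends
        delta = rewards[t] + gamma * values_ext[t + 1] * nonterminal - values_ext[t]
        running = delta + gamma * lam * nonterminal * running
        adv[t] = running
    returns = adv + np.asarray(values)           # targets for the value function
    return adv, returns

adv, ret = gae(rewards=[1.0, 1.0], values=[0.0, 0.0], last_value=0.0,
               gamma=1.0, lam=1.0)
# with gamma = lam = 1 and zero values this reduces to Monte Carlo returns:
# adv = [2.0, 1.0]
```

Setting λ = 0 recovers the one-step TD advantage; λ = 1 recovers Monte Carlo returns minus the baseline, with the usual bias–variance trade-off in between.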
Full PPO for LunarLanderContinuous with GAE and rollout buffer.
Max-entropy objective; why entropy encourages exploration.
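In its standard form, the maximum-entropy objective augments the return with a temperature-weighted entropy bonus at every state, so the optimal policy stays stochastic wherever actions are near-equivalent:

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
\Big[ \, r(s_t, a_t) + \alpha \, \mathcal{H}\big( \pi(\cdot \mid s_t) \big) \Big]
```

Here α is the temperature trading off reward against entropy; SAC (next chapter) learns α automatically.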
SAC for HalfCheetah with automatic temperature tuning.
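The temperature update can be sketched without a full SAC implementation. This numpy-only sketch takes the gradient of the standard temperature loss J(α) = E[−α (log π + H_target)] analytically rather than via autograd; the batch of log-probs and the learning rate are illustrative assumptions:

```python
import numpy as np

target_entropy = -6.0   # common heuristic: -dim(action space), e.g. HalfCheetah's 6
log_alpha = 0.0         # optimize log(alpha) so alpha stays positive

def alpha_grad(log_pi_batch, log_alpha, target_entropy):
    # d/d(log_alpha) of  E[ -exp(log_alpha) * (log_pi + target_entropy) ]
    return -np.exp(log_alpha) * np.mean(log_pi_batch + target_entropy)

lr = 1e-2
for _ in range(500):
    # pretend the policy's log-probs average -4, i.e. entropy 4 > target -6
    log_pi = np.full(64, -4.0)
    log_alpha -= lr * alpha_grad(log_pi, log_alpha, target_entropy)
alpha = np.exp(log_alpha)
# entropy already above target -> the temperature decays toward zero;
# if entropy fell below target the same update would grow alpha instead
```

The effect is a simple feedback loop: α rises when the policy is too deterministic relative to the entropy target and shrinks when exploration is no longer needed.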
Compare SAC and PPO on Hopper and Walker2d; when to choose which.
Custom 2D point-mass environment with continuous actions; test with SAC.
Weights & Biases sweep for SAC on custom env.
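A hedged sketch of what such a sweep configuration can look like. The parameter names, ranges, and project name are illustrative assumptions, not values from the text; the commented calls use the standard `wandb.sweep` / `wandb.agent` API:

```python
# Sweep configuration for SAC hyperparameters (Bayesian search over a few
# common knobs; all names and ranges below are illustrative assumptions).
sweep_config = {
    "method": "bayes",                    # or "grid" / "random"
    "metric": {"name": "eval/mean_return", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"distribution": "log_uniform_values",
                          "min": 1e-4, "max": 1e-3},
        "tau":        {"values": [0.005, 0.01, 0.02]},   # target-network smoothing
        "batch_size": {"values": [128, 256, 512]},
        "gamma":      {"values": [0.98, 0.99, 0.995]},
    },
}

# import wandb
# sweep_id = wandb.sweep(sweep=sweep_config, project="sac-pointmass")  # hypothetical project name
# wandb.agent(sweep_id, function=train_sac, count=20)  # train_sac: your training entry point
```

Each agent run reads its sampled values from `wandb.config` inside the training function and logs the metric named in `metric` so the sweep can rank trials.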
Review Volume 5 (PPO, TRPO, SAC) and preview Volume 6 (Model-Based RL — learning world models and planning).