Problems with standard policy gradients, TRPO, PPO (intuition and implementation), GAE, maximum entropy RL, Soft Actor-Critic (SAC), SAC vs PPO, custom continuous environments, and advanced tuning. Chapters 41–50.
Volume 5: Advanced Policy Optimization
Chapters 41–50 — TRPO, PPO, GAE, PPO implementation, max entropy, SAC, SAC vs PPO, custom envs, hyperparameter tuning.
Large step sizes and policy collapse in a bandit task; visualize action probabilities.
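The collapse can be seen in a few lines of numpy. This is a minimal sketch under stated assumptions (a 3-armed Gaussian bandit, a softmax policy over logits, vanilla REINFORCE updates; arm means and step sizes are illustrative, not from the text):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def run(step_size, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    true_means = np.array([0.2, 0.5, 0.8])   # arm 2 is best in expectation
    logits = np.zeros(3)
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choice(3, p=probs)
        r = rng.normal(true_means[a], 1.0)
        # REINFORCE gradient of log pi(a) for a softmax policy: onehot(a) - probs
        grad = -probs
        grad[a] += 1.0
        logits += step_size * r * grad
    return softmax(logits)

small = run(step_size=0.05)
large = run(step_size=5.0)
```

With the large step size the policy typically saturates to a near-deterministic distribution, often on whichever arm got lucky early; this is the failure mode that motivates the trust-region and clipping ideas in the following chapters.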
TRPO as constrained optimization via the natural gradient; the KL trust-region constraint.
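In its standard form, the constrained problem maximizes the importance-weighted advantage subject to an average-KL trust region of radius δ around the old policy:

```latex
\max_\theta \;
\mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \,
A^{\pi_{\theta_{\text{old}}}}(s, a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s}\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s)
\,\big\|\, \pi_\theta(\cdot \mid s) \right) \right] \le \delta
```

Solving a second-order approximation of this problem yields the natural gradient direction; PPO (next chapter) replaces the explicit constraint with a clipped objective.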
Clipped surrogate objective; contrast with unclipped.
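The contrast with the unclipped objective fits in a few lines of numpy. A minimal sketch (ε = 0.2 and the toy ratios are illustrative assumptions):

```python
import numpy as np

def ppo_surrogates(ratio, advantage, eps=0.2):
    """Return (unclipped, clipped-min) per-sample objectives to be maximized."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # PPO takes the elementwise min, so clipping only ever removes incentive
    return unclipped, np.minimum(unclipped, clipped)

ratio = np.array([0.5, 1.0, 1.5, 3.0])   # pi_new / pi_old per sample
adv = np.array([1.0, 1.0, 1.0, 1.0])     # positive advantages for illustration
unclipped, surrogate = ppo_surrogates(ratio, adv)
# For positive advantage the incentive is capped at ratio = 1 + eps:
# the sample with ratio 3.0 contributes 1.2 instead of 3.0
```

The min with the clipped term is what keeps a single minibatch from pushing the policy far from the data-collecting policy, the collapse mode from the bandit chapter.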
Generalized Advantage Estimation (GAE) function.
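A self-contained numpy version of the GAE recursion (the signature is one reasonable layout, not a prescribed API; `last_value` bootstraps the state after the rollout ends):

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95, dones=None):
    """Generalized Advantage Estimation over one rollout, computed backwards."""
    T = len(rewards)
    dones = np.zeros(T) if dones is None else np.asarray(dones, dtype=float)
    values_ext = np.append(values, last_value)   # V(s_0..s_T)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]             # zero the bootstrap at episode ends
        delta = rewards[t] + gamma * values_ext[t + 1] * nonterminal - values_ext[t]
        running = delta + gamma * lam * nonterminal * running
        adv[t] = running
    returns = adv + np.asarray(values)           # targets for the value function
    return adv, returns

adv, ret = gae(rewards=[1.0, 1.0], values=[0.0, 0.0], last_value=0.0,
               gamma=1.0, lam=1.0)
# with gamma = lam = 1 and zero values this reduces to Monte Carlo returns:
# adv = [2.0, 1.0]
```

Setting λ = 0 recovers the one-step TD advantage; λ = 1 recovers Monte Carlo returns minus the baseline, with the usual bias–variance trade-off in between.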
Full PPO for LunarLanderContinuous with GAE and rollout buffer.
Max-entropy objective; why entropy encourages exploration.
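In its standard form, the maximum-entropy objective augments the return with a temperature-weighted entropy bonus at every state, so the optimal policy stays stochastic wherever actions are near-equivalent:

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
\Big[ \, r(s_t, a_t) + \alpha \, \mathcal{H}\big( \pi(\cdot \mid s_t) \big) \Big]
```

Here α is the temperature trading off reward against entropy; SAC (next chapter) learns α automatically.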
SAC for HalfCheetah with automatic temperature tuning.
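The temperature update can be sketched without a full SAC implementation. This numpy-only sketch takes the gradient of the standard temperature loss J(α) = E[−α (log π + H_target)] analytically rather than via autograd; the batch of log-probs and the learning rate are illustrative assumptions:

```python
import numpy as np

target_entropy = -6.0   # common heuristic: -dim(action space), e.g. HalfCheetah's 6
log_alpha = 0.0         # optimize log(alpha) so alpha stays positive

def alpha_grad(log_pi_batch, log_alpha, target_entropy):
    # d/d(log_alpha) of  E[ -exp(log_alpha) * (log_pi + target_entropy) ]
    return -np.exp(log_alpha) * np.mean(log_pi_batch + target_entropy)

lr = 1e-2
for _ in range(500):
    # pretend the policy's log-probs average -4, i.e. entropy 4 > target -6
    log_pi = np.full(64, -4.0)
    log_alpha -= lr * alpha_grad(log_pi, log_alpha, target_entropy)
alpha = np.exp(log_alpha)
# entropy already above target -> the temperature decays toward zero;
# if entropy fell below target the same update would grow alpha instead
```

The effect is a simple feedback loop: α rises when the policy is too deterministic relative to the entropy target and shrinks when exploration is no longer needed.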
Compare SAC and PPO on Hopper and Walker2d; when to choose which.
Custom 2D point-mass environment with continuous actions; test with SAC.
Weights & Biases sweep for SAC on custom env.
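A hedged sketch of what such a sweep configuration can look like. The parameter names, ranges, and project name are illustrative assumptions, not values from the text; the commented calls use the standard `wandb.sweep` / `wandb.agent` API:

```python
# Sweep configuration for SAC hyperparameters (Bayesian search over a few
# common knobs; all names and ranges below are illustrative assumptions).
sweep_config = {
    "method": "bayes",                    # or "grid" / "random"
    "metric": {"name": "eval/mean_return", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"distribution": "log_uniform_values",
                          "min": 1e-4, "max": 1e-3},
        "tau":        {"values": [0.005, 0.01, 0.02]},   # target-network smoothing
        "batch_size": {"values": [128, 256, 512]},
        "gamma":      {"values": [0.98, 0.99, 0.995]},
    },
}

# import wandb
# sweep_id = wandb.sweep(sweep=sweep_config, project="sac-pointmass")  # hypothetical project name
# wandb.agent(sweep_id, function=train_sac, count=20)  # train_sac: your training entry point
```

Each agent run reads its sampled values from `wandb.config` inside the training function and logs the metric named in `metric` so the sweep can rank trials.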
Review Volume 5 (PPO, TRPO, SAC) and preview Volume 6 (Model-Based RL — learning world models and planning).