Chapter 48: SAC vs. PPO
Learning objectives

- Run SAC and PPO on the same continuous control tasks (e.g. Hopper, Walker2d).
- Compare final performance, sample efficiency (return vs. environment steps), and wall-clock time.
- Discuss when to choose one over the other (sample efficiency, stability, tuning effort, off-policy vs. on-policy).

Concept and real-world RL

SAC is off-policy (learns from a replay buffer) and maximizes an entropy-regularized objective; PPO is on-policy (learns from fresh rollouts) and uses a clipped surrogate objective. SAC often achieves higher sample efficiency (fewer environment steps to reach good performance), but it can be sensitive to hyperparameters and replay buffer size; PPO is more robust and easier to tune in many settings. In robot control benchmarks (Hopper, Walker2d, HalfCheetah) both are standard; in game AI and RLHF, PPO is more common. The choice depends on data cost (can we afford many environment steps?), the need for off-policy learning (e.g. using logged data), and engineering preference.

...
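The two objectives contrasted above can be written out explicitly. A standard formulation (using the usual symbols from the SAC and PPO papers, not otherwise defined in this chapter: temperature $\alpha$, policy entropy $\mathcal{H}$, probability ratio $r_t(\theta)$, advantage estimate $\hat{A}_t$, clip range $\epsilon$):

```latex
% SAC: maximum-entropy expected return, with temperature alpha
J_{\text{SAC}}(\pi) =
  \mathbb{E}_{\tau \sim \pi}\Big[\sum_t r(s_t, a_t)
    + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big]

% PPO: clipped surrogate, with ratio
% r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t)
J_{\text{PPO}}(\theta) =
  \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\;
    \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big]
```

The entropy bonus in SAC encourages exploration and is what makes the replay buffer usable (the policy need not match the data-collecting policy), while PPO's clip keeps each update close to the policy that collected the current rollout.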
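The sample-efficiency gap has a simple mechanical source: an off-policy learner can reuse replayed data and take gradient steps every environment step, while an on-policy learner consumes each rollout once. The sketch below is toy accounting only (the function name, default rollout length, epoch and minibatch counts, and warmup threshold are illustrative assumptions, loosely modeled on common SAC/PPO defaults), not a measurement of either algorithm:

```python
def count_gradient_updates(env_steps, *, off_policy, rollout_len=2048,
                           epochs=10, minibatches=32,
                           updates_per_step=1, warmup=1000):
    """Toy count of gradient updates for a given env-step budget.

    Off-policy (SAC-style): after a warmup period, perform
    `updates_per_step` updates per environment step, replaying
    stored transitions.  On-policy (PPO-style): wait for a full
    rollout of `rollout_len` steps, then run `epochs` passes of
    `minibatches` updates over that batch, and discard it.
    """
    if off_policy:
        return max(0, env_steps - warmup) * updates_per_step
    return (env_steps // rollout_len) * epochs * minibatches

# With a 100k-step budget, the off-policy learner performs roughly
# 6x more gradient updates than the on-policy one in this model.
sac_updates = count_gradient_updates(100_000, off_policy=True)   # 99_000
ppo_updates = count_gradient_updates(100_000, off_policy=False)  # 15_360
print(sac_updates, ppo_updates)
```

This also hints at the downside discussed above: the more each transition is reused, the more the result depends on replay buffer size and other off-policy hyperparameters, which is one reason SAC tends to need more careful tuning than PPO.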