Chapter 48: SAC vs. PPO

Learning objectives

- Run SAC and PPO on the same continuous control tasks (e.g. Hopper, Walker2d).
- Compare final performance, sample efficiency (return vs. env steps), and wall-clock time.
- Discuss when to choose one over the other (sample efficiency, stability, tuning, off-policy vs. on-policy).

Concept and real-world RL

SAC is off-policy (replay buffer) and maximizes entropy; PPO is on-policy (rollouts) and uses a clipped objective. SAC often achieves higher sample efficiency (fewer env steps to reach good performance) but can be sensitive to hyperparameters and replay buffer size; PPO is more robust and easier to tune in many settings. In robot control benchmarks (Hopper, Walker2d, HalfCheetah), both are standard; in game AI and RLHF, PPO is more common. The choice depends on data cost (can we afford many env steps?), the need for off-policy learning (e.g. using logged data), and engineering preference. ...
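The off-policy vs. on-policy data-handling difference can be sketched with two minimal buffer classes. This is an illustrative sketch, not any library's actual API: `ReplayBuffer` and `RolloutBuffer` are hypothetical names standing in for SAC-style and PPO-style storage.

```python
import random

class ReplayBuffer:
    """Off-policy storage (SAC-style): transitions collected by any past
    policy stay in the buffer and can be resampled many times."""
    def __init__(self, capacity=10_000):
        self.data, self.capacity = [], capacity

    def add(self, transition):
        if len(self.data) >= self.capacity:
            self.data.pop(0)  # drop the oldest transition
        self.data.append(transition)

    def sample(self, batch_size):
        return random.sample(self.data, batch_size)

class RolloutBuffer:
    """On-policy storage (PPO-style): transitions from the current policy
    are used for one round of updates, then discarded."""
    def __init__(self):
        self.data = []

    def add(self, transition):
        self.data.append(transition)

    def get_and_clear(self):
        batch, self.data = self.data, []
        return batch

# Collect 100 (fake) transitions into both buffers.
replay, rollout = ReplayBuffer(), RolloutBuffer()
for t in range(100):
    replay.add(t)
    rollout.add(t)

print(len(replay.sample(32)))        # -> 32: old data persists across updates
print(len(rollout.get_and_clear()))  # -> 100: consumed in one update round
print(len(rollout.data))             # -> 0: empty until the next rollout
```

The replay buffer is why SAC can be more sample-efficient (each env step is reused in many gradient updates), while the rollout buffer's discard-after-use pattern is why PPO needs more env steps but always updates on data that matches the current policy.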

March 10, 2026 · 3 min · 481 words · codefrydev

Chapter 71: The Offline RL Problem

Learning objectives

- Collect a dataset of transitions (state, action, reward, next_state, done) from a random policy (or a fixed behavior policy) in the Hopper environment.
- Train a standard SAC agent offline (no environment interaction) on this dataset and observe the overestimation of Q-values for out-of-distribution (OOD) actions.
- Explain why naive off-policy methods fail in offline RL: the policy is trained to maximize Q, but Q is only trained on in-distribution actions, so Q can be overestimated for OOD actions.
- Identify the distributional shift between the behavior policy (which collected the data) and the learned policy.
- Relate the offline RL problem to recommendation and healthcare, where data comes from logs or historical trials.

Concept and real-world RL ...
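The OOD overestimation failure can be shown in miniature without Hopper or SAC. The sketch below is an assumption-laden toy (a one-state bandit with an optimistic Q initialization standing in for bootstrapped overestimation), not the chapter's actual experiment: the behavior policy only covers actions 0 and 1, so the greedy policy happily picks an action whose Q-value no data ever corrected.

```python
import random

random.seed(0)

# Toy one-state bandit with 5 actions; true expected rewards are fixed.
# Action 1 is the best action the data actually covers.
true_reward = {0: 0.0, 1: 0.5, 2: -1.0, 3: -1.0, 4: -1.0}

# Behavior policy only ever tries actions 0 and 1 (narrow coverage).
dataset = [(a, true_reward[a] + random.gauss(0, 0.1))
           for a in random.choices([0, 1], k=200)]

# Fit Q by running-mean over observed rewards. Unseen (OOD) actions keep
# their optimistic initial value -- a stand-in for Q-network overestimation.
q = {a: 3.0 for a in range(5)}
counts = {a: 0 for a in range(5)}
for a, r in dataset:
    counts[a] += 1
    q[a] += (r - q[a]) / counts[a]  # first sample overwrites the init

greedy = max(q, key=q.get)
print(greedy)          # -> 2: an action the dataset never contains
print(counts[greedy])  # -> 0: nothing backs up its Q-value
```

In-distribution estimates (actions 0 and 1) converge near their true values, but maximizing Q still selects an OOD action whose value is pure initialization, which is exactly the distributional-shift failure that offline RL methods must constrain.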

March 10, 2026 · 4 min · 723 words · codefrydev