Learning objectives
- Run SAC and PPO on the same continuous control tasks (e.g. Hopper, Walker2d).
- Compare final performance, sample efficiency (return vs env steps), and wall-clock time.
- Discuss when to choose one over the other (sample efficiency, stability, tuning, off-policy vs on-policy).
Concept and real-world RL
SAC is off-policy (replay buffer) and maximizes entropy; PPO is on-policy (rollouts) and uses a clipped objective. SAC often achieves higher sample efficiency (fewer env steps to reach good performance) but can be sensitive to hyperparameters and replay buffer size; PPO is more robust and easier to tune in many settings. In robot control benchmarks (Hopper, Walker2d, HalfCheetah), both are standard; in game AI and RLHF, PPO is more common. Choice depends on data cost (can we afford many env steps?), need for off-policy (e.g. using logged data), and engineering preference.
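The contrast in objectives can be written out explicitly (standard forms from the SAC and PPO papers; here α is the entropy temperature, ε the clip range, and Â_t an advantage estimate):

```latex
% SAC: maximum-entropy RL objective
J_{\mathrm{SAC}}(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]

% PPO: clipped surrogate objective
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t
  \Big[ \min\big( r_t(\theta)\,\hat{A}_t,\
  \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \big) \Big],
  \quad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

The entropy bonus in SAC encourages exploration and lets it reuse replay-buffer data; PPO's clip keeps each policy update close to the data-collecting policy, which is why it needs fresh on-policy rollouts.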
Where you see this in practice: Benchmarks (e.g. MuJoCo) report both; industry often standardizes on PPO for simplicity or SAC for sample efficiency.
Illustration (SAC vs PPO sample efficiency): For the same task, SAC often reaches a given return in fewer env steps. Plotted as mean return vs env steps, SAC's curve typically rises earlier, while PPO's climbs more gradually but steadily.
Exercise: Run both SAC and PPO on the same set of continuous control tasks (e.g., Hopper, Walker2d). Compare final performance, sample efficiency, and wall-clock time. Discuss when you might choose one over the other.
Professor’s hints
- Same seeds and run length (e.g. 1M steps) for fair comparison. Plot mean return (over last 10 eval episodes) vs steps and vs wall-clock time.
- Sample efficiency: which algorithm reaches a given return (e.g. 2000 for Hopper) in fewer env steps? Wall-clock: which is faster per env step? (SAC typically performs a gradient update from the replay buffer on every env step, which often makes it slower in wall-clock time than PPO, even though PPO runs multiple optimization epochs per rollout.)
- When to choose: PPO when you want simplicity and stability; SAC when sample efficiency matters and you can tune (or use defaults).
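The hints above suggest measuring sample efficiency as steps-to-threshold. A minimal sketch of that metric (the learning curves below are synthetic placeholders; in practice, load them from your training logs):

```python
# Sample-efficiency metric: the first env-step count at which the eval return
# crosses a target threshold. Smaller is more sample-efficient.

def steps_to_threshold(steps, returns, threshold):
    """Return the first env-step count where `returns` >= threshold, else None."""
    for s, r in zip(steps, returns):
        if r >= threshold:
            return s
    return None

# Synthetic learning curves (env steps, mean eval return) for illustration only;
# these are NOT real SAC/PPO results.
eval_steps = [100_000 * k for k in range(1, 11)]
sac_returns = [400, 900, 1500, 2100, 2600, 2900, 3100, 3200, 3250, 3300]
ppo_returns = [200, 500, 900, 1300, 1700, 2000, 2300, 2500, 2700, 2800]

target = 2000  # e.g. a "good" return for Hopper, per the hint above
print(steps_to_threshold(eval_steps, sac_returns, target))  # SAC crosses earlier
print(steps_to_threshold(eval_steps, ppo_returns, target))
```

Reporting this number alongside final return and wall-clock time gives all three axes of the comparison in one table.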
Common pitfalls
- Different observation/action preprocessing: Use the same env wrapper and normalization for both so the comparison is fair.
- Single run: Run multiple seeds (e.g. 3–5) and report mean and std of final return.
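For the multi-seed reporting the second pitfall calls for, a small stdlib sketch of the aggregation (the per-seed final returns are illustrative placeholders):

```python
# Aggregate final returns across seeds: report mean, sample std, and standard
# error, rather than a single-run number.
import statistics

def summarize(final_returns):
    """Mean, sample std (n-1 denominator), and standard error across seeds."""
    mean = statistics.mean(final_returns)
    std = statistics.stdev(final_returns)
    stderr = std / len(final_returns) ** 0.5
    return mean, std, stderr

# One final return per seed (3 seeds each); numbers are made up for illustration.
sac_final = [3300.0, 3100.0, 3500.0]
ppo_final = [2800.0, 2600.0, 3000.0]

print(summarize(sac_final))
print(summarize(ppo_final))
```

Standard error shrinks with more seeds, so 3–5 seeds is a minimum for the error bars on the learning-curve plots to be meaningful.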
Worked solution (warm-up: reporting results)
- Run both algorithms for the same number of env steps on identical seeds, evaluate periodically, and report the mean and std of final return across seeds, e.g. in the form "SAC on Hopper: 3250 ± 180 (5 seeds, 1M steps)" (numbers illustrative). Alongside final return, report steps-to-threshold for sample efficiency and total training time for the wall-clock comparison.
Extra practice
- Warm-up: List one advantage of PPO over SAC and one advantage of SAC over PPO.
- Coding: Run SAC and PPO on Hopper for 500k steps each (3 seeds). Plot learning curves with standard error. Which has higher final return on average?
- Challenge: On a task where PPO is sample-inefficient, try PPO with a larger rollout (e.g. 4096 steps) and more epochs. Does it close the gap with SAC?
- Variant: Run both algorithms on Walker2d instead of Hopper. Does the ranking change? Are there tasks where PPO outperforms SAC in final return (not just stability)?
- Debug: The comparison below uses different observation normalization for SAC and PPO, making the comparison unfair. Explain the bug and how to fix it.
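A minimal stdlib sketch of the kind of setup this debug exercise describes (`ToyEnv` and `NormalizeObs` are hypothetical names, not from a real RL library):

```python
# Sketch of the unfair-comparison bug: only one algorithm's env normalizes
# observations, so the two agents see inputs on different scales.

class ToyEnv:
    """Stand-in environment returning raw, unnormalized observations."""
    def reset(self):
        return [100.0, -50.0]

class NormalizeObs:
    """Observation normalization wrapper (simplified: fixed mean/std stats)."""
    def __init__(self, env, mean, std):
        self.env, self.mean, self.std = env, mean, std
    def reset(self):
        obs = self.env.reset()
        return [(o - m) / s for o, m, s in zip(obs, self.mean, self.std)]

# BUG: only the SAC env is wrapped, so SAC trains on normalized observations
# while PPO trains on raw ones -- any performance gap is confounded.
sac_env = NormalizeObs(ToyEnv(), mean=[100.0, -50.0], std=[10.0, 10.0])
ppo_env = ToyEnv()

print(sac_env.reset())  # normalized scale
print(ppo_env.reset())  # raw scale

# FIX: wrap both envs identically (same wrapper, same statistics).
ppo_env_fixed = NormalizeObs(ToyEnv(), mean=[100.0, -50.0], std=[10.0, 10.0])
```

The same reasoning applies to action scaling, reward clipping, and frame-skip: every wrapper must be identical across the two algorithms for the comparison to be fair.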
- Conceptual: PPO is on-policy and SAC is off-policy. Give one scenario where off-policy learning is especially advantageous (e.g. a setting with expensive data collection).
- Recall: List two scenarios where PPO is typically preferred over SAC (e.g. RLHF, discrete actions) and two where SAC is preferred (e.g. sample efficiency, continuous control).