Chapter 47: Soft Actor-Critic (SAC)
Learning objectives
Implement SAC (Soft Actor-Critic) for HalfCheetah: two Q-networks (taking the minimum for the target), a policy that maximizes \(Q - \alpha \log \pi\), and automatic temperature tuning so that \(\alpha\) drives the policy entropy toward a desired target. Train it and compare sample efficiency with PPO (same environment, same or similar compute).

Concept and real-world RL
SAC combines maximum-entropy RL with the actor-critic framework: the critic learns two Q-functions and takes the minimum of them when forming the target, which reduces overestimation bias; the actor maximizes \(\mathbb{E}[\, Q(s,a) - \alpha \log \pi(a|s) \,]\); and \(\alpha\) is updated to keep the policy entropy near a target, e.g. \(-\dim(\mathcal{A})\), the negative of the action dimension. Because SAC is off-policy (it learns from a replay buffer), it is often more sample-efficient than PPO on continuous control. In robot control (HalfCheetah, Hopper, Walker), SAC is a standard baseline; in recommendation and trading, off-policy maximum-entropy methods can improve exploration and stability. ...