Chapter 46: Maximum Entropy RL

Learning objectives

- Derive or state the maximum entropy objective: maximize \(\mathbb{E}\big[\sum_t r_t + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\big]\) (or an equivalent form), where \(\mathcal{H}\) denotes entropy.
- Explain how the entropy term encourages exploration: higher entropy means a more uniform action distribution, so the policy tries more actions.
- Contrast with standard expected-return maximization (no entropy bonus).

Concept and real-world RL

Maximum entropy RL adds an entropy bonus to the objective, so the agent maximizes return plus policy entropy. The optimal policy under this objective is stochastic (it explores more) and is often easier to learn: it can keep multiple good modes and tends to be more robust. In robot control, SAC (Soft Actor-Critic) uses this idea with automatic temperature tuning; in game AI and recommendation, entropy regularization (e.g., in PPO) prevents the policy from becoming too deterministic too quickly. The temperature \(\alpha\) (or an equivalent coefficient) controls the trade-off between return and entropy. ...
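To make the trade-off concrete, here is a minimal sketch for the one-step (bandit) case, where the entropy-regularized objective \(\mathbb{E}_\pi[r] + \alpha \mathcal{H}(\pi)\) has the closed-form optimum \(\pi^*(a) \propto \exp(r(a)/\alpha)\), i.e. a softmax over rewards scaled by the temperature. The reward values are made-up illustrations, not from the chapter:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i log p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def maxent_policy(rewards, alpha):
    # Closed-form maximizer of E_pi[r] + alpha * H(pi): softmax(r / alpha).
    return softmax([r / alpha for r in rewards])

# Hypothetical one-step rewards for three actions.
rewards = [1.0, 0.9, 0.2]

for alpha in (0.05, 0.5, 5.0):
    pi = maxent_policy(rewards, alpha)
    obj = sum(p * r for p, r in zip(pi, rewards)) + alpha * entropy(pi)
    print(f"alpha={alpha}: pi={[round(p, 3) for p in pi]}, "
          f"H={entropy(pi):.3f}, objective={obj:.3f}")
```

As \(\alpha \to 0\) the policy collapses toward the greedy action; as \(\alpha\) grows it approaches uniform, which is exactly the exploration pressure described above.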

March 10, 2026 · 3 min · 500 words · codefrydev