Chapter 46: Maximum Entropy RL

Learning objectives

- Derive or state the maximum entropy objective: maximize \(\mathbb{E}\big[ \sum_t r_t + \alpha \mathcal{H}(\pi(\cdot \mid s_t)) \big]\) (or an equivalent form), where \(\mathcal{H}\) denotes entropy.
- Explain how the entropy term encourages exploration: higher entropy means a more uniform action distribution, so the policy tries more actions.
- Contrast with standard expected-return maximization (no entropy bonus).

Concept and real-world RL

Maximum entropy RL adds an entropy bonus to the objective, so the agent maximizes return plus policy entropy. The optimal policy under this objective is more stochastic (it explores more) and is often easier to learn (it captures multiple modes and is more robust). In robot control, SAC (Soft Actor-Critic) uses this idea with automatic temperature tuning; in game AI and recommendation, entropy regularization (e.g. in PPO) prevents the policy from becoming too deterministic too quickly. The temperature \(\alpha\) (or an equivalent coefficient) controls the trade-off between return and entropy. ...
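The objective above can be made concrete with a tiny numeric sketch: the code below (illustrative only; the policies and rewards are made up) computes the entropy-augmented return \(\sum_t r_t + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\) for a discrete policy, showing that a uniform policy collects a larger entropy bonus than a near-deterministic one.

```python
import numpy as np

def entropy(probs):
    # Shannon entropy H(pi) = -sum_a pi(a) log pi(a) of a discrete action distribution
    probs = np.asarray(probs)
    return -np.sum(probs * np.log(probs + 1e-12))

def soft_return(rewards, action_probs, alpha=0.2):
    # Maximum-entropy objective along one trajectory: sum_t r_t + alpha * H(pi(.|s_t))
    return sum(r + alpha * entropy(p) for r, p in zip(rewards, action_probs))

uniform = [0.25] * 4                    # maximally exploratory policy
greedy = [0.97, 0.01, 0.01, 0.01]       # near-deterministic policy
rewards = [1.0, 1.0]                    # same environment return for both

print(soft_return(rewards, [uniform, uniform]))  # larger: entropy bonus at each step
print(soft_return(rewards, [greedy, greedy]))
```

With \(\alpha = 0\) the two trajectories score identically; raising \(\alpha\) increasingly favors the stochastic policy, which is exactly the exploration pressure the entropy term provides.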

March 10, 2026 · 3 min · 500 words · codefrydev

Chapter 76: Inverse Reinforcement Learning (IRL)

Learning objectives

- Implement maximum entropy IRL: given expert trajectories, learn a reward function under which the expert's policy (approximately) maximizes expected return.
- Use a linear reward model (e.g. \(r(s, a) = w^\top \varphi(s, a)\)) and forward RL (e.g. value iteration or policy gradient) to compute the optimal policy for the current reward.
- Iterate between updating the reward to make the expert look better than other policies and solving the forward RL problem.
- Explain why IRL can recover a reward that explains the expert's behavior and can then generalize (e.g. to new states) better than pure behavior cloning (BC) in some settings.
- Relate IRL to robot navigation (recovering intent from demonstrations) and healthcare (inferring treatment objectives).

Concept and real-world RL ...

March 10, 2026 · 4 min · 762 words · codefrydev