Chapter 76: Inverse Reinforcement Learning (IRL)
Learning objectives

- Implement maximum entropy IRL: given expert trajectories, learn a reward function such that the expert's policy (approximately) maximizes expected return under that reward.
- Use a linear reward model (e.g. r(s, a) = w^T φ(s, a)) and forward RL (e.g. value iteration or policy gradient) to compute the optimal policy for the current reward.
- Iterate between updating the reward to make the expert look better than alternative policies and re-solving the forward RL problem.
- Explain why IRL can recover a reward that explains the expert's behavior and therefore generalize (e.g. to new states) better than pure behavioral cloning (BC) in some settings.
- Relate IRL to robot navigation (recovering intent from demonstrations) and healthcare (inferring treatment objectives).

Concept and real-world RL ...
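The loop in the objectives above (solve forward RL under the current reward, compare expert and learner feature expectations, update the weights) can be sketched on a toy problem. Everything here is a hypothetical setup for illustration: a 5-state chain MDP, one-hot state features, a scripted "expert" that walks right, and finite-horizon soft value iteration as the forward RL solver, following the maximum entropy IRL gradient (expert feature expectations minus the learner's expected feature counts).

```python
import numpy as np

# Hypothetical toy MDP for illustration: a 5-state chain, actions
# 0 = left / 1 = right (deterministic), episodes of fixed horizon.
N_STATES, N_ACTIONS, HORIZON = 5, 2, 5

def step(s, a):
    """Move left or right, clipped to the ends of the chain."""
    return max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))

# One-hot state features, so r(s) = w^T phi(s) reduces to w[s].
PHI = np.eye(N_STATES)
# Deterministic transition table: P[s, a] = next state.
P = np.array([[step(s, a) for a in range(N_ACTIONS)] for s in range(N_STATES)])

def expert_demos(n=20):
    """Synthetic expert: always moves right from state 0 toward state 4."""
    return [[min(t, N_STATES - 1) for t in range(HORIZON)] for _ in range(n)]

def feature_expectations(trajs):
    """Empirical expected feature counts mu_E over the demonstrations."""
    mu = np.zeros(N_STATES)
    for traj in trajs:
        for s in traj:
            mu += PHI[s]
    return mu / len(trajs)

def soft_policies(r):
    """Finite-horizon soft value iteration: pi_t(a|s) = exp(Q_t - V_t)."""
    V = r.copy()                       # final step: reward only, no action
    pis = []
    for _ in range(HORIZON - 1):
        Q = r[:, None] + V[P]          # Q[s, a] = r(s) + V(next state)
        V = np.log(np.exp(Q).sum(axis=1))   # soft max over actions
        pis.append(np.exp(Q - V[:, None]))
    return pis[::-1]                   # time-ordered, one policy per step

def expected_visits(pis):
    """Forward pass: expected state visitation counts under the policy."""
    D = np.zeros(N_STATES)
    D[0] = 1.0                         # all episodes start in state 0
    visits = D.copy()
    for pi in pis:
        D_next = np.zeros(N_STATES)
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                D_next[P[s, a]] += D[s] * pi[s, a]
        D = D_next
        visits += D
    return visits

# MaxEnt IRL loop: the gradient of the demonstration log-likelihood
# w.r.t. w is (expert feature expectations) - (expected feature counts).
mu_E = feature_expectations(expert_demos())
w = np.zeros(N_STATES)
for _ in range(200):
    visits = expected_visits(soft_policies(PHI @ w))
    w += 0.1 * (mu_E - PHI.T @ visits)

print(np.round(w, 2))   # learned reward should peak at the goal state 4
```

At the fixed point of this update the learner's expected visitation counts match the expert's feature expectations; with one-hot features that forces the reward to be highest at the right end of the chain, which is exactly the intent the demonstrations encode even though the expert never labeled any rewards.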