Chapter 42: Trust Region Policy Optimization (TRPO)

Learning objectives

- Read and summarize the TRPO paper: the constrained optimization problem (maximize expected advantage subject to a KL constraint between the old and new policy).
- Explain why the natural gradient (using the Fisher information matrix) approximates the KL-constrained step.
- Relate the KL constraint to preventing overly large policy updates (connection to Chapter 41).

Concept and real-world RL

TRPO (Trust Region Policy Optimization) limits each policy update so that the new policy stays close to the old one in the sense of KL divergence:

maximize \(\mathbb{E}\left[ \frac{\pi(a|s)}{\pi_{old}(a|s)} A^{\pi_{old}}(s,a) \right]\) subject to \(\mathbb{E}\left[ D_{KL}(\pi_{old}(\cdot|s) \,\|\, \pi(\cdot|s)) \right] \leq \delta\).

This prevents the collapse and instability seen in vanilla policy gradients, where a single too-large step can sharply degrade the policy and all subsequent data is then collected by that degraded policy. The natural gradient (preconditioning the gradient with the inverse Fisher information matrix) gives an approximate solution to this constrained problem: the Fisher matrix is the second-order approximation of the KL divergence at the old policy, so following the natural gradient moves in the steepest-ascent direction measured in KL rather than in raw parameter space. In robot control and safety-critical settings, TRPO's monotonic improvement guarantee (under assumptions) is appealing; in practice, PPO is often preferred for its simpler implementation (a clipped objective instead of constrained optimization). ...
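The natural-gradient step can be sketched numerically. Below is a minimal illustration (not the full TRPO algorithm): conjugate gradient solves \(F s = g\) using only Fisher-vector products, and the step is rescaled so the quadratic KL model \(\frac{1}{2} s^\top F s\) hits the trust-region radius \(\delta\). The Fisher matrix and gradient here are synthetic stand-ins; in a real implementation both would be estimated from policy samples, and a line search with the exact KL would follow.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Solve F x = g using only Fisher-vector products fvp(v) = F v."""
    x = np.zeros_like(g)
    r = g.copy()              # residual g - F x (x = 0 initially)
    p = r.copy()
    rdotr = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rdotr / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        new_rdotr = r @ r
        if new_rdotr < tol:
            break
        p = r + (new_rdotr / rdotr) * p
        rdotr = new_rdotr
    return x

def trpo_step(fvp, g, delta=0.01):
    """Natural-gradient direction, scaled so that the quadratic KL
    approximation (1/2) s^T F s equals the trust-region radius delta."""
    s = conjugate_gradient(fvp, g)    # s ~= F^{-1} g (natural gradient)
    sFs = s @ fvp(s)
    return np.sqrt(2.0 * delta / sFs) * s

# Toy example: a fixed symmetric positive-definite "Fisher" matrix
# (an assumption for illustration only).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
F = A @ A.T + 5 * np.eye(5)
g = rng.standard_normal(5)            # stand-in policy-gradient estimate

step = trpo_step(lambda v: F @ v, g, delta=0.01)
kl_quad = 0.5 * step @ F @ step
print(round(kl_quad, 6))              # the quadratic KL model equals delta
```

Note the design point this makes concrete: the step size is not a fixed learning rate; it adapts to local curvature so the (approximate) KL between old and new policy is exactly the budget \(\delta\).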

March 10, 2026 · 3 min · 551 words · codefrydev