Chapter 43: Proximal Policy Optimization (PPO): Intuition
Learning objectives

- Explain in your own words how the clipped surrogate objective in PPO prevents overly large policy updates without solving a constrained optimization problem (unlike TRPO).
- Write the clipped loss \(L^{CLIP}(\theta)\) and the unclipped (ratio-based) objective, and contrast when the two differ.
- Relate the clip range \(\epsilon\) (e.g. 0.2) to how much the policy can change in one update.

Concept and real-world RL

PPO (Proximal Policy Optimization) keeps each policy update conservative by clipping the probability ratio \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\). The objective is \(L^{CLIP}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta) \hat{A}_t, \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]\). If the advantage \(\hat{A}_t\) is positive, the objective stops rewarding ratio increases beyond \(1+\epsilon\); if it is negative, it stops rewarding ratio decreases below \(1-\epsilon\). Taking the minimum of the clipped and unclipped terms makes the objective a pessimistic bound, so the update never encourages a huge increase in probability for a good action (which could overshoot) or a huge decrease for a bad one. In robot control, game AI, and dialogue systems, PPO is the default policy-gradient choice because it is simple, stable, and effective. ...
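The clipped term is easy to compute per timestep. Below is a minimal sketch in plain Python; the function name and the example ratio/advantage values are illustrative, not from a particular library:

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-timestep clipped surrogate term L^CLIP (to be maximized).

    ratio:     r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    advantage: advantage estimate A_hat_t
    eps:       clip range epsilon (0.2 is a common default)
    """
    # Clamp the ratio into [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # min() of the unclipped and clipped terms is a pessimistic bound:
    # pushing the ratio outside the clip range cannot improve the objective.
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the ratio earns nothing beyond 1 + eps.
print(ppo_clip_term(1.0, 2.0))  # unclipped: 1.0 * 2.0 = 2.0
print(ppo_clip_term(1.5, 2.0))  # clipped at 1.2 * 2.0 = 2.4, not 3.0

# Negative advantage: shrinking the ratio below 1 - eps earns nothing.
print(ppo_clip_term(0.5, -2.0))  # clipped at 0.8 * -2.0 = -1.6, not -1.0
```

Note that for ratios inside \([1-\epsilon, 1+\epsilon]\) the clipped and unclipped terms coincide, so the objective behaves like the ordinary policy-gradient surrogate; the clip only matters once an update has already moved the policy that far.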