Chapter 8: Dynamic Programming — Policy Iteration

Learning objectives

- Implement policy iteration: alternate policy evaluation and greedy policy improvement.
- Recognize that the policy stabilizes in a finite number of iterations for finite MDPs.
- Compare the resulting policy and value function with those of value iteration.

Concept and real-world RL

Policy iteration alternates two steps: (1) policy evaluation, which computes \(V^\pi\) for the current policy \(\pi\); (2) policy improvement, which updates \(\pi\) to be greedy with respect to \(V^\pi\). The new policy is at least as good as the old (and strictly better unless already optimal). Repeating this process converges to the optimal policy in a finite number of iterations (for finite MDPs). It is a cornerstone of dynamic programming for RL; in practice, we often do only a few evaluation steps (generalized policy iteration) or use value iteration, which interleaves evaluation and improvement in one update. ...
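The two alternating steps can be sketched as follows on a toy MDP. The two-state, two-action transition table below is an assumption made for illustration, not from the chapter; the evaluation/improvement loop itself is the standard tabular algorithm.

```python
import numpy as np

# Hypothetical toy MDP (assumed for this sketch): P[s][a] is a list of
# (probability, next_state, reward) tuples. Action 1 moves to / stays in
# state 1, which pays the higher reward.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
n_states, n_actions, gamma = 2, 2, 0.9

def policy_evaluation(pi, theta=1e-8):
    """Iteratively compute V^pi by sweeping Bellman backups until convergence."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][pi[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def policy_improvement(V):
    """Return the policy that is greedy with respect to V."""
    pi = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in range(n_actions)]
        pi[s] = int(np.argmax(q))
    return pi

def policy_iteration():
    """Alternate evaluation and improvement until the policy stabilizes."""
    pi = np.zeros(n_states, dtype=int)
    while True:
        V = policy_evaluation(pi)
        new_pi = policy_improvement(V)
        if np.array_equal(new_pi, pi):   # policy stable -> optimal (finite MDP)
            return pi, V
        pi = new_pi

pi, V = policy_iteration()
print(pi)  # optimal policy: take action 1 in both states
print(V)   # V(1) = 2 / (1 - 0.9) = 20, V(0) = 1 + 0.9 * 20 = 19
```

Because the toy MDP is deterministic, the hand computation is easy to check: staying in state 1 forever earns a reward of 2 per step, giving \(V(1) = 2/(1-\gamma) = 20\), and the loop terminates as soon as the greedy policy stops changing, which illustrates the finite-iteration guarantee from the text.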

March 10, 2026 · 4 min · 652 words · codefrydev