Chapter 8: Dynamic Programming — Policy Iteration

Learning objectives

- Implement policy iteration: alternate policy evaluation and greedy policy improvement.
- Recognize that the policy stabilizes in a finite number of iterations for finite MDPs.
- Compare the resulting policy and value function with value iteration.

Concept and real-world RL

Policy iteration alternates two steps: (1) policy evaluation, computing \(V^\pi\) for the current policy \(\pi\); (2) policy improvement, updating \(\pi\) to be greedy with respect to \(V^\pi\). The new policy is at least as good as the old one (and strictly better unless the old one is already optimal). Repeating this process converges to the optimal policy in a finite number of iterations for finite MDPs. Policy iteration is a cornerstone of dynamic programming for RL; in practice, we often run only a few evaluation sweeps (generalized policy iteration) or use value iteration, which interleaves evaluation and improvement in a single update. ...
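The evaluate-then-improve loop can be sketched on a hypothetical toy MDP (not from the chapter): four states in a row, actions left/right, a terminal state at the right end, and a reward of -1 per step. The names `step`, `evaluate`, and `policy_iteration` are illustrative choices, not an established API.

```python
# Hypothetical toy MDP: states 0..3 in a row, actions 0=left, 1=right,
# state 3 is terminal; every non-terminal step costs -1. Deterministic model.
N_STATES, ACTIONS, GAMMA, TERMINAL = 4, (0, 1), 0.9, 3

def step(s, a):
    """Return (next_state, reward) under the deterministic toy model."""
    if s == TERMINAL:
        return s, 0.0
    ns = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return ns, -1.0

def evaluate(policy, theta=1e-8):
    """Iterative policy evaluation: sweep until the largest update < theta."""
    V = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES):
            ns, r = step(s, policy[s])
            v = 0.0 if s == TERMINAL else r + GAMMA * V[ns]
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def policy_iteration():
    policy = [0] * N_STATES                  # start with "always left"
    while True:
        V = evaluate(policy)                 # (1) policy evaluation
        stable = True
        for s in range(N_STATES):            # (2) greedy policy improvement
            best = max(ACTIONS,
                       key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:                           # unchanged policy => optimal
            return policy, V
```

On this toy chain the loop settles on "move right" in every non-terminal state, with \(V\) values of \(-1\), \(-1.9\), and \(-2.71\) as you move away from the terminal state (each step adds \(-1\) plus the discounted successor value).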

March 10, 2026 · 4 min · 652 words · codefrydev

Chapter 9: Dynamic Programming — Value Iteration

Learning objectives

- Implement value iteration: repeatedly apply the Bellman optimality update for \(V\).
- Extract the optimal policy as greedy with respect to the converged \(V\).
- Relate value iteration to policy iteration (one sweep of "improvement" per state, with no full evaluation loop).

Concept and real-world RL

Value iteration updates the state-value function using the Bellman optimality equation: \(V(s) \leftarrow \max_a \sum_{s',r} P(s',r \mid s,a)\,[r + \gamma V(s')]\). It does not maintain an explicit policy; after convergence, the optimal policy is greedy with respect to \(V\). Value iteration is simpler than full policy iteration (there is no inner evaluation loop) and converges to \(V^*\). It is used in planning when the model is known; in large or continuous state spaces we approximate \(V\) or \(Q\) with function approximators and use approximate dynamic programming or model-free methods. ...
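The Bellman optimality update plus greedy extraction can be sketched on a hypothetical toy MDP (an illustrative assumption, not from the chapter): four states in a row, actions left/right, terminal state at the right end, reward -1 per step. The names `step` and `value_iteration` are made up for this sketch.

```python
# Hypothetical toy MDP: states 0..3 in a row, actions 0=left, 1=right,
# state 3 is terminal; every non-terminal step costs -1. Deterministic model.
N_STATES, ACTIONS, GAMMA, TERMINAL = 4, (0, 1), 0.9, 3

def step(s, a):
    """Return (next_state, reward) under the deterministic toy model."""
    if s == TERMINAL:
        return s, 0.0
    ns = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return ns, -1.0

def value_iteration(theta=1e-8):
    """Sweep the Bellman optimality update until the largest change < theta,
    then extract the greedy policy from the converged V."""
    V = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES):
            if s == TERMINAL:
                continue
            # V(s) <- max_a [ r + gamma * V(s') ] (deterministic model,
            # so the sum over s', r collapses to a single term)
            v = max(step(s, a)[1] + GAMMA * V[step(s, a)[0]] for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    # optimal policy: greedy with respect to the converged V
    policy = [max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
              for s in range(N_STATES)]
    return V, policy
```

Note there is no inner evaluation loop: each sweep applies one max-backup per state, and the policy only appears at the end, which is the structural difference from full policy iteration.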

March 10, 2026 · 3 min · 624 words · codefrydev