Chapter 9: Dynamic Programming — Value Iteration
Learning objectives

- Implement value iteration: repeatedly apply the Bellman optimality update for \(V\).
- Extract the optimal policy as the greedy policy with respect to the converged \(V\).
- Relate value iteration to policy iteration (one sweep of "improvement" per state, no full evaluation loop).

Concept and real-world RL

Value iteration updates the state-value function using the Bellman optimality equation:

\[
V(s) \leftarrow \max_a \sum_{s', r} P(s', r \mid s, a)\,[r + \gamma V(s')]
\]

It does not maintain an explicit policy; after convergence, the optimal policy is greedy with respect to \(V\). Value iteration is simpler than full policy iteration (no inner evaluation loop) and converges to \(V^*\). It is used in planning when the model is known; in large or continuous state spaces, we approximate \(V\) or \(Q\) with function approximators and use approximate dynamic programming or model-free methods.
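The update above can be sketched in tabular form. The following is a minimal illustration, not the chapter's reference implementation: it assumes the model is given as a transition array `P[s, a, s']` and an expected-reward array `R[s, a]` (names chosen here for illustration), sweeps the Bellman optimality backup until the value function stops changing, and then extracts the greedy policy.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Tabular value iteration on a known MDP.

    P: transition probabilities, shape (S, A, S); P[s, a, s2] = Pr(s2 | s, a)
    R: expected immediate rewards, shape (S, A)
    Returns the converged value function V and the greedy policy.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * (P @ V)        # shape (S, A)
        V_new = Q.max(axis=1)          # max over actions, no explicit policy kept
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    # After convergence, the optimal policy is greedy with respect to V
    policy = Q.argmax(axis=1)
    return V_new, policy

# Toy 2-state MDP (illustrative): in state 0, action 1 moves to the
# absorbing state 1 with reward +1; everything else gives reward 0.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0   # state 0, action 0: stay
P[0, 1, 1] = 1.0   # state 0, action 1: move to state 1
P[1, 0, 1] = 1.0   # state 1 is absorbing under both actions
P[1, 1, 1] = 1.0
R = np.array([[0.0, 1.0],
              [0.0, 0.0]])

V, policy = value_iteration(P, R, gamma=0.9)
print(V)       # -> [1. 0.]
print(policy)  # -> [1 0]  (greedy: move to state 1 from state 0)
```

Note that, unlike policy iteration, there is no inner evaluation loop: each sweep takes one max over actions per state, and the policy is read off only once at the end.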