Chapter 7: Dynamic Programming — Policy Evaluation
Learning objectives

- Implement iterative policy evaluation (Bellman expectation updates) for a finite MDP.
- Use a gridworld with terminal states and interpret the resulting value function.
- Decide when to stop iterating (e.g. when the maximum change falls below a threshold).

Concept and real-world RL

Policy evaluation computes \(V^\pi\) for a given policy \(\pi\). Iterative policy evaluation starts from an arbitrary \(V\) (e.g. all zeros) and repeatedly applies the Bellman expectation update:

\(V(s) \leftarrow \sum_a \pi(a|s) \sum_{s',r} P(s',r|s,a)\,[r + \gamma V(s')]\)

This converges to \(V^\pi\) for finite MDPs. In a gridworld, values spread outward from the terminal states (goal or trap); the result shows "how good" each cell is under the policy. This is the building block for policy iteration (evaluate, then improve the policy). ...
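The update above can be sketched as a short program. This is a minimal illustration, not code from the chapter: it assumes a 4x4 gridworld with terminal states in two opposite corners, an equiprobable random policy, a reward of -1 per move, deterministic transitions, and moves off the grid leaving the agent in place. It uses in-place (Gauss-Seidel style) sweeps, which converge to the same \(V^\pi\) as synchronous updates.

```python
import numpy as np

def policy_evaluation(n=4, gamma=1.0, theta=1e-6):
    """Iterative policy evaluation on an n x n gridworld.

    Assumed setup (illustrative, not from the text): equiprobable
    random policy, reward -1 per move, terminal states at the two
    opposite corners, off-grid moves leave the agent in place.
    """
    terminals = {(0, 0), (n - 1, n - 1)}
    V = np.zeros((n, n))                       # arbitrary start: all zeros
    actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    while True:
        delta = 0.0
        for i in range(n):
            for j in range(n):
                if (i, j) in terminals:
                    continue                   # terminal values stay 0
                v_new = 0.0
                for di, dj in actions:
                    ni, nj = i + di, j + dj
                    if not (0 <= ni < n and 0 <= nj < n):
                        ni, nj = i, j          # bump into wall: stay put
                    # Bellman expectation update: pi(a|s) = 1/4,
                    # deterministic transition, reward -1 per step
                    v_new += 0.25 * (-1 + gamma * V[ni, nj])
                delta = max(delta, abs(v_new - V[i, j]))
                V[i, j] = v_new                # in-place update
        if delta < theta:                      # stop: max change below threshold
            break
    return V

V = policy_evaluation()
print(np.round(V, 1))
```

Values near a terminal state (e.g. the cells adjacent to a corner) end up least negative, and values far from both terminals most negative, showing how value information spreads from the terminal states across the grid.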