Learning objectives
- Implement iterative policy evaluation (Bellman expectation updates) for a finite MDP.
- Use a gridworld with terminal states and interpret the resulting value function.
- Decide when to stop iterating (e.g. max change below a threshold).
Concept and real-world RL
Policy evaluation computes \(V^\pi\) for a given policy \(\pi\). Iterative policy evaluation starts from an arbitrary \(V\) (e.g. zeros) and repeatedly applies the Bellman expectation update: \(V(s) \leftarrow \sum_a \pi(a|s) \sum_{s’,r} P(s’,r|s,a)[r + \gamma V(s’)]\). This converges to \(V^\pi\) for finite MDPs. In a gridworld, values spread from terminal states (goal or trap); the result shows “how good” each cell is under the policy. This is the building block for policy iteration (evaluate, then improve the policy).
Illustration (convergence): As we apply the Bellman expectation update sweep after sweep, the maximum change in \(V\) typically decreases. The chart below shows a typical trend: max \(|\Delta V|\) over the first few sweeps until convergence.
Exercise: Implement iterative policy evaluation for the gridworld (4×4, no obstacles, terminal states at top-left and bottom-right, rewards -1 per step). Use a uniform random policy and \(\gamma=1\). Stop when the maximum change is less than 1e-4.
Professor’s hints
- Represent the grid as a 4×4 array; terminal states (0,0) and (3,3) typically have value 0 (no further reward). Do not update \(V\) for terminal states in the loop, or set them to 0 and keep them fixed.
- Uniform random policy: from each non-terminal state, each of the 4 actions (up, down, left, right) has probability 0.25. If an action would go off the grid, the agent stays in place (and gets -1). So from (0,0) you do not update (it’s terminal); from (0,1) “left” stays (0,1) with -1.
- Update: for each non-terminal state \(s\), compute the new value as the Bellman expectation (average over 4 actions of immediate reward + \(\gamma\) × value of next state). Use synchronous updates: compute all new values from the old \(V\), then replace \(V\) with the new values.
- Stop when \(\max_s |V_{new}(s) - V_{old}(s)| < 10^{-4}\).
Common pitfalls
- Updating terminal states: Terminal states should have value 0 (or by definition no update). Do not apply the Bellman update to (0,0) and (3,3); leave them at 0.
- Asynchronous vs synchronous: Synchronous = use the same \(V\) for all updates in one sweep. If you update in place (state by state), later states in the same sweep see already-updated values—that is asynchronous and can change convergence; stick to synchronous for this exercise.
- Wrong transition model: From a corner, two actions might hit the wall (stay, -1) and two might move. Count correctly; the probability of each next state depends on how many actions lead there.
Warm-up: For a 1×3 line (states 0, 1, 2), terminal at 0 and 2 with value 0, reward -1 per step, one action “move left” from 1 to 0, one action “move right” from 1 to 2. Compute \(V(1)\) by hand for \(\gamma=1\). Step 1: From state 1 we take one action and reach either state 0 or state 2 (each terminal with value 0). So we get one step of reward -1 and then 0. Thus \(V(1) = -1 + \gamma \cdot 0 = -1\) (for \(\gamma=1\) or any \(\gamma\): \(V(1) = -1\)). Explanation: One step to either terminal gives immediate reward -1 and then value 0. The same Bellman expectation structure is used in the 4×4 gridworld: each state’s value is immediate reward plus \(\gamma\) times the value of the next state, averaged over the policy.Worked solution (warm-up: 1×3 line)
Extra practice
- Warm-up: For a 1×3 line (states 0, 1, 2), terminal at 0 and 2 with value 0, reward -1 per step, one action “move left” from 1 to 0, one action “move right” from 1 to 2. Compute \(V(1)\) by hand for \(\gamma=1\) (one step to either terminal ⇒ -1).
- Coding: Run policy evaluation (iterative Bellman expectation) on a 4×4 gridworld until convergence. Plot \(V(s)\) as a heatmap for a random policy (e.g. uniform over actions).
- Challenge: After convergence, derive the greedy policy with respect to your \(V\): in each state, which action(s) maximize immediate reward + \(V(s’)\)? Plot or print the greedy policy as arrows. Compare to the optimal policy (shortest path to nearest terminal).