Learning objectives

  • Implement a 4×4 gridworld environment (states, actions, transitions, rewards) in code.
  • Implement iterative policy evaluation and stop when values converge.
  • Implement policy iteration (evaluate then improve) and optionally value iteration.

Gridworld in code

States: Use a 4×4 grid. States can be (row, col) or a flat index. Terminal states (0,0) and (3,3) have value 0 and are not updated.

Actions: 0=up, 1=down, 2=left, 3=right. Moving off the grid leaves the agent in place.

Transitions: Deterministic: from (r, c), “up” → (r-1, c) if r > 0 else (r, c). Same for down/left/right with bounds checks.

Rewards: -1 for each step in a non-terminal state. Terminal: no further reward (episode ends).

Example structure (Python):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
def step(s, a):
    r, c = s
    if a == 0: r = max(0, r - 1)
    elif a == 1: r = min(3, r + 1)
    elif a == 2: c = max(0, c - 1)
    else: c = min(3, c + 1)
    s_next = (r, c)
    done = s_next in [(0,0), (3,3)]
    reward = -1.0 if not done else 0.0
    return s_next, reward, done

(Adjust for your indexing and terminal handling.)

Iterative policy evaluation in code

  • Initialize \(V(s) = 0\) for all states.
  • Loop: for each non-terminal state \(s\), compute \(V_{new}(s) = \sum_a \pi(a|s) \bigl[ r(s,a) + \gamma V(s’) \bigr]\). For a uniform random policy, \(\pi(a|s) = 0.25\) for each action; \(s’\) and \(r\) come from your transition. Use synchronous updates: compute all \(V_{new}\) from the current \(V\), then set \(V \leftarrow V_{new}\).
  • Stop when \(\max_s |V_{new}(s) - V(s)| < \theta\) (e.g. \(\theta = 10^{-4}\)).

Policy iteration in code

  • Start with an arbitrary policy (e.g. uniform random).
  • Evaluation: Run iterative policy evaluation until convergence to get \(V^\pi\).
  • Improvement: For each non-terminal state \(s\), set \(\pi(s) = \arg\max_a \bigl[ r(s,a) + \gamma V(s’) \bigr]\) (break ties arbitrarily).
  • If the policy changed, go back to evaluation; else stop. The final policy is optimal for this MDP.

Value iteration in code

Alternatively, use value iteration: repeatedly update \(V(s) \leftarrow \max_a \bigl[ r(s,a) + \gamma \sum_{s’} P(s’|s,a) V(s’) \bigr]\) until convergence, then derive the greedy policy from \(V\). See Chapter 9: Value Iteration.

See Gridworld for the environment description, Chapter 7: Policy Evaluation and Chapter 8: Policy Iteration for theory, and Windy Gridworld for the windy variant.