Chapter 7: Dynamic Programming — Policy Evaluation

Learning objectives

- Implement iterative policy evaluation (Bellman expectation updates) for a finite MDP.
- Use a gridworld with terminal states and interpret the resulting value function.
- Decide when to stop iterating (e.g. when the maximum value change falls below a threshold).

Concept and real-world RL

Policy evaluation computes \(V^\pi\) for a given policy \(\pi\). Iterative policy evaluation starts from an arbitrary \(V\) (e.g. all zeros) and repeatedly applies the Bellman expectation update: \(V(s) \leftarrow \sum_a \pi(a|s) \sum_{s',r} P(s',r|s,a)\,[r + \gamma V(s')]\). This converges to \(V^\pi\) for finite MDPs. In a gridworld, values spread outward from the terminal states (goal or trap); the result shows “how good” each cell is under the policy. This is the building block for policy iteration (evaluate, then improve the policy). ...
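The update above can be sketched in a few lines of Python. The 4×4 grid, terminals at (0,0) and (3,3), reward of -1 per step, uniform random policy, and stopping threshold are illustrative assumptions (the classic Sutton & Barto example), not code from the post:

```python
# Iterative policy evaluation on a 4x4 gridworld: terminals at (0,0) and (3,3),
# reward -1 per step, equiprobable random policy pi(a|s) = 1/4.
N = 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
TERMINALS = {(0, 0), (N - 1, N - 1)}

def step(s, a):
    # Deterministic move; stepping off the grid leaves the agent in place.
    r = min(max(s[0] + a[0], 0), N - 1)
    c = min(max(s[1] + a[1], 0), N - 1)
    return (r, c)

def policy_evaluation(gamma=1.0, theta=1e-6):
    V = {(r, c): 0.0 for r in range(N) for c in range(N)}  # start from zeros
    while True:
        delta = 0.0
        for s in V:
            if s in TERMINALS:
                continue  # terminal values stay 0 and are never updated
            # Bellman expectation update under the uniform random policy
            v = sum(0.25 * (-1.0 + gamma * V[step(s, a)]) for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
            V[s] = v  # in-place (Gauss-Seidel style) sweep
        if delta < theta:  # stop when the largest update is below the threshold
            return V
```

Under this setup the values converge to the well-known result for the random policy (e.g. about -14 next to a terminal corner and -22 in the far corners).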

March 10, 2026 · 4 min · 703 words · codefrydev

Chapter 8: Dynamic Programming — Policy Iteration

Learning objectives

- Implement policy iteration: alternate policy evaluation and greedy policy improvement.
- Recognize that the policy stabilizes in a finite number of iterations for finite MDPs.
- Compare the resulting policy and value function with value iteration.

Concept and real-world RL

Policy iteration alternates two steps: (1) policy evaluation: compute \(V^\pi\) for the current policy \(\pi\); (2) policy improvement: update \(\pi\) to be greedy with respect to \(V^\pi\). The new policy is at least as good as the old one (and strictly better unless the old one was already optimal). Repeating this process converges to the optimal policy in a finite number of iterations for finite MDPs. It is a cornerstone of dynamic programming for RL; in practice we often run only a few evaluation sweeps (generalized policy iteration) or use value iteration, which interleaves evaluation and improvement in a single update. ...
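A minimal sketch of the evaluate/improve loop, again on an assumed 4×4 gridworld with terminals at (0,0) and (3,3) and reward -1 per step. The discount of 0.9 (so evaluation converges even for an initial policy that loops) and the small switching tolerance are illustrative choices, not from the post:

```python
N = 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
TERMINALS = {(0, 0), (N - 1, N - 1)}

def step(s, a):
    return (min(max(s[0] + a[0], 0), N - 1),
            min(max(s[1] + a[1], 0), N - 1))

def policy_iteration(gamma=0.9, theta=1e-8):
    # Start from an arbitrary deterministic policy (always "up").
    pi = {(r, c): 0 for r in range(N) for c in range(N) if (r, c) not in TERMINALS}
    V = {(r, c): 0.0 for r in range(N) for c in range(N)}
    while True:
        # 1) Policy evaluation for the current deterministic policy
        while True:
            delta = 0.0
            for s in pi:
                v = -1.0 + gamma * V[step(s, ACTIONS[pi[s]])]
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # 2) Greedy policy improvement with respect to V
        stable = True
        for s in pi:
            q = [-1.0 + gamma * V[step(s, ACTIONS[a])] for a in range(len(ACTIONS))]
            best = max(range(len(ACTIONS)), key=lambda a: q[a])
            if q[best] > q[pi[s]] + 1e-6:  # switch only on a clear improvement
                pi[s], stable = best, False
        if stable:  # policy unchanged, hence optimal (finite MDP)
            return pi, V
```

The tolerance in the improvement step guards against oscillating between equally good actions due to floating-point noise; it is one common way to guarantee termination.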

March 10, 2026 · 4 min · 652 words · codefrydev

Windy Gridworld

Learning objectives

- Understand the Windy Gridworld environment: movement is shifted by a column-dependent wind.
- Implement the transition model and run iterative policy evaluation and policy iteration on it.
- Compare with the standard gridworld (no wind).

Theory

Windy Gridworld (Sutton & Barto) is a rectangular grid (e.g. 7×10) with:

- States: cell positions \((row, col)\).
- Actions: up, down, left, right (four actions).
- Wind: each column has a fixed wind strength (a non-negative integer). When the agent takes an action, the resulting row is shifted upward by the wind strength of the current column (wind blows toward row 0). With rows numbered from the top, from cell \((r, c)\) the action “up” leads to \((r - 1 - \text{wind}[c], c)\), “down” to \((r + 1 - \text{wind}[c], c)\), and so on. Positions are clipped so the agent never leaves the grid.
- Terminal state: one goal cell. Typical reward: -1 per step until the goal is reached.

So the same action can lead to different next states depending on the column (wind). The MDP is still finite and deterministic given state and action (the wind is fixed per column). This makes the problem slightly harder than a plain gridworld and is a good testbed for policy evaluation and policy iteration. ...
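The transition model can be sketched directly from the description. The 7×10 layout, wind strengths, and goal cell below follow Sutton & Barto's example; treat them as illustrative assumptions rather than the post's exact code:

```python
# Windy Gridworld transition model: rows numbered from the top, wind pushes
# the agent toward row 0 by the strength of the *current* column.
ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]   # wind strength per column
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GOAL = (3, 7)

def step(state, action):
    """Apply the action, subtract the wind from the row, clip to the grid.

    Returns (next_state, reward, done); reward is -1 per step until the goal.
    """
    r, c = state
    dr, dc = ACTIONS[action]
    nr = min(max(r + dr - WIND[c], 0), ROWS - 1)  # wind decreases the row index
    nc = min(max(c + dc, 0), COLS - 1)
    done = (nr, nc) == GOAL
    return (nr, nc), -1, done
```

For example, moving right from (3, 6) lands in (1, 7), not (3, 7): the wind of strength 2 in column 6 pushes the agent two rows up.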

March 10, 2026 · 2 min · 392 words · codefrydev

Chapter 9: Dynamic Programming — Value Iteration

Learning objectives

- Implement value iteration: repeatedly apply the Bellman optimality update for \(V\).
- Extract the optimal policy as greedy with respect to the converged \(V\).
- Relate value iteration to policy iteration (one sweep of “improvement” per state, no full evaluation).

Concept and real-world RL

Value iteration updates the state-value function using the Bellman optimality equation: \(V(s) \leftarrow \max_a \sum_{s',r} P(s',r|s,a)\,[r + \gamma V(s')]\). It does not maintain an explicit policy; after convergence, the optimal policy is greedy with respect to \(V\). Value iteration is simpler than full policy iteration (no inner evaluation loop) and converges to \(V^*\). It is used in planning when the model is known; in large or continuous spaces we approximate \(V\) or \(Q\) with function approximators and use approximate dynamic programming or model-free methods. ...
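Compared with policy evaluation, the only change in the update is replacing the expectation over \(\pi\) with a max over actions. A sketch on the same assumed 4×4 gridworld (terminals (0,0) and (3,3), reward -1 per step); grid and threshold are illustrative:

```python
N = 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
TERMINALS = {(0, 0), (N - 1, N - 1)}

def step(s, a):
    return (min(max(s[0] + a[0], 0), N - 1),
            min(max(s[1] + a[1], 0), N - 1))

def value_iteration(gamma=1.0, theta=1e-6):
    V = {(r, c): 0.0 for r in range(N) for c in range(N)}
    while True:
        delta = 0.0
        for s in V:
            if s in TERMINALS:
                continue
            # Bellman optimality update: max over actions, no explicit policy
            v = max(-1.0 + gamma * V[step(s, a)] for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    # Extract the optimal policy as greedy with respect to the converged V
    pi = {s: max(range(len(ACTIONS)),
                 key=lambda a, s=s: -1.0 + gamma * V[step(s, ACTIONS[a])])
          for s in V if s not in TERMINALS}
    return V, pi
```

With this reward structure the optimal value of a cell is just minus its step distance to the nearest terminal.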

March 10, 2026 · 3 min · 624 words · codefrydev

Dynamic Programming: Gridworld in Code

Learning objectives

- Implement a 4×4 gridworld environment (states, actions, transitions, rewards) in code.
- Implement iterative policy evaluation and stop when the values converge.
- Implement policy iteration (evaluate, then improve) and optionally value iteration.

Gridworld in code

- States: use a 4×4 grid. States can be \((row, col)\) pairs or a flat index. The terminal states (0,0) and (3,3) have value 0 and are not updated.
- Actions: 0=up, 1=down, 2=left, 3=right. Moving off the grid leaves the agent in place.

...
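The environment description above, using the flat-index state representation, might look like the following. The class name and layout are illustrative assumptions, not the post's code:

```python
class GridWorld:
    """4x4 gridworld with flat state indices 0..15; terminals absorb with reward 0."""
    N = 4
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # 0=up, 1=down, 2=left, 3=right

    def __init__(self):
        self.terminals = {0, self.N * self.N - 1}  # flat indices of (0,0) and (3,3)

    def to_flat(self, r, c):
        return r * self.N + c

    def step(self, s, a):
        """Return (next_state, reward) for taking action a in state s."""
        if s in self.terminals:
            return s, 0  # terminal states are absorbing and give no reward
        r, c = divmod(s, self.N)
        dr, dc = self.MOVES[a]
        nr, nc = r + dr, c + dc
        if not (0 <= nr < self.N and 0 <= nc < self.N):
            nr, nc = r, c  # moving off the grid leaves the agent in place
        return self.to_flat(nr, nc), -1
```

Usage: from state 1 (cell (0,1)), action 2 (left) reaches the terminal at flat index 0 with reward -1.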

March 10, 2026 · 2 min · 390 words · codefrydev

Chapter 10: Limitations of Dynamic Programming

Learning objectives

- Compute the number of states and transition probabilities for a small finite MDP.
- Explain why tabular methods (storing a value per state or state-action pair) do not scale.
- Describe how function approximation (e.g. linear or neural) generalizes across states.

Concept and real-world RL

Dynamic programming (policy iteration, value iteration) assumes we can store a value for every state (or state-action pair) and iterate over all of them. In a 10×10 grid that is 100 states, which is manageable. In real problems (continuous state spaces, or discrete but huge spaces such as board games or high-dimensional sensors), the number of states is enormous or infinite, so we cannot store a table. ...
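The counting exercise is just arithmetic; a quick back-of-the-envelope version for the 10×10 grid mentioned above (the four-action assumption matches the gridworld posts):

```python
# Table sizes for the tabular setting: one entry per state for V, one per
# (state, action) pair for Q, and a dense P(s'|s,a) over all triples.
n_states = 10 * 10   # the 10x10 grid from the text
n_actions = 4
v_entries = n_states                           # V(s)
q_entries = n_states * n_actions               # Q(s, a)
p_entries = n_states * n_actions * n_states    # P(s' | s, a)
print(v_entries, q_entries, p_entries)  # 100 400 40000
```

Even this tiny grid needs 40,000 entries for a dense transition table; the counts grow multiplicatively with state-space size, which is the scaling problem the chapter describes.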

March 10, 2026 · 4 min · 683 words · codefrydev

Chapter 53: Planning with Known Models

Learning objectives

- Implement a planner using breadth-first search (BFS) for a gridworld with known deterministic dynamics.
- Recover the optimal policy (path to goal) and compare with dynamic programming (value iteration) in terms of computation and result.
- Relate BFS to shortest-path planning in robot navigation.

Concept and real-world RL

When the model is known and deterministic, we can plan without learning: BFS finds the shortest path from start to goal, while value iteration computes optimal values for all states. In robot navigation (grid or graph), BFS is used for pathfinding; DP is used when we need values everywhere (e.g. for reward shaping). Both assume the model is correct; in RL we often learn the model or the value function from data. ...
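A BFS planner for deterministic grid dynamics can be sketched as below; the function name, grid dimensions, and optional wall set are illustrative assumptions:

```python
from collections import deque

def bfs_plan(rows, cols, start, goal, walls=frozenset()):
    """Return a shortest action sequence from start to goal, or None if unreachable."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    frontier = deque([start])
    came_from = {start: None}  # cell -> (previous cell, action taken to reach it)
    while frontier:
        cell = frontier.popleft()
        if cell == goal:
            break
        for name, (dr, dc) in moves.items():
            nxt = (cell[0] + dr, cell[1] + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and nxt not in walls and nxt not in came_from):
                came_from[nxt] = (cell, name)
                frontier.append(nxt)
    if goal not in came_from:
        return None
    plan = []  # walk backwards from the goal, then reverse
    cell = goal
    while came_from[cell] is not None:
        cell, action = came_from[cell]
        plan.append(action)
    return plan[::-1]
```

Unlike value iteration, this returns a single start-to-goal path rather than values for every state, which is exactly the computational trade-off the chapter compares.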

March 10, 2026 · 3 min · 443 words · codefrydev

Tabular Methods

This page covers the tabular methods you need for the preliminary assessment: policy iteration and value iteration, the difference between Monte Carlo and TD, on-policy vs off-policy learning, and the Q-learning update rule. Back to Preliminary.

Why this matters for RL

When the state and action spaces are small enough, we can store one value per state (or state-action pair) and update them from experience or from the model. Dynamic programming does this when the model is known; Monte Carlo and TD do it from samples. Q-learning is the canonical off-policy TD method and is the basis of many deep RL algorithms (e.g. DQN). You need to know how these methods differ and how to write the Q-learning update. ...
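The Q-learning update mentioned above, written out as a function; the table layout (state index to list of action values) and the step sizes are illustrative assumptions:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One off-policy TD update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    The target bootstraps from the greedy action in s_next, regardless of
    which action the behavior policy actually takes there (that is what
    makes it off-policy). For a terminal s_next you would drop the
    bootstrap term and use just r.
    """
    td_target = r + gamma * max(Q[s_next])
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error
```

For example, with a zero-initialized two-state table, observing reward 1.0 for action 1 in state 0 moves Q[0][1] from 0 to alpha * 1.0 = 0.1.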

March 10, 2026 · 6 min · 1277 words · codefrydev