Chapters 1–10 covered the reinforcement learning framework and multi-armed bandits through MDPs, value functions, Bellman equations, and dynamic programming (policy evaluation, policy iteration, and value iteration).
Bandits: Nonstationary
Learning objectives

- Understand why a plain sample mean is a poor estimator when reward distributions change over time.
- Use the exponential recency-weighted average (constant step size) for nonstationary bandits.
- Implement and compare a fixed step size vs. the sample mean on a drifting testbed (see the sketch below).

Theory

In nonstationary bandits, the expected reward of each arm can change over time. The sample-mean update \(\bar{Q}_{n+1} = \bar{Q}_n + \frac{1}{n+1}(r - \bar{Q}_n)\) gives equal weight to all past rewards, so old data can dominate and the agent is slow to adapt. ...
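As a concrete illustration of the comparison objective, here is a minimal sketch, assuming a 10-armed testbed whose true values drift by independent Gaussian random walks: it pits sample-average updates against a constant step size \(\alpha = 0.1\) under \(\epsilon\)-greedy action selection. The arm count, drift scale, and the helper name `run_bandit` are illustrative assumptions, not the chapter's reference code.

```python
# A minimal sketch: compare sample-average vs. constant step-size estimates
# on a drifting 10-armed testbed. N_ARMS, DRIFT, and run_bandit are
# illustrative choices, not taken from the chapter.
import numpy as np

N_ARMS, STEPS, RUNS = 10, 10_000, 200
EPSILON, ALPHA, DRIFT = 0.1, 0.1, 0.01

def run_bandit(rng, constant_step):
    q_true = np.zeros(N_ARMS)    # true action values, all start equal
    q_est = np.zeros(N_ARMS)     # estimated action values
    counts = np.zeros(N_ARMS)    # pull counts, used for sample averages
    optimal = np.zeros(STEPS)    # 1 when the currently optimal arm was chosen
    for t in range(STEPS):
        # epsilon-greedy: explore with probability EPSILON, else exploit
        if rng.random() < EPSILON:
            a = int(rng.integers(N_ARMS))
        else:
            a = int(np.argmax(q_est))
        optimal[t] = a == int(np.argmax(q_true))
        r = q_true[a] + rng.normal()           # noisy reward around the true value
        counts[a] += 1
        step = ALPHA if constant_step else 1.0 / counts[a]
        q_est[a] += step * (r - q_est[a])      # incremental update Q <- Q + step*(r - Q)
        q_true += DRIFT * rng.normal(size=N_ARMS)  # each arm's value random-walks
    return optimal

rng = np.random.default_rng(0)
for label, constant in [("sample mean", False), ("alpha = 0.1", True)]:
    pct = np.mean([run_bandit(rng, constant) for _ in range(RUNS)], axis=0)
    # how often each method picks the (moving) optimal arm late in the run
    print(f"{label:12s} optimal-action rate, last 1000 steps: {pct[-1000:].mean():.2%}")
```

On a drifting testbed like this, the constant step size typically tracks the moving optimum noticeably better than the sample mean in the long run, since its effective memory is only the most recent rewards (on the order of \(1/\alpha\) of them), while the sample mean's step \(1/(n+1)\) shrinks toward zero and freezes the estimate on stale data.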