Volume 1 runs from the reinforcement learning framework and multi-armed bandits through MDPs, value functions, and Bellman equations to dynamic programming (policy evaluation, policy iteration, value iteration). Chapters 1–10.
Volume 1: Mathematical Foundations
Chapters 1–10 — RL framework, bandits, MDPs, reward hypothesis, value functions, Bellman equations, dynamic programming.
Gridworld discounted return from a sequence of actions.
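The discounted return can be computed with a single backward pass over the episode's rewards. A minimal sketch; the function name `discounted_return` and the example reward sequence are illustrative, not from the text.

```python
def discounted_return(rewards, gamma):
    """Return G = r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite episode.

    Accumulating backwards (G_t = r_t + gamma * G_{t+1}) avoids computing
    powers of gamma explicitly.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three steps costing -1 each, then a terminal reward of 10, gamma = 0.9.
print(discounted_return([-1, -1, -1, 10], 0.9))
```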
10-armed testbed with epsilon-greedy vs greedy.
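A compact version of the testbed experiment, assuming the standard setup (arm means drawn from a unit normal, unit-variance rewards, sample-average estimates); the function name and defaults are illustrative.

```python
import random

def run_bandit(epsilon, n_arms=10, steps=1000, seed=0):
    """Epsilon-greedy on one bandit instance; returns average reward per step."""
    rng = random.Random(seed)
    true_means = [rng.gauss(0, 1) for _ in range(n_arms)]  # q*(a)
    q = [0.0] * n_arms   # sample-average value estimates
    n = [0] * n_arms     # pull counts
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(n_arms)                    # explore
        else:
            a = max(range(n_arms), key=lambda i: q[i])   # exploit
        r = rng.gauss(true_means[a], 1)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]   # incremental sample-average update
        total += r
    return total / steps
```

Pure greedy (epsilon = 0) often locks onto a mediocre arm early; a small epsilon keeps sampling the alternatives and typically earns more on average.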
Using optimistic initial Q-values to encourage early exploration in multi-armed bandits.
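The mechanism can be shown in a few lines: start every estimate well above any plausible reward, act purely greedily, and disappointment does the exploring. A sketch under assumed parameters (q_init = 5, constant step size 0.1); names are illustrative.

```python
import random

def optimistic_bandit(q_init=5.0, n_arms=10, steps=500, alpha=0.1, seed=1):
    """Greedy selection with optimistic initial estimates; returns arms tried."""
    rng = random.Random(seed)
    true_means = [rng.gauss(0, 1) for _ in range(n_arms)]
    q = [q_init] * n_arms  # optimistic: every arm looks better than it is
    tried = set()
    for _ in range(steps):
        a = max(range(n_arms), key=lambda i: q[i])  # purely greedy
        tried.add(a)
        r = rng.gauss(true_means[a], 1)
        q[a] += alpha * (r - q[a])  # estimate decays toward the true mean
    return tried
```

Each disappointing reward drags the chosen arm's estimate down, so a different arm becomes greedy-best and every arm gets tried early without any explicit exploration.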
Two-state MDP transition probability matrices.
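Transition matrices for a two-state MDP are small enough to write out by hand. A hypothetical example (the action names and probabilities are invented for illustration): one matrix per action, rows summing to 1, with matrix powers giving multi-step transition probabilities.

```python
import numpy as np

# P[a][s, s'] = probability of moving from state s to s' under action a.
P = {
    "stay": np.array([[0.9, 0.1],
                      [0.2, 0.8]]),
    "jump": np.array([[0.1, 0.9],
                      [0.7, 0.3]]),
}

# Every row of a valid transition matrix must sum to 1.
for mat in P.values():
    assert np.allclose(mat.sum(axis=1), 1.0)

# Two-step transition probabilities under a repeated action: matrix product.
print(P["stay"] @ P["stay"])
```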
Upper Confidence Bound (UCB1) algorithm for multi-armed bandits—balance exploration and exploitation using uncertainty.
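The selection rule is argmax over Q(a) plus an uncertainty bonus that shrinks as an arm is pulled. A minimal sketch with Gaussian rewards; the function name, exploration constant, and seed are assumptions for illustration.

```python
import math
import random

def ucb1(true_means, steps, c=2.0, seed=0):
    """UCB1: pick argmax_a Q(a) + c * sqrt(ln t / N(a)); returns pull counts."""
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k
    n = [0] * k
    for t in range(1, steps + 1):
        if t <= k:
            a = t - 1  # pull each arm once so every N(a) > 0
        else:
            a = max(range(k),
                    key=lambda i: q[i] + c * math.sqrt(math.log(t) / n[i]))
        r = rng.gauss(true_means[a], 1)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]  # incremental sample average
    return n
```

The bonus term decays like 1/sqrt(N(a)), so exploration concentrates on arms that are either promising or rarely tried, and the suboptimal arms are pulled only logarithmically often.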
Designing a reward function for a self-driving car, and the risk of reward hacking.
The classic gridworld environment: states, actions, transitions, and terminal states.
Bayesian bandits and Thompson Sampling—sample from the posterior to balance exploration and exploitation.
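For Bernoulli arms the posterior is a Beta distribution, which makes Thompson Sampling a few lines long. A sketch assuming Beta(1, 1) priors; the function name and test arms are illustrative.

```python
import random

def thompson(true_probs, steps, seed=0):
    """Bernoulli Thompson Sampling with Beta(1, 1) priors; returns pull counts."""
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1] * k  # successes + 1 (Beta shape parameters)
    beta = [1] * k   # failures + 1
    counts = [0] * k
    for _ in range(steps):
        # Draw one plausible mean per arm from its posterior, play the best draw.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        a = max(range(k), key=lambda i: samples[i])
        counts[a] += 1
        if rng.random() < true_probs[a]:
            alpha[a] += 1
        else:
            beta[a] += 1
    return counts
```

Exploration is automatic: an arm with a wide posterior occasionally produces the largest sample and gets played, while arms the posterior is confident are bad almost never do.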
State-value function V^π for a random policy on the Chapter 3 MDP.
How to design reward signals for MDPs and gridworld—shaping, terminal rewards, and step penalties.
When reward distributions change over time—exponential recency-weighted average and constant step size.
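The constant step size update Q ← Q + α(R − Q) is an exponential recency-weighted average: recent rewards get geometrically more weight, so the estimate tracks a drifting mean instead of averaging it away. A small sketch; the function name and the step change in rewards are illustrative.

```python
def constant_step_estimate(rewards, alpha=0.1, q0=0.0):
    """Exponential recency-weighted average: Q <- Q + alpha * (R - Q)."""
    q = q0
    for r in rewards:
        q += alpha * (r - q)
    return q

# The reward mean is 0 for 50 steps, then jumps to 1.
stale = constant_step_estimate([0.0] * 50)               # still near 0
fresh = constant_step_estimate([0.0] * 50 + [1.0] * 50)  # has tracked the jump
```

A sample average (step size 1/n) would instead settle near 0.5 here, because it weights old and new rewards equally.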
Derive Bellman optimality equation for Q*(s,a).
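The derivation can be stated compactly. Writing p(s′ | s, a) for the transition probabilities and r(s, a, s′) for the expected reward (standard notation, assumed here rather than taken from the text), the one-step expansion of Q* combined with V*(s′) = max over a′ of Q*(s′, a′) gives the optimality equation:

```latex
Q^*(s,a) = \sum_{s'} p(s' \mid s, a)\,\bigl[\, r(s,a,s') + \gamma V^*(s') \,\bigr],
\qquad V^*(s') = \max_{a'} Q^*(s', a')
```

Substituting the second identity into the first yields the Bellman optimality equation for Q*:

```latex
Q^*(s,a) = \sum_{s'} p(s' \mid s, a)\,\Bigl[\, r(s,a,s') + \gamma \max_{a'} Q^*(s', a') \,\Bigr]
```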
When to implement bandits from scratch vs. use existing libraries—learning goals and control.
Iterative policy evaluation on 4×4 gridworld.
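This is the classic exercise: 4×4 grid, terminal corner states, reward −1 per step, equiprobable random policy, undiscounted. A sketch using an in-place (Gauss-Seidel) sweep; the state numbering (0–15, row-major) and function name are assumptions.

```python
def policy_evaluation(theta=1e-6, gamma=1.0):
    """Iterative policy evaluation for the equiprobable random policy on a
    4x4 gridworld with terminal corners (states 0 and 15), reward -1 per step."""
    def step(s, a):
        r, c = divmod(s, 4)
        dr, dc = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}[a]
        nr, nc = r + dr, c + dc
        # Bumping a wall leaves the agent in place.
        return nr * 4 + nc if 0 <= nr < 4 and 0 <= nc < 4 else s

    V = [0.0] * 16
    while True:
        delta = 0.0
        for s in range(16):
            if s in (0, 15):
                continue  # terminal states keep value 0
            v = sum(0.25 * (-1 + gamma * V[step(s, a)]) for a in "UDLR")
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V
```

The sweep converges to the familiar values (−14 next to a terminal corner, −22 in the far corners of the reachable states).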
Policy iteration and comparison with value iteration.
Gridworld with wind: actions are shifted by a wind effect. Theory and code for policy evaluation and policy iteration.
Value iteration on 4×4 gridworld, optimal V and policy.
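Value iteration on the same 4×4 layout replaces the expectation over the random policy with a max over actions, then reads off a greedy policy at the end. A sketch with the same assumed state numbering; here the optimal value of each state is minus its step distance to the nearest terminal corner.

```python
def value_iteration(theta=1e-9, gamma=1.0):
    """Value iteration on a 4x4 gridworld with terminal corners (states 0 and
    15) and reward -1 per step; returns optimal V and a greedy policy."""
    moves = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

    def step(s, a):
        r, c = divmod(s, 4)
        dr, dc = moves[a]
        nr, nc = r + dr, c + dc
        return nr * 4 + nc if 0 <= nr < 4 and 0 <= nc < 4 else s

    V = [0.0] * 16
    while True:
        delta = 0.0
        for s in range(16):
            if s in (0, 15):
                continue
            v = max(-1 + gamma * V[step(s, a)] for a in moves)  # Bellman optimality backup
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    # Greedy policy extraction from the converged values.
    policy = {s: max(moves, key=lambda a: -1 + gamma * V[step(s, a)])
              for s in range(16) if s not in (0, 15)}
    return V, policy
```

Because the dynamics are deterministic and rewards are integer, the backups converge exactly after a handful of sweeps.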
Code walkthrough for gridworld, iterative policy evaluation, and policy iteration.
State and transition counts for a 10×10 gridworld; the motivation for function approximation.
15 short drill problems for Volume 1: discounted return, MDPs, value functions, Bellman equations, and dynamic programming.
Review of Volume 1 concepts and a preview of Volume 2: from dynamic programming, which requires a known model, to model-free methods.