Volume 1 runs from the reinforcement learning framework and multi-armed bandits through MDPs, value functions, and Bellman equations to dynamic programming (policy evaluation, policy iteration, value iteration). Chapters 1–10.
Volume 1: Mathematical Foundations
Chapters 1–10 — RL framework, bandits, MDPs, reward hypothesis, value functions, Bellman equations, dynamic programming.
Gridworld discounted return from a sequence of actions.
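The discounted return can be computed with a single backward pass over the episode's rewards. A minimal sketch; the function name `discounted_return` and the example reward sequence are illustrative, not from the text.

```python
def discounted_return(rewards, gamma):
    """Return G = r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite episode.

    Accumulating backwards (G_t = r_t + gamma * G_{t+1}) avoids computing
    powers of gamma explicitly.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three steps costing -1 each, then a terminal reward of 10, gamma = 0.9.
print(discounted_return([-1, -1, -1, 10], 0.9))
```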
10-armed testbed with epsilon-greedy vs greedy.
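A compact version of the testbed experiment, assuming the standard setup (arm means drawn from a unit normal, unit-variance rewards, sample-average estimates); the function name and defaults are illustrative.

```python
import random

def run_bandit(epsilon, n_arms=10, steps=1000, seed=0):
    """Epsilon-greedy on one bandit instance; returns average reward per step."""
    rng = random.Random(seed)
    true_means = [rng.gauss(0, 1) for _ in range(n_arms)]  # q*(a)
    q = [0.0] * n_arms   # sample-average value estimates
    n = [0] * n_arms     # pull counts
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(n_arms)                    # explore
        else:
            a = max(range(n_arms), key=lambda i: q[i])   # exploit
        r = rng.gauss(true_means[a], 1)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]   # incremental sample-average update
        total += r
    return total / steps
```

Pure greedy (epsilon = 0) often locks onto a mediocre arm early; a small epsilon keeps sampling the alternatives and typically earns more on average.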
Using optimistic initial Q-values to encourage early exploration in multi-armed bandits.
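The mechanism can be shown in a few lines: start every estimate well above any plausible reward, act purely greedily, and disappointment does the exploring. A sketch under assumed parameters (q_init = 5, constant step size 0.1); names are illustrative.

```python
import random

def optimistic_bandit(q_init=5.0, n_arms=10, steps=500, alpha=0.1, seed=1):
    """Greedy selection with optimistic initial estimates; returns arms tried."""
    rng = random.Random(seed)
    true_means = [rng.gauss(0, 1) for _ in range(n_arms)]
    q = [q_init] * n_arms  # optimistic: every arm looks better than it is
    tried = set()
    for _ in range(steps):
        a = max(range(n_arms), key=lambda i: q[i])  # purely greedy
        tried.add(a)
        r = rng.gauss(true_means[a], 1)
        q[a] += alpha * (r - q[a])  # estimate decays toward the true mean
    return tried
```

Each disappointing reward drags the chosen arm's estimate down, so a different arm becomes greedy-best and every arm gets tried early without any explicit exploration.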
Two-state MDP transition probability matrices.
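Transition matrices for a two-state MDP are small enough to write out by hand. A hypothetical example (the action names and probabilities are invented for illustration): one matrix per action, rows summing to 1, with matrix powers giving multi-step transition probabilities.

```python
import numpy as np

# P[a][s, s'] = probability of moving from state s to s' under action a.
P = {
    "stay": np.array([[0.9, 0.1],
                      [0.2, 0.8]]),
    "jump": np.array([[0.1, 0.9],
                      [0.7, 0.3]]),
}

# Every row of a valid transition matrix must sum to 1.
for mat in P.values():
    assert np.allclose(mat.sum(axis=1), 1.0)

# Two-step transition probabilities under a repeated action: matrix product.
print(P["stay"] @ P["stay"])
```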
Upper Confidence Bound (UCB1) algorithm for multi-armed bandits—balance exploration and exploitation using uncertainty.
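The selection rule is argmax over Q(a) plus an uncertainty bonus that shrinks as an arm is pulled. A minimal sketch with Gaussian rewards; the function name, exploration constant, and seed are assumptions for illustration.

```python
import math
import random

def ucb1(true_means, steps, c=2.0, seed=0):
    """UCB1: pick argmax_a Q(a) + c * sqrt(ln t / N(a)); returns pull counts."""
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k
    n = [0] * k
    for t in range(1, steps + 1):
        if t <= k:
            a = t - 1  # pull each arm once so every N(a) > 0
        else:
            a = max(range(k),
                    key=lambda i: q[i] + c * math.sqrt(math.log(t) / n[i]))
        r = rng.gauss(true_means[a], 1)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]  # incremental sample average
    return n
```

The bonus term decays like 1/sqrt(N(a)), so exploration concentrates on arms that are either promising or rarely tried, and the suboptimal arms are pulled only logarithmically often.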
Designing a reward function for a self-driving car, and the risk of reward hacking.
The classic gridworld environment: states, actions, transitions, and terminal states.
Bayesian bandits and Thompson Sampling—sample from the posterior to balance exploration and exploitation.
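For Bernoulli arms the posterior is a Beta distribution, which makes Thompson Sampling a few lines long. A sketch assuming Beta(1, 1) priors; the function name and test arms are illustrative.

```python
import random

def thompson(true_probs, steps, seed=0):
    """Bernoulli Thompson Sampling with Beta(1, 1) priors; returns pull counts."""
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1] * k  # successes + 1 (Beta shape parameters)
    beta = [1] * k   # failures + 1
    counts = [0] * k
    for _ in range(steps):
        # Draw one plausible mean per arm from its posterior, play the best draw.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        a = max(range(k), key=lambda i: samples[i])
        counts[a] += 1
        if rng.random() < true_probs[a]:
            alpha[a] += 1
        else:
            beta[a] += 1
    return counts
```

Exploration is automatic: an arm with a wide posterior occasionally produces the largest sample and gets played, while arms the posterior is confident are bad almost never do.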
State-value function V^π for a random policy on the Chapter 3 MDP.
How to design reward signals for MDPs and gridworld—shaping, terminal rewards, and step penalties.
When reward distributions change over time—exponential recency-weighted average and constant step size.
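The constant step size update Q ← Q + α(R − Q) is an exponential recency-weighted average: recent rewards get geometrically more weight, so the estimate tracks a drifting mean instead of averaging it away. A small sketch; the function name and the step change in rewards are illustrative.

```python
def constant_step_estimate(rewards, alpha=0.1, q0=0.0):
    """Exponential recency-weighted average: Q <- Q + alpha * (R - Q)."""
    q = q0
    for r in rewards:
        q += alpha * (r - q)
    return q

# The reward mean is 0 for 50 steps, then jumps to 1.
stale = constant_step_estimate([0.0] * 50)               # still near 0
fresh = constant_step_estimate([0.0] * 50 + [1.0] * 50)  # has tracked the jump
```

A sample average (step size 1/n) would instead settle near 0.5 here, because it weights old and new rewards equally.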
Derive Bellman optimality equation for Q*(s,a).
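The derivation can be stated compactly. Writing p(s′ | s, a) for the transition probabilities and r(s, a, s′) for the expected reward (standard notation, assumed here rather than taken from the text), the one-step expansion of Q* combined with V*(s′) = max over a′ of Q*(s′, a′) gives the optimality equation:

```latex
Q^*(s,a) = \sum_{s'} p(s' \mid s, a)\,\bigl[\, r(s,a,s') + \gamma V^*(s') \,\bigr],
\qquad V^*(s') = \max_{a'} Q^*(s', a')
```

Substituting the second identity into the first yields the Bellman optimality equation for Q*:

```latex
Q^*(s,a) = \sum_{s'} p(s' \mid s, a)\,\Bigl[\, r(s,a,s') + \gamma \max_{a'} Q^*(s', a') \,\Bigr]
```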
When to implement bandits from scratch vs. use existing libraries—learning goals and control.
Iterative policy evaluation on 4×4 gridworld.
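This is the classic exercise: 4×4 grid, terminal corner states, reward −1 per step, equiprobable random policy, undiscounted. A sketch using an in-place (Gauss-Seidel) sweep; the state numbering (0–15, row-major) and function name are assumptions.

```python
def policy_evaluation(theta=1e-6, gamma=1.0):
    """Iterative policy evaluation for the equiprobable random policy on a
    4x4 gridworld with terminal corners (states 0 and 15), reward -1 per step."""
    def step(s, a):
        r, c = divmod(s, 4)
        dr, dc = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}[a]
        nr, nc = r + dr, c + dc
        # Bumping a wall leaves the agent in place.
        return nr * 4 + nc if 0 <= nr < 4 and 0 <= nc < 4 else s

    V = [0.0] * 16
    while True:
        delta = 0.0
        for s in range(16):
            if s in (0, 15):
                continue  # terminal states keep value 0
            v = sum(0.25 * (-1 + gamma * V[step(s, a)]) for a in "UDLR")
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V
```

The sweep converges to the familiar values (−14 next to a terminal corner, −22 in the far corners of the reachable states).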
Policy iteration and comparison with value iteration.
Gridworld with wind: actions are shifted by a wind effect. Theory and code for policy evaluation and policy iteration.
Value iteration on 4×4 gridworld, optimal V and policy.
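Value iteration on the same 4×4 layout replaces the expectation over the random policy with a max over actions, then reads off a greedy policy at the end. A sketch with the same assumed state numbering; here the optimal value of each state is minus its step distance to the nearest terminal corner.

```python
def value_iteration(theta=1e-9, gamma=1.0):
    """Value iteration on a 4x4 gridworld with terminal corners (states 0 and
    15) and reward -1 per step; returns optimal V and a greedy policy."""
    moves = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

    def step(s, a):
        r, c = divmod(s, 4)
        dr, dc = moves[a]
        nr, nc = r + dr, c + dc
        return nr * 4 + nc if 0 <= nr < 4 and 0 <= nc < 4 else s

    V = [0.0] * 16
    while True:
        delta = 0.0
        for s in range(16):
            if s in (0, 15):
                continue
            v = max(-1 + gamma * V[step(s, a)] for a in moves)  # Bellman optimality backup
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    # Greedy policy extraction from the converged values.
    policy = {s: max(moves, key=lambda a: -1 + gamma * V[step(s, a)])
              for s in range(16) if s not in (0, 15)}
    return V, policy
```

Because the dynamics are deterministic and rewards are integer, the backups converge exactly after a handful of sweeps.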
Code walkthrough for gridworld, iterative policy evaluation, and policy iteration.
State and transition counts for a 10×10 gridworld; the motivation for function approximation.
15 short drill problems for Volume 1: discounted return, MDPs, value functions, Bellman equations, and dynamic programming.
Review of Volume 1 concepts and a preview of Volume 2: from dynamic programming, which requires a known model, to model-free methods.