You have finished Volume 2. Before starting Volume 3, take this 10-minute review.


Volume 2 Recap Quiz

Q1. What is the TD error and why is it useful?
The TD error is δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t). It measures how much the current estimate V(S_t) differs from a one-step bootstrapped target. It is the signal used to update value estimates. When δ_t = 0 everywhere, the values are self-consistent (a fixed point of the Bellman operator).
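As a concrete sketch of that formula (the chain of states, reward, and step size below are illustrative values, not from the text):

```python
import numpy as np

# One-step TD error on a toy 3-state chain (0 -> 1 -> 2).
# V, gamma, and the reward are made-up illustrative values.
V = np.array([0.5, 0.8, 0.0])   # current value estimates
gamma = 0.9

def td_error(r, s, s_next):
    """delta = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)."""
    return r + gamma * V[s_next] - V[s]

delta = td_error(r=1.0, s=0, s_next=1)   # 1.0 + 0.9*0.8 - 0.5 = 1.22
V[0] += 0.1 * delta                      # TD(0) update with alpha = 0.1
```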
Q2. What makes Q-learning off-policy?
Q-learning updates Q(s,a) toward the maximum next-state Q-value — the target policy is greedy. But the agent may be following a different behavior policy (e.g. ε-greedy) during experience collection. Because target ≠ behavior policy, Q-learning is off-policy. This lets it learn the optimal Q* even while exploring.
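The target/behavior split can be made explicit in a few lines. A minimal sketch (the state and action counts, α, γ, and ε are illustrative choices, not from the text):

```python
import random

# Epsilon-greedy BEHAVIOR policy vs greedy TARGET in the Q-learning update.
# 3 states, 2 actions; alpha/gamma/eps are illustrative values.
Q = {(s, a): 0.0 for s in range(3) for a in range(2)}
alpha, gamma, eps = 0.1, 0.9, 0.2

def behavior_action(s):
    """Behavior policy: epsilon-greedy, so the agent sometimes explores."""
    if random.random() < eps:
        return random.choice([0, 1])
    return max((0, 1), key=lambda a: Q[(s, a)])

def q_update(s, a, r, s_next):
    """Target policy is greedy: bootstrap from max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in (0, 1))
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# The action comes from the behavior policy; the target uses the max.
q_update(s=0, a=behavior_action(0), r=1.0, s_next=1)
```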
Q3. What is the key advantage of TD methods over Monte Carlo?
TD methods can update online after every step — they do not need to wait until the end of an episode. This makes them applicable to continuing tasks (no episode boundary) and much faster to learn in long-episode environments. Monte Carlo must wait for the full return, which can have very high variance in long episodes.
Q4. What is the tabular Q-table, and why does it break down for CartPole?

A tabular Q-table stores one Q(s,a) value per (state, action) pair in a dictionary or array. For a discrete gridworld with 9 states and 4 actions, the table has 36 entries — manageable.

For CartPole, the state is [cart position, cart velocity, pole angle, pole angular velocity] — four continuous values. Even coarsely discretizing each into 10 bins gives 10^4 = 10,000 states × 2 actions = 20,000 entries. More realistic discretizations (100 bins each) give 10^8 states (2 × 10^8 entries). The table explodes exponentially with state dimension (the curse of dimensionality).

Q5. What does it mean to 'generalize' in RL, and why can't tabular methods do it?
Generalization: learning that similar states have similar values, so experience with one state informs estimates for nearby states. Tabular methods treat every state independently — seeing state (1.05, …) tells you nothing about (1.06, …). Neural networks can generalize: shared weights mean that gradient updates to one input region affect similar inputs.

What Changes in Volume 3

| | Volume 2 (Tabular) | Volume 3 (Function Approximation) |
|---|---|---|
| State representation | Discrete index into table | Feature vector / raw pixels |
| Value storage | Q-table (one entry per state-action) | Neural network weights |
| State space | Small, discrete | Large, continuous, or image-based |
| Generalization | None — each state independent | Yes — similar inputs → similar outputs |
| Key algorithms | SARSA, Q-learning, n-step | Linear FA, DQN, Double DQN, Dueling DQN |
| Key challenge | Curse of dimensionality | Training stability (deadly triad) |

The key insight: Replace the Q-table Q(s,a) with a parametric function Q(s,a; θ) — a neural network. The weights θ are shared across all states, enabling generalization. The update rule becomes a gradient descent step instead of a table lookup.
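That gradient step can be sketched with a linear Q function (a special case of Q(s,a; θ); the shapes, step sizes, and sample transition below are illustrative assumptions, not the course's code):

```python
import numpy as np

# Linear Q(s, a; theta): Q values = W @ s + b, with theta = (W, b).
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4)) * 0.1   # 2 actions x 4 state features
b = np.zeros(2)
alpha, gamma = 0.01, 0.99

def q_values(s):
    return W @ s + b

def td_step(s, a, r, s_next):
    """Semi-gradient Q-learning: theta += alpha * delta * grad_theta Q(s,a;theta)."""
    delta = r + gamma * q_values(s_next).max() - q_values(s)[a]
    W[a] += alpha * delta * s   # gradient of Q(s,a) wrt W[a] is the state vector
    b[a] += alpha * delta       # gradient wrt b[a] is 1

s = np.array([0.0, 0.1, -0.05, 0.2])
td_step(s, a=1, r=1.0, s_next=s)   # one update moves shared weights, not one table cell
```

Because θ is shared, this single update shifts the predicted values of every similar state, which is exactly the generalization a table cannot provide.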


Bridge Exercise: From Q-table to Q-network

First, see how the Q-table explodes for continuous states:

Try it — edit and run (Shift+Enter)
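A minimal sketch of the count, using the same bin sizes as the dimensionality argument above:

```python
# How many Q-table entries does discretized CartPole need?
# 4 continuous state variables, 2 actions (numbers from the text above).
state_dims, n_actions = 4, 2

for bins in (10, 100):
    n_states = bins ** state_dims
    print(f"{bins} bins/dim -> {n_states:,} states, "
          f"{n_states * n_actions:,} Q-table entries")
```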

Now see the neural network alternative:

Try it — edit and run (Shift+Enter)
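A minimal linear "Q-network" along the lines described below (random initial weights and a made-up sample state; a sketch, not the course's exact cell):

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.normal(size=(2, 4)) * 0.1   # 8 weights: 2 actions x 4 state features
b = np.zeros(2)                     # 2 biases -> 10 parameters total

def q_values(state):
    """No table lookup: Q for both actions is a dot product with shared weights."""
    return W @ state + b

state = np.array([0.02, -0.3, 0.01, 0.4])   # [pos, vel, angle, angular vel]
greedy = int(np.argmax(q_values(state)))
print(q_values(state), "-> greedy action:", greedy)
```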
What changed
The Q-table has been replaced by a weight matrix. Instead of a lookup, we compute a dot product. The number of parameters is fixed (8 weights + 2 biases = 10), regardless of how many unique states the agent visits. Volume 3 extends this to deep networks (DQN) and adds techniques (replay buffer, target network) to make training stable.

Ready for Volume 3?

Before continuing, confirm:

  • I can write the Q-learning and SARSA update rules from memory and explain the difference.
  • I understand why the Q-table fails for CartPole (dimensionality argument).
  • I understand the bridge exercise: fixed parameters instead of per-state entries.
  • I know what “bootstrapping” means (using current estimates as targets).

Next: Volume 3: Function Approximation & DQN