You have finished Volume 2. Before starting Volume 3, take this 10-minute review.
Volume 2 Recap Quiz
Q1. What is the TD error and why is it useful?
Q2. What makes Q-learning off-policy?
Q3. What is the key advantage of TD methods over Monte Carlo?
Q4. What is the tabular Q-table, and why does it break down for CartPole?
A tabular Q-table stores one Q(s,a) value per (state, action) pair in a dictionary or array. For a discrete gridworld with 9 states and 4 actions, the table has 36 entries — manageable.
For CartPole, the state is [cart position, cart velocity, pole angle, pole angular velocity]: four continuous values. Even coarsely discretizing each into 10 bins gives 10^4 = 10,000 states × 2 actions = 20,000 entries. A more realistic discretization (100 bins per dimension) gives 100^4 = 10^8 states, or 2 × 10^8 entries. The table grows exponentially with state dimension (the curse of dimensionality).
Q5. What does it mean to 'generalize' in RL, and why can't tabular methods do it?
What Changes in Volume 3
| | Volume 2 (Tabular) | Volume 3 (Function Approximation) |
|---|---|---|
| State representation | Discrete index into table | Feature vector / raw pixels |
| Value storage | Q-table (one entry per state-action) | Neural network weights |
| State space | Small, discrete | Large, continuous, or image-based |
| Generalization | None — each state independent | Yes — similar inputs → similar outputs |
| Key algorithms | SARSA, Q-learning, n-step | Linear FA, DQN, Double DQN, Dueling DQN |
| Key challenge | Curse of dimensionality | Training stability (deadly triad) |
The key insight: Replace the Q-table Q(s,a) with a parametric function Q(s,a; θ) — a neural network. The weights θ are shared across all states, enabling generalization. The update rule becomes a gradient descent step instead of a table lookup.
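Written out, that gradient step is the standard semi-gradient Q-learning update (the bracketed term is the same TD error from Q1, now computed through the network):

```latex
\theta \leftarrow \theta + \alpha \Big[ r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \Big] \, \nabla_\theta Q(s, a; \theta)
```

In the tabular case, ∇_θ Q(s,a; θ) is 1 for the single entry (s,a) and 0 everywhere else, so this reduces exactly to the table update you already know.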
Bridge Exercise: From Q-table to Q-network
First, see how the Q-table explodes for continuous states:
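A minimal sketch of the explosion (the bounds and bin counts here are illustrative assumptions, not from a specific exercise file):

```python
import numpy as np

N_ACTIONS = 2
STATE_DIMS = 4  # cart position, cart velocity, pole angle, pole angular velocity

def discretize(state, low, high, bins):
    """Map a continuous state to a tuple of bin indices (one Q-table key)."""
    ratios = (np.asarray(state) - low) / (high - low)
    idx = (ratios * bins).astype(int)
    return tuple(np.clip(idx, 0, bins - 1))

# Assumed CartPole-like state bounds for the demo
low  = np.array([-2.4, -3.0, -0.21, -3.0])
high = np.array([ 2.4,  3.0,  0.21,  3.0])

# Table size blows up as the discretization gets finer
for bins in (10, 20, 100):
    entries = (bins ** STATE_DIMS) * N_ACTIONS
    print(f"{bins} bins/dim -> {entries:,} Q-table entries")

# Yet each concrete state still touches exactly one entry, so nothing
# learned at one key transfers to its neighbors.
key = discretize([0.1, -0.5, 0.02, 1.0], low, high, bins=10)
print(key)  # -> (5, 4, 5, 6)
```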
Now see the neural network alternative:
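A plain-NumPy sketch of the alternative (the layer sizes and initialization are illustrative assumptions): a tiny two-layer MLP maps the 4-dimensional state to 2 action values using a fixed number of weights, no matter how finely the state varies.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, HIDDEN, N_ACTIONS = 4, 32, 2

# The parameters θ, shared across ALL states (this replaces the table)
W1 = rng.normal(0.0, 0.1, (STATE_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.1, (HIDDEN, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)

def q_values(state):
    """Q(s, .; θ): one forward pass returns a value for every action."""
    h = np.maximum(0.0, state @ W1 + b1)  # ReLU hidden layer
    return h @ W2 + b2

n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)  # 4*32 + 32 + 32*2 + 2 = 226 parameters

s = np.array([0.1, -0.5, 0.02, 1.0])
print(q_values(s).shape)  # (2,): one Q-value per action
```

226 shared parameters versus 20,000+ table entries, and because nearby inputs flow through the same weights, similar states automatically get similar Q-values: the generalization a table cannot provide.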
What changed
Ready for Volume 3?
Before continuing, confirm:
- I can write the Q-learning and SARSA update rules from memory and explain the difference.
- I understand why the Q-table fails for CartPole (dimensionality argument).
- I understand the bridge exercise: fixed parameters instead of per-state entries.
- I know what “bootstrapping” means (using current estimates as targets).