You have finished Volume 1. Before starting Volume 2, take this 10-minute review.
## Volume 1 Recap Quiz
Q1. What are the three components of the RL framework?
Q2. What assumption does Dynamic Programming require that Monte Carlo does not?
Q3. What is the difference between policy evaluation and value iteration?
Q4. Why can't we just run value iteration on a real robot?
Q5. What is the main limitation of tabular Dynamic Programming?
## What Changes in Volume 2
| | Volume 1 (DP) | Volume 2 (Model-free) |
|---|---|---|
| Model required? | Yes | No |
| How values are estimated | Exact Bellman sweeps | Sampled episodes / transitions |
| Convergence | Exact (given model) | In expectation (with enough data) |
| Key algorithms | Policy eval, Policy iter, Value iter | Monte Carlo, TD(0), SARSA, Q-learning |
| Bootstrapping | Yes (full backup) | Monte Carlo: No. TD: Yes |
The big insight: Monte Carlo replaces the expectation over transitions with the sample return from one episode. TD methods go further — they bootstrap (use current estimates) so they can update after every step, not just at the end of an episode.
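To make the contrast concrete, here is a minimal sketch of both update rules on one hand-built episode. The three-state chain, the step size `alpha`, and all values are illustrative, not code from Volume 1:

```python
# Hand-built episode: s0 -> s1 -> s2 (terminal), reward 1.0 on the final move.
alpha, gamma = 0.5, 1.0
V_mc = {0: 0.0, 1: 0.0, 2: 0.0}
V_td = {0: 0.0, 1: 0.0, 2: 0.0}
episode = [(0, 0.0, 1), (1, 1.0, 2)]   # (state, reward, next_state)

# Monte Carlo: wait until the episode ends, update toward the full return G.
G = 0.0
for s, r, _ in reversed(episode):
    G = r + gamma * G
    V_mc[s] += alpha * (G - V_mc[s])

# TD(0): update after every step, bootstrapping on the current V[next_state].
for s, r, s_next in episode:
    V_td[s] += alpha * (r + gamma * V_td[s_next] - V_td[s])

print(V_mc)   # {0: 0.5, 1: 0.5, 2: 0.0} -- both states see the full return
print(V_td)   # {0: 0.0, 1: 0.5, 2: 0.0} -- s0 bootstrapped off V[1] == 0
```

Note how TD(0) leaves `V_td[0]` unchanged: when state 0 was updated, its bootstrap target `V_td[1]` was still zero. The information propagates backward one step per episode, which is the price of updating online.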
## Bridge Exercise
You implemented policy evaluation using the known transition model. Now imagine you don’t have the model — you can only run episodes.
Modify the following to use sample episodes instead of Bellman sweeps:
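A representative model-based policy-evaluation loop of the kind the exercise refers to. The two-state chain, `P`, `R`, and the convergence threshold here are illustrative stand-ins, not the actual Volume 1 environment:

```python
import numpy as np

# Illustrative MDP with the policy already folded in: from state 0, stay with
# prob 0.9 or move to absorbing state 1 with prob 0.1, earning reward 1 on
# that move. R[s] is the expected immediate reward from state s (0.1 * 1).
P = np.array([[0.9, 0.1],    # P[s, s'] under the fixed policy
              [0.0, 1.0]])   # state 1 is absorbing
R = np.array([0.1, 0.0])
gamma = 0.9

V = np.zeros(2)
while True:
    V_new = R + gamma * P @ V           # full Bellman sweep: needs P and R
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(V[0])   # converges to 0.1 / (1 - 0.81), about 0.526
```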
### Solution
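A minimal Monte Carlo prediction sketch of the solution, using the same illustrative two-state chain. The key point: the agent never reads the transition probabilities, it only experiences them through sampled episodes:

```python
import random

random.seed(0)   # fixed seed so the run is reproducible
gamma = 0.9

def run_episode():
    """Roll out the fixed policy from state 0; return a list of (state, reward)."""
    s, traj = 0, []
    while s != 1:
        s_next = 1 if random.random() < 0.1 else 0   # hidden dynamics
        traj.append((s, 1.0 if s_next == 1 else 0.0))
        s = s_next
    return traj

returns = {0: []}
for _ in range(5000):
    G = 0.0
    for s, r in reversed(run_episode()):
        G = r + gamma * G          # discounted return from this step onward
        returns[s].append(G)       # every-visit Monte Carlo

V0 = sum(returns[0]) / len(returns[0])
print(V0)   # close to the exact value 0.1 / 0.19, about 0.526
```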
What changed: Instead of computing V using transition probabilities, you sampled actual trajectories and averaged returns. That is Monte Carlo prediction — the first model-free method in Volume 2.
## Ready for Volume 2?
Before continuing, confirm:
- I can write the Bellman equation from memory.
- I understand why DP needs a model.
- I implemented policy evaluation (or followed the code closely).
- I understand the bridge exercise above — averaging sample returns to estimate V.
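For the first checklist item, the target is the standard Bellman expectation equation for the state-value function under a policy $\pi$:

$$
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr]
$$

If you can write this down and explain why evaluating it exactly requires $p(s', r \mid s, a)$, the second checklist item follows immediately.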