Volume 6 Recap Quiz (5 questions)
Q1. What are the four phases of MCTS, and what happens in each?
- Selection: traverse the tree from root using UCB1 (balance exploit/explore) until a leaf.
- Expansion: add one or more child nodes to the leaf.
- Simulation (rollout): play out randomly (or with a fast policy) until a terminal state.
- Backpropagation: update win counts / values along the path from leaf to root.
AlphaZero replaces the random rollout with a learned value network evaluated at the leaf, eliminating phase 3.
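The selection phase above can be sketched concretely. This is a minimal illustration of the UCB1 score, not a full MCTS implementation; the node representation (dicts with `value` and `visits`) and the exploration constant `c` are assumptions for the example:

```python
import math

def ucb1(child_value, child_visits, parent_visits, c=1.414):
    """UCB1 score for the selection phase: average return plus exploration bonus."""
    if child_visits == 0:
        return float("inf")  # unvisited children are always tried first
    exploit = child_value / child_visits  # mean return observed through this child
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

def select_child(children, parent_visits):
    """Pick the child maximising UCB1 (one step of tree traversal)."""
    return max(children, key=lambda ch: ucb1(ch["value"], ch["visits"], parent_visits))
```

Note how a rarely visited child can outscore a child with a higher average return: the explore term shrinks as `child_visits` grows, which is exactly the exploit/explore balance described above.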
Q2. How does Dyna-Q combine model-free and model-based learning?
After each real interaction (s,a,r,s’), Dyna-Q: (1) updates Q directly (model-free TD), (2) updates the learned model M(s,a) → (r,s’), then (3) performs k planning steps: sample (s,a) from memory, query M, and update Q again with synthetic transitions. This reuses real experience multiple times, improving sample efficiency.
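The three-step loop can be written out directly. A minimal tabular sketch, assuming a deterministic model, two actions (0 and 1), and hand-picked `alpha`, `gamma`, and `k`; none of these choices come from the text above:

```python
import random
from collections import defaultdict

def dyna_q_update(Q, model, memory, s, a, r, s2, alpha=0.1, gamma=0.95, k=5):
    """One Dyna-Q step: direct TD update, model update, then k planning updates."""
    # (1) model-free Q-learning update from the real transition
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in (0, 1)) - Q[(s, a)])
    # (2) record the transition in the (deterministic) learned model
    model[(s, a)] = (r, s2)
    memory.append((s, a))
    # (3) k planning steps: replay stored (s, a) pairs through the model
    for _ in range(k):
        ps, pa = random.choice(memory)
        pr, ps2 = model[(ps, pa)]
        Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in (0, 1)) - Q[(ps, pa)])
```

Each real transition thus produces `1 + k` Q-updates, which is the sample-efficiency gain the answer describes.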
Q3. What is the 'compounding error' problem in model-based RL?
A learned model M(s,a) has some prediction error ε per step. Over an n-step rollout, errors compound: the agent may end up in regions of state space the model has never seen (distribution shift), producing wildly incorrect predictions. This limits how far ahead you can reliably plan with a learned model. Short rollouts (MBPO: 1–5 steps) mitigate this at the cost of reduced planning horizon.
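The compounding can be shown with a toy one-dimensional system. The dynamics (a multiplicative factor of 1.1) and the per-step model error `eps` are invented for illustration; the point is only the shape of the error curve:

```python
def rollout_error(n_steps, eps=0.05, x0=1.0):
    """Gap between a true rollout and a learned-model rollout whose
    one-step prediction is off by a small multiplicative factor eps."""
    true_x, model_x = x0, x0
    gaps = []
    for _ in range(n_steps):
        true_x = 1.1 * true_x                 # toy "true" dynamics
        model_x = 1.1 * (1 + eps) * model_x   # model is eps off at every step
        gaps.append(abs(model_x - true_x))
    return gaps

gaps = rollout_error(20)
# The gap grows with horizon: early entries stay small, later ones diverge,
# which is why MBPO-style short rollouts are safer than long ones.
```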
Q4. What is a world model, and how does Dreamer use it?
A world model learns a compressed latent representation of environment dynamics: a recurrent state model s_{t+1} ~ p(s_{t+1}|s_t, a_t), a reward model r ~ p(r|s_t), and a decoder. Dreamer trains the policy entirely inside the model’s imagination — generating multi-step latent rollouts without any real environment interaction. Real data is only used to update the world model itself.
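An imagination rollout can be sketched in a few lines. This stands in for Dreamer's learned networks with fixed random linear maps (dynamics `A`, `B` and reward head `w` are hypothetical placeholders, not trained components); what it shows is the control flow, a multi-step rollout that never touches the real environment:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in world-model components (random linear maps, not trained nets):
A = rng.normal(size=(4, 4)) * 0.3  # latent dynamics: s' = A s + B a
B = rng.normal(size=(4, 1)) * 0.3
w = rng.normal(size=4)             # reward head: r = w . s

def imagine(policy, s0, horizon=15):
    """Roll the policy forward entirely in latent space, no environment calls."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(s)
        s = A @ s + (B * a).ravel()  # latent transition model
        total += w @ s               # predicted reward from the reward model
    return total

imagined_return = imagine(lambda s: np.tanh(s.sum()), np.ones(4))
```

In Dreamer proper, gradients of `imagined_return` with respect to the policy parameters flow back through this rollout; real data only retrains `A`, `B`, and `w`.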
Q5. What is the key assumption that model-based methods make that exploration challenges?
Model-based methods assume the learned model is accurate across the state space the agent will visit. But if the agent only visits states near the start, the model is only accurate there. Exploration is needed to visit diverse states so the model generalises. Without good exploration, the agent may optimise against a model that is wildly wrong in unvisited regions — exploitation of model errors.
What Changes in Volume 7
| | Volume 6 (Model-Based, Dense Rewards) | Volume 7 (Hard Exploration, Sparse Rewards) |
|---|---|---|
| Reward signal | Assumed frequent / dense | Sparse or deceptive — rare signal |
| Exploration | ε-greedy or entropy bonus suffice | Dedicated exploration: ICM, RND, Go-Explore |
| State coverage | Incidental | Actively maximised (count-based, novelty) |
| Challenge | Model accuracy / planning horizon | Finding reward at all |
| Examples | MuJoCo locomotion | Montezuma’s Revenge, maze navigation |
The big insight: When rewards are sparse, the agent may never stumble upon a positive signal by random exploration. Intrinsic motivation — curiosity, novelty, prediction error — provides a dense internal reward signal that drives exploration independent of the task reward.
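A prediction-error bonus in the style of RND can be sketched as follows. The network shapes, learning rate, and `beta` weighting are assumptions for the example; the key property is that the bonus is large for novel states and decays for states the predictor has been trained on:

```python
import numpy as np

rng = np.random.default_rng(1)
# RND-style novelty bonus: a fixed random "target" network and a learned
# "predictor"; the predictor's error is high exactly where we have not been.
W_target = rng.normal(size=(8, 4))
W_pred = np.zeros((8, 4))  # trained to match the target on visited states

def intrinsic_reward(s):
    err = W_target @ s - W_pred @ s
    return float(err @ err)  # squared prediction error = novelty signal

def train_predictor(s, lr=0.01):
    """Regress the predictor toward the fixed target on a visited state."""
    global W_pred
    err = W_target @ s - W_pred @ s
    W_pred += lr * np.outer(err, s)

def total_reward(r_ext, s, beta=0.1):
    """Dense internal signal added to the (possibly sparse) task reward."""
    return r_ext + beta * intrinsic_reward(s)
```

Because `intrinsic_reward` is nonzero almost everywhere early in training, the agent always has a gradient to follow, even before the first task reward is ever seen.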
Bridge Exercise: How Quickly Does Random Exploration Fail?
Try it — edit and run (Shift+Enter)
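A starting point for the exercise, under assumptions of our own choosing: a 1-D chain where the only reward sits at the far end, a uniformly random agent, and a reflecting wall at state 0. Try varying `n_states` and `horizon`:

```python
import random

def random_walk_success(n_states=20, horizon=100, episodes=2000, seed=0):
    """Fraction of episodes in which a uniform-random agent reaches the
    single reward at the far end of a 1-D chain (start at 0, moves +/-1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(episodes):
        pos = 0
        for _ in range(horizon):
            pos = max(0, pos + rng.choice((-1, 1)))  # reflect at the left wall
            if pos == n_states - 1:
                hits += 1
                break
    return hits / episodes

for n in (5, 10, 20, 40):
    print(n, random_walk_success(n_states=n))
# Success rate collapses as the chain grows: a sparse reward far from the
# start is effectively invisible to undirected exploration.
```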