Volume 6 Recap Quiz (5 questions)
Q1. What are the four phases of MCTS, and what happens in each?
- Selection: traverse the tree from root using UCB1 (balance exploit/explore) until a leaf.
- Expansion: add one or more child nodes to the leaf.
- Simulation (rollout): play out randomly (or with a fast policy) until a terminal state.
- Backpropagation: update win counts / values along the path from leaf to root.
AlphaZero replaces the random rollout with a learned value network evaluated at the leaf, eliminating phase 3.
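The selection phase above can be sketched concretely. This is a minimal illustration of the UCB1 score, not a full MCTS implementation; the node representation (dicts with `value` and `visits`) and the exploration constant `c` are assumptions for the example:

```python
import math

def ucb1(child_value, child_visits, parent_visits, c=1.414):
    """UCB1 score for the selection phase: average return plus exploration bonus."""
    if child_visits == 0:
        return float("inf")  # unvisited children are always tried first
    exploit = child_value / child_visits  # mean return observed through this child
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

def select_child(children, parent_visits):
    """Pick the child maximising UCB1 (one step of tree traversal)."""
    return max(children, key=lambda ch: ucb1(ch["value"], ch["visits"], parent_visits))
```

Note how a rarely visited child can outscore a child with a higher average return: the explore term shrinks as `child_visits` grows, which is exactly the exploit/explore balance described above.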
Q2. How does Dyna-Q combine model-free and model-based learning?
After each real interaction (s,a,r,s’), Dyna-Q: (1) updates Q directly (model-free TD), (2) updates the learned model M(s,a) → (r,s’), then (3) performs k planning steps: sample (s,a) from memory, query M, and update Q again with synthetic transitions. This reuses real experience multiple times, improving sample efficiency.
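The three-step loop can be written out directly. A minimal tabular sketch, assuming a deterministic model, two actions (0 and 1), and hand-picked `alpha`, `gamma`, and `k`; none of these choices come from the text above:

```python
import random
from collections import defaultdict

def dyna_q_update(Q, model, memory, s, a, r, s2, alpha=0.1, gamma=0.95, k=5):
    """One Dyna-Q step: direct TD update, model update, then k planning updates."""
    # (1) model-free Q-learning update from the real transition
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in (0, 1)) - Q[(s, a)])
    # (2) record the transition in the (deterministic) learned model
    model[(s, a)] = (r, s2)
    memory.append((s, a))
    # (3) k planning steps: replay stored (s, a) pairs through the model
    for _ in range(k):
        ps, pa = random.choice(memory)
        pr, ps2 = model[(ps, pa)]
        Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in (0, 1)) - Q[(ps, pa)])
```

Each real transition thus produces `1 + k` Q-updates, which is the sample-efficiency gain the answer describes.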
Q3. What is the 'compounding error' problem in model-based RL?
A learned model M(s,a) has some prediction error ε per step. Over an n-step rollout, errors compound: the agent may end up in regions of state space the model has never seen (distribution shift), producing wildly incorrect predictions. This limits how far ahead you can reliably plan with a learned model. Short rollouts (MBPO: 1–5 steps) mitigate this at the cost of reduced planning horizon.
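The compounding can be shown with a toy one-dimensional system. The dynamics (a multiplicative factor of 1.1) and the per-step model error `eps` are invented for illustration; the point is only the shape of the error curve:

```python
def rollout_error(n_steps, eps=0.05, x0=1.0):
    """Gap between a true rollout and a learned-model rollout whose
    one-step prediction is off by a small multiplicative factor eps."""
    true_x, model_x = x0, x0
    gaps = []
    for _ in range(n_steps):
        true_x = 1.1 * true_x                 # toy "true" dynamics
        model_x = 1.1 * (1 + eps) * model_x   # model is eps off at every step
        gaps.append(abs(model_x - true_x))
    return gaps

gaps = rollout_error(20)
# The gap grows with horizon: early entries stay small, later ones diverge,
# which is why MBPO-style short rollouts are safer than long ones.
```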
Q4. What is a world model, and how does Dreamer use it?
A world model learns a compressed latent representation of environment dynamics: a recurrent state model s_{t+1} ~ p(s_{t+1}|s_t, a_t), a reward model r ~ p(r|s_t), and a decoder. Dreamer trains the policy entirely inside the model’s imagination — generating multi-step latent rollouts without any real environment interaction. Real data is only used to update the world model itself.
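An imagination rollout can be sketched in a few lines. This stands in for Dreamer's learned networks with fixed random linear maps (dynamics `A`, `B` and reward head `w` are hypothetical placeholders, not trained components); what it shows is the control flow, a multi-step rollout that never touches the real environment:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in world-model components (random linear maps, not trained nets):
A = rng.normal(size=(4, 4)) * 0.3  # latent dynamics: s' = A s + B a
B = rng.normal(size=(4, 1)) * 0.3
w = rng.normal(size=4)             # reward head: r = w . s

def imagine(policy, s0, horizon=15):
    """Roll the policy forward entirely in latent space, no environment calls."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(s)
        s = A @ s + (B * a).ravel()  # latent transition model
        total += w @ s               # predicted reward from the reward model
    return total

imagined_return = imagine(lambda s: np.tanh(s.sum()), np.ones(4))
```

In Dreamer proper, gradients of `imagined_return` with respect to the policy parameters flow back through this rollout; real data only retrains `A`, `B`, and `w`.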
Q5. What is the key assumption that model-based methods make that exploration challenges?
Model-based methods assume the learned model is accurate across the state space the agent will visit. But if the agent only visits states near the start, the model is only accurate there. Exploration is needed to visit diverse states so the model generalises. Without good exploration, the agent may optimise against a model that is wildly wrong in unvisited regions — exploitation of model errors.
What Changes in Volume 7
| | Volume 6 (Model-Based, Dense Rewards) | Volume 7 (Hard Exploration, Sparse Rewards) |
|---|---|---|
| Reward signal | Assumed frequent / dense | Sparse or deceptive — rare signal |
| Exploration | ε-greedy or entropy bonus suffice | Dedicated exploration: ICM, RND, Go-Explore |
| State coverage | Incidental | Actively maximised (count-based, novelty) |
| Challenge | Model accuracy / planning horizon | Finding reward at all |
| Examples | MuJoCo locomotion | Montezuma’s Revenge, maze navigation |
The big insight: When rewards are sparse, the agent may never stumble upon a positive signal by random exploration. Intrinsic motivation — curiosity, novelty, prediction error — provides a dense internal reward signal that drives exploration independent of the task reward.
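A prediction-error bonus in the style of RND can be sketched as follows. The network shapes, learning rate, and `beta` weighting are assumptions for the example; the key property is that the bonus is large for novel states and decays for states the predictor has been trained on:

```python
import numpy as np

rng = np.random.default_rng(1)
# RND-style novelty bonus: a fixed random "target" network and a learned
# "predictor"; the predictor's error is high exactly where we have not been.
W_target = rng.normal(size=(8, 4))
W_pred = np.zeros((8, 4))  # trained to match the target on visited states

def intrinsic_reward(s):
    err = W_target @ s - W_pred @ s
    return float(err @ err)  # squared prediction error = novelty signal

def train_predictor(s, lr=0.01):
    """Regress the predictor toward the fixed target on a visited state."""
    global W_pred
    err = W_target @ s - W_pred @ s
    W_pred += lr * np.outer(err, s)

def total_reward(r_ext, s, beta=0.1):
    """Dense internal signal added to the (possibly sparse) task reward."""
    return r_ext + beta * intrinsic_reward(s)
```

Because `intrinsic_reward` is nonzero almost everywhere early in training, the agent always has a gradient to follow, even before the first task reward is ever seen.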
Bridge Exercise: How Quickly Does Random Exploration Fail?
Try it — edit and run (Shift+Enter)
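A starting point for the exercise, under assumptions of our own choosing: a 1-D chain where the only reward sits at the far end, a uniformly random agent, and a reflecting wall at state 0. Try varying `n_states` and `horizon`:

```python
import random

def random_walk_success(n_states=20, horizon=100, episodes=2000, seed=0):
    """Fraction of episodes in which a uniform-random agent reaches the
    single reward at the far end of a 1-D chain (start at 0, moves +/-1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(episodes):
        pos = 0
        for _ in range(horizon):
            pos = max(0, pos + rng.choice((-1, 1)))  # reflect at the left wall
            if pos == n_states - 1:
                hits += 1
                break
    return hits / episodes

for n in (5, 10, 20, 40):
    print(n, random_walk_success(n_states=n))
# Success rate collapses as the chain grows: a sparse reward far from the
# start is effectively invisible to undirected exploration.
```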