Volume 7 Recap Quiz (5 questions)

Q1. How does ICM (Intrinsic Curiosity Module) generate an intrinsic reward?
ICM has a feature encoder φ and two models on top of it: (1) a forward model that predicts the next feature vector φ(s_{t+1}) given (φ(s_t), a_t); (2) an inverse model that predicts the action a_t from (φ(s_t), φ(s_{t+1})). The intrinsic reward is the forward model's prediction error: r_i = ||φ̂(s_{t+1}) − φ(s_{t+1})||². The agent is “curious” about states where its forward model is surprised. The inverse model shapes φ to capture agent-controllable aspects of the environment while ignoring uncontrollable noise like TV static.
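A minimal sketch of the intrinsic-reward computation, assuming a toy setup where φ is a fixed linear feature map and the forward model is a linear predictor over (features, one-hot action); all names and dimensions here are illustrative, not ICM's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: linear feature encoder and linear forward model.
STATE_DIM, FEAT_DIM, N_ACTIONS = 8, 4, 3
phi_W = rng.normal(size=(FEAT_DIM, STATE_DIM))                   # encoder φ
fwd_W = rng.normal(size=(FEAT_DIM, FEAT_DIM + N_ACTIONS)) * 0.1  # forward model

def phi(s):
    return phi_W @ s

def intrinsic_reward(s_t, a_t, s_next):
    """r_i = ||φ̂(s_{t+1}) − φ(s_{t+1})||²  (forward-model surprise)."""
    a_onehot = np.eye(N_ACTIONS)[a_t]
    pred = fwd_W @ np.concatenate([phi(s_t), a_onehot])  # φ̂(s_{t+1})
    return float(np.sum((pred - phi(s_next)) ** 2))

s_t, s_next = rng.normal(size=STATE_DIM), rng.normal(size=STATE_DIM)
r_i = intrinsic_reward(s_t, a_t=1, s_next=s_next)
print(f"intrinsic reward: {r_i:.3f}")  # large when the forward model is surprised
```

In the real ICM both models are neural networks trained jointly, and the inverse model's loss shapes φ; the reward formula itself is exactly the squared error above.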
Q2. What problem does RND (Random Network Distillation) solve that ICM does not?
ICM can get “distracted” by stochastic or uncontrollable parts of the environment (the “noisy TV” problem — a random TV always produces high prediction error). RND avoids this: it trains a small network to predict the output of a fixed random target network on each state. Novel states produce high prediction error; visited states produce low error. Because the target is deterministic, stochastic noise doesn’t create spuriously high novelty.
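A sketch of the RND mechanism under a simplified assumption that both the frozen target and the predictor are linear maps (real RND uses convolutional networks; everything else here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy RND: frozen random target vs. trainable predictor.
STATE_DIM, OUT_DIM = 8, 4
target_W = rng.normal(size=(OUT_DIM, STATE_DIM))  # fixed random target network
pred_W = np.zeros((OUT_DIM, STATE_DIM))           # trainable predictor

def novelty(s):
    """Intrinsic bonus = prediction error against the fixed target."""
    return float(np.sum((pred_W @ s - target_W @ s) ** 2))

def train_step(s, lr=0.01):
    """One SGD step pulling the predictor toward the target on state s."""
    global pred_W
    err = pred_W @ s - target_W @ s
    pred_W -= lr * np.outer(err, s)

s = rng.normal(size=STATE_DIM)
before = novelty(s)
for _ in range(500):
    train_step(s)
after = novelty(s)
print(f"novelty before: {before:.3f}, after training: {after:.3f}")
```

Because the target is a deterministic function of the state, repeated visits drive the error toward zero, while a noisy TV cannot keep the error high forever: the predictor can still learn the target's (deterministic) output on each frame.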
Q3. What is the two-phase approach of Go-Explore?
Phase 1 (Explore): maintain an archive of interesting states; repeatedly return to a promising archived state (ignoring any learned policy — direct reset) and explore randomly from there. This detaches exploration from the current policy. Phase 2 (Robustify): once a path to the goal is found, train a policy to reliably follow it using imitation learning. This separation allows finding solutions to hard games that confound end-to-end RL.
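The Phase 1 loop can be sketched on a toy problem. This is a deliberately simplified illustration (a 1-D chain with a hypothetical "cell = position" abstraction and direct resets), not the full Go-Explore algorithm:

```python
import random

random.seed(0)

# Hypothetical toy Phase 1 (Explore): reach position 20 on a 1-D chain.
GOAL = 20

def step(pos, action):  # deterministic toy dynamics
    return max(0, pos + action)

# Archive maps each discovered cell to a trajectory (action list) reaching it.
archive = {0: []}

for _ in range(2000):
    # 1. Select a promising archived cell (here: simply the furthest one).
    pos = max(archive)
    traj = list(archive[pos])
    # 2. "Return" via direct reset (no learned policy), then explore randomly.
    for _ in range(5):
        a = random.choice([-1, 1])
        pos = step(pos, a)
        traj.append(a)
        if pos not in archive:  # 3. Archive newly discovered cells.
            archive[pos] = list(traj)
    if GOAL in archive:
        break

print(f"found goal; stored trajectory has {len(archive[GOAL])} actions")
```

Phase 2 (Robustify) would then train a policy, e.g. by imitation learning on `archive[GOAL]`, so the solution survives stochasticity without the direct-reset trick.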
Q4. What does count-based exploration do, and why doesn't it scale?
Count-based methods maintain N(s) — the visit count for each state — and add bonus r_+ = β / √N(s). Well-visited states get small bonuses; novel states get large bonuses. This works well in small tabular MDPs but fails in large or continuous state spaces where almost every state is unique (N(s)=1 always). Pseudo-counts and density models extend this idea to high-dimensional spaces.
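The tabular version of the bonus is a few lines. A minimal sketch, assuming hashable states and the β/√N(s) bonus from above:

```python
import math
from collections import defaultdict

# Count-based exploration bonus: r_+ = β / √N(s).
BETA = 0.5
counts = defaultdict(int)

def exploration_bonus(state):
    counts[state] += 1
    return BETA / math.sqrt(counts[state])

print(exploration_bonus("s0"))  # 0.5    (first visit: large bonus)
print(exploration_bonus("s0"))  # ≈0.354 (bonus shrinks with repeat visits)
print(exploration_bonus("s1"))  # 0.5    (novel state: large again)
```

In a continuous state space the `counts` dictionary would record every state exactly once, making every bonus β, which is why pseudo-counts replace the dictionary with a learned density model.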
Q5. What is MAML (Model-Agnostic Meta-Learning) and what is its goal in RL?
MAML finds an initial parameter θ such that a few gradient steps on a new task yield good performance. It explicitly optimises for fast adaptability: θ* = argmin_θ E_τ[L_τ(θ − α ∇_θ L_τ(θ, D_τ))]. In RL, each “task” is a different environment configuration. At test time, the agent can adapt to a new task with just 1–3 episodes, unlike standard RL, which trains from scratch per environment.
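The inner/outer loop structure can be sketched on a toy problem. This uses the simplified first-order variant (FOMAML, which drops the second-order term) on hypothetical 1-D regression tasks y = w·x with a scalar parameter θ; everything here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tasks: y = w·x, task-specific w; model: a single scalar θ.
ALPHA, META_LR = 0.3, 0.05  # inner (adaptation) and outer (meta) step sizes

def task_loss(theta, w, xs):
    return np.mean((theta * xs - w * xs) ** 2)

def grad(theta, w, xs):
    """∇_θ of the task loss above."""
    return np.mean(2 * (theta * xs - w * xs) * xs)

theta = 3.0
for _ in range(500):                # meta-training loop
    w = rng.uniform(-2, 2)          # sample a task
    xs = rng.normal(size=50)
    theta_adapted = theta - ALPHA * grad(theta, w, xs)  # inner step
    theta -= META_LR * grad(theta_adapted, w, xs)       # FOMAML outer step

# Test time: adapt to an unseen task with a single gradient step.
w_new, xs = 2.0, rng.normal(size=50)
before = task_loss(theta, w_new, xs)
after = task_loss(theta - ALPHA * grad(theta, w_new, xs), w_new, xs)
print(f"loss before adaptation: {before:.3f}, after one step: {after:.3f}")
```

Full MAML differentiates through the inner step (a gradient of a gradient); in RL the task loss is a policy-gradient objective estimated from the 1–3 adaptation episodes.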

What Changes in Volume 8

| | Volume 7 (Online RL — Interactive) | Volume 8 (Offline RL — Fixed Dataset) |
|---|---|---|
| Data collection | Agent interacts with environment | Dataset is fixed — no new interaction |
| Exploration | Core challenge | Not possible (fixed data) |
| Distribution shift | Manageable (agent controls policy) | Critical — OOD actions can be catastrophic |
| Key risk | Getting stuck in local optima | Overestimating Q for unvisited (s,a) |
| Key methods | ICM, RND, Go-Explore | BCQ, CQL, Decision Transformer, IRL |

The big insight: Sometimes you have a large logged dataset (from humans or previous policies) but cannot run new experiments — medical devices, autonomous vehicles, expensive robots. Offline RL learns from this fixed data without any environment interaction. The challenge: the Q-function may wildly overestimate value for actions never seen in the dataset (“out-of-distribution” actions).


Bridge Exercise: The Offline RL Distribution Shift Problem

Try it — edit and run (Shift+Enter)
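A starting-point sketch of the distribution shift problem, assuming a hypothetical single-state bandit with 10 actions where the logged dataset only covers actions 0–4 and the "Q-function" is a fitted polynomial (both choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bandit: true Q-values for 10 actions; dataset covers only 0-4.
N_ACTIONS = 10
true_q = rng.normal(size=N_ACTIONS)
covered = np.arange(5)  # actions present in the logged dataset

# Fit a degree-3 polynomial "Q-function" on noisy returns of covered actions.
returns = true_q[covered] + rng.normal(scale=0.1, size=covered.size)
coeffs = np.polyfit(covered, returns, deg=3)
q_hat = np.polyval(coeffs, np.arange(N_ACTIONS))

in_dist_err = np.abs(q_hat[:5] - true_q[:5]).max()
ood_err = np.abs(q_hat[5:] - true_q[5:]).max()
print(f"max error on covered actions: {in_dist_err:.2f}")
print(f"max error on OOD actions:     {ood_err:.2f}")
# Greedy action selection on q_hat can pick an OOD action whose value is
# pure extrapolation error: the failure mode BCQ and CQL constrain against.
```

Things to try: increase the polynomial degree, shrink the covered set, or add a pessimism penalty on actions far from the data and watch the greedy choice move back in-distribution.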

Next: Volume 8: Offline RL & Imitation Learning