Chapter 57: Dreamer and Latent Imagination

Learning objectives

- Implement a simplified Dreamer-style algorithm: train an RSSM-like model on collected trajectories, then roll out in latent space to train an actor-critic.
- Understand the imagination phase: no real env steps; only latent rollouts for policy updates.
- Relate to robot control and sample-efficient RL.

Concept and real-world RL

Dreamer learns a recurrent state-space model (RSSM) in latent space: encode the observation into a latent, predict the next latent given the action, and predict the reward and a continue (episode-not-done) signal. The actor-critic is trained on imagined rollouts (latent only), so many gradient steps use no real env interaction. In robot navigation and game AI, this yields high sample efficiency. The key is training the model and the policy on the same data so the latent space is useful for control. ...
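The imagination phase can be sketched with stand-in functions. The linear `encode`, `transition`, and `reward` below are toy assumptions for illustration, not a trained RSSM; the point is that `imagine` never calls the environment:

```python
# Toy stand-in for an RSSM: hypothetical linear latent dynamics and reward.
# These functions are illustrative assumptions, not Dreamer's learned networks.

def encode(obs):
    return 0.5 * obs          # encoder: observation -> latent

def transition(z, a):
    return 0.9 * z + 0.1 * a  # predicted next latent given action

def reward(z):
    return -abs(z)            # predicted reward from latent

def imagine(obs, policy, horizon=15, gamma=0.99):
    """Roll out entirely in latent space -- no environment steps."""
    z = encode(obs)
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(z)
        z = transition(z, a)
        ret += discount * reward(z)
        discount *= gamma
    return ret

# A trivial policy that pushes the latent toward zero.
ret = imagine(obs=2.0, policy=lambda z: -z)
```

In real Dreamer, the actor and critic would be updated by backpropagating through many such imagined rollouts.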

March 10, 2026 · 3 min · 464 words · codefrydev

Chapter 58: Model-Based Policy Optimization (MBPO)

Learning objectives

- Implement MBPO: learn an ensemble of dynamics models, generate short rollouts from real states, add imagined transitions to the replay buffer, and train SAC on the combined buffer.
- Compare sample efficiency with SAC alone (same number of real env steps).
- Explain why short rollouts (e.g. 1–5 steps) help avoid compounding error.

Concept and real-world RL

MBPO (Model-Based Policy Optimization) uses learned dynamics to augment the replay buffer: from a real state, roll out the model for a few steps and add the (s, a, r, s’) transitions to the buffer. SAC (or another off-policy method) then trains on real + imagined data. Short rollouts keep model error manageable. In robot control and trading, MBPO can significantly reduce the number of real steps needed to reach good performance. ...
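The buffer-augmentation loop can be sketched in a few lines; the 1-D `true_step` and noisy `model_step` below are made-up toy dynamics standing in for the environment and the learned ensemble:

```python
import random

random.seed(0)

def true_step(s, a):            # "real" environment (toy, for illustration)
    s2 = s + a
    return s2, -abs(s2)

def model_step(s, a):           # learned model with small error (assumed)
    s2 = s + a + random.gauss(0, 0.01)
    return s2, -abs(s2)

real_buffer, model_buffer = [], []
s = 0.0
for _ in range(20):             # collect real transitions
    a = random.uniform(-1, 1)
    s2, r = true_step(s, a)
    real_buffer.append((s, a, r, s2))
    s = s2

K = 3                           # short rollouts (1-5 steps) keep error small
for (s0, _, _, _) in real_buffer:
    s = s0
    for _ in range(K):          # branch a short model rollout from a real state
        a = random.uniform(-1, 1)
        s2, r = model_step(s, a)
        model_buffer.append((s, a, r, s2))
        s = s2

combined = real_buffer + model_buffer   # SAC would sample minibatches from this
```

Because each model rollout starts from a *real* state, the imagined data stays close to the true state distribution as long as K is small.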

March 10, 2026 · 3 min · 475 words · codefrydev

Chapter 59: Probabilistic Ensembles with Trajectory Sampling (PETS)

Learning objectives

- Implement PETS: an ensemble of probabilistic dynamics models (e.g. each outputting a mean and variance) and trajectory sampling (e.g. random shooting or CEM) to select actions via model predictive control (MPC).
- Use the model to evaluate action sequences and pick the best (no policy network).
- Apply to a continuous control task and compare with a policy-based method.

Concept and real-world RL

PETS uses an ensemble of probabilistic models to capture uncertainty; at each step it samples many action sequences, rolls them out in the model, and chooses the sequence with the best predicted return (MPC). No policy network is trained; action selection is planning at test time. In robot control, MPC with learned models is used when we can afford computation at deployment; in trading, short-horizon planning with a learned model can improve decisions. ...
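A minimal random-shooting planner over a toy probabilistic ensemble might look like this. The three one-dimensional ensemble members and the stay-near-zero reward are illustrative assumptions (PETS proper uses neural networks and CEM):

```python
import random

random.seed(0)

def make_member(bias):
    """One toy 'probabilistic' model: predicts a mean, samples with variance."""
    def step(s, a):
        mean = s + a + bias
        return random.gauss(mean, 0.05)
    return step

ensemble = [make_member(b) for b in (-0.02, 0.0, 0.02)]

def evaluate(s0, actions):
    """Average predicted return of an action sequence across the ensemble."""
    total = 0.0
    for step in ensemble:
        s, ret = s0, 0.0
        for a in actions:
            s = step(s, a)
            ret += -abs(s)            # toy reward: stay near zero
        total += ret
    return total / len(ensemble)

def plan(s0, horizon=5, n_candidates=200):
    """Random shooting: sample sequences, keep the one with the best return."""
    best_seq = max(
        ([random.uniform(-1, 1) for _ in range(horizon)]
         for _ in range(n_candidates)),
        key=lambda seq: evaluate(s0, seq),
    )
    return best_seq[0]                # MPC: execute only the first action

a0 = plan(s0=1.0)
```

Replanning at every step (keeping only the first action) is what makes this MPC rather than open-loop control.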

March 10, 2026 · 3 min · 494 words · codefrydev

Chapter 60: Visualizing Model-Based Rollouts

Learning objectives

- For a learned dynamics model (e.g. from Chapter 52), sample a starting state and generate a rollout of predicted states for a fixed action sequence.
- Plot the true states (from the environment) and the predicted states (from the model) on the same axes to visualize compounding error.
- Interpret the plot: where does the model diverge from reality?

Concept and real-world RL

Visualizing model rollouts vs real rollouts makes compounding error concrete: small 1-step errors accumulate and the predicted trajectory drifts. In robot navigation and model-based RL, this motivates short rollouts, ensemble methods, and uncertainty-aware planning. The same idea applies to trading models (predictions diverge over time) and dialogue (conversation dynamics). ...
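The drift can be computed before it is plotted. Here the "model" is a toy linear system with a deliberate small per-step bias (an assumption for illustration); the open-loop error grows with the horizon:

```python
# Compare a true trajectory with an open-loop model rollout.
# The model's per-step bias (0.12 vs 0.10) is an assumed toy error.

def true_step(s):
    return 0.99 * s + 0.10

def model_step(s):
    return 0.99 * s + 0.12   # small, systematic 1-step error

s_true = s_model = 1.0
errors = []
for t in range(50):
    s_true = true_step(s_true)
    s_model = model_step(s_model)   # open loop: never corrected by real data
    errors.append(abs(s_model - s_true))

# errors[t] grows with t; with matplotlib you would plot s_true and s_model
# on the same axes to see the predicted trajectory drift away.
```

The same loop, run against a real environment and a learned model, produces the two curves the chapter asks you to overlay.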

March 10, 2026 · 3 min · 466 words · codefrydev

Chapter 61: The Hard Exploration Problem

Learning objectives

- Run DQN with ε-greedy on a sparse-reward environment (e.g. Montezuma’s Revenge if available, or a simple maze).
- Observe that the agent rarely discovers the first key (or goal) when rewards are sparse.
- Explain why sparse rewards cause failure: there is no learning signal until the goal is reached, and random exploration is unlikely to reach it.

Concept and real-world RL

Hard exploration occurs when the reward is sparse (e.g. only at the goal): the agent gets no signal until it accidentally reaches the goal, which may require a long, specific sequence of actions. In game AI (Montezuma’s Revenge, Pitfall), ε-greedy DQN fails because random exploration almost never finds the key. In robot navigation and recommendation, sparse rewards (e.g. “user clicked” or “reached goal”) similarly make learning slow. This motivates intrinsic motivation, curiosity, and hierarchical methods. ...
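How unlikely is random discovery? A toy sparse-reward chain (an assumed stand-in for the maze) makes it measurable: with no reward gradient to follow, ε-greedy degenerates to a random walk, and the success rate is near zero:

```python
import random

random.seed(0)

# Sparse-reward chain: reward only when the agent reaches state N.
# With no intermediate signal, the policy is effectively a random walk.
N, episodes, horizon = 20, 200, 50
successes = 0
for _ in range(episodes):
    s = 0
    for _ in range(horizon):
        s = max(0, min(N, s + random.choice([-1, 1])))
        if s == N:
            successes += 1
            break

success_rate = successes / episodes   # typically very close to zero
```

Reaching state 20 within 50 steps needs roughly 35 of 50 moves to go right, a vanishingly rare event for a fair random walk; this is the quantitative face of the hard exploration problem.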

March 10, 2026 · 3 min · 489 words · codefrydev

Chapter 62: Intrinsic Motivation

Learning objectives

- Design an intrinsic reward based on state visitation count: bonus = \(1/\sqrt{\text{count}}\) (or similar) so rarely visited states are more attractive.
- Implement an agent that uses total reward = extrinsic + intrinsic and compare exploration behavior (e.g. coverage of the state space) with an agent that uses only extrinsic reward.
- Relate to curiosity and exploration in game AI and robot navigation.

Concept and real-world RL

Intrinsic motivation gives the agent a bonus for visiting novel or surprising states, so it explores even when extrinsic reward is sparse. The count-based bonus \(1/\sqrt{N(s)}\) (inverse square root of visit count) encourages visiting states that have been seen fewer times. In game AI and robot navigation, this can help discover the goal; in recommendation, novelty bonuses encourage diversity. The combination extrinsic + intrinsic balances exploitation (reward) and exploration (novelty). ...
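The pull of \(1/\sqrt{N(s)}\) can be seen on a toy chain (an assumed 31-state corridor). An agent that greedily follows the novelty bonus marches through every state, where a random walk would dawdle near the start:

```python
import math

# Count-based intrinsic reward on a 31-state chain (toy sketch).
counts = {}

def bonus(s):
    """1/sqrt(N(s)+1): rarely visited states score higher."""
    return 1.0 / math.sqrt(counts.get(s, 0) + 1)

def step_greedy(s):
    """Move to whichever neighbour currently has the larger novelty bonus."""
    candidates = [max(0, s - 1), min(30, s + 1)]
    return max(candidates, key=bonus)

s, visited = 0, set()
for _ in range(100):
    counts[s] = counts.get(s, 0) + 1
    visited.add(s)
    s = step_greedy(s)

coverage = len(visited)   # the novelty-greedy agent covers all 31 states
```

A pure random walk with the same 100-step budget typically covers only a handful of states, which is exactly the comparison the chapter asks for.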

March 10, 2026 · 3 min · 487 words · codefrydev

Chapter 63: Curiosity-Driven Exploration (ICM)

Learning objectives

- Implement the Intrinsic Curiosity Module: a forward model that predicts next-state features from the current state and action.
- Use the prediction error (between predicted and actual next features) as intrinsic reward and combine it with A2C.
- Explain why prediction error encourages exploration in novel or stochastic parts of the state space.
- Compare exploration behavior (e.g. coverage, time to goal) with and without ICM on a sparse-reward maze.
- Relate curiosity-driven exploration to robot navigation and game AI where rewards are sparse.

Concept and real-world RL ...
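The core mechanism fits in a few lines. This tabular forward model is a deliberate simplification (real ICM predicts learned features with a neural network and also trains an inverse model), but it shows why curiosity fades with familiarity:

```python
# ICM sketch: a tabular "forward model" predicts the next state; its
# prediction error is the curiosity bonus. Toy stand-in for the neural ICM.
pred = {}   # (s, a) -> predicted next state

def intrinsic_reward(s, a, s_next):
    guess = pred.get((s, a), 0.0)
    error = abs(guess - s_next)                    # surprise = prediction error
    pred[(s, a)] = guess + 0.5 * (s_next - guess)  # train the forward model
    return error

# The first visit to a transition is surprising; repeats become boring.
r1 = intrinsic_reward(0, 1, 1.0)   # 1.0
r2 = intrinsic_reward(0, 1, 1.0)   # 0.5
r3 = intrinsic_reward(0, 1, 1.0)   # 0.25
```

The decaying bonus is what pushes an ICM agent onward to transitions it cannot yet predict, which is also why purely stochastic transitions (the "noisy TV" problem) can trap it.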

March 10, 2026 · 3 min · 624 words · codefrydev

Chapter 64: Random Network Distillation (RND)

Learning objectives

- Implement RND: a fixed random target network and a predictor network that fits the target on visited states.
- Use the prediction error (target output vs predictor output) as intrinsic reward for exploration.
- Explain why RND rewards novelty without learning a forward model of the environment.
- Apply RND to a hard exploration problem (e.g. Pitfall-style or sparse-reward maze) and compare with ε-greedy or count-based exploration.
- Relate RND to game AI and robot navigation where state spaces are large and rewards sparse.

Concept and real-world RL ...
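A tabular caricature of RND (lookup tables standing in for the target and predictor networks, an illustrative assumption) shows the key property: the bonus depends only on the state, never on actions or dynamics:

```python
import random

random.seed(0)

# RND sketch: a frozen random "target" function and a learned predictor.
# Novel state -> large predictor error -> large intrinsic reward.
target = {s: random.random() for s in range(100)}   # fixed at initialization
predictor = {}                                      # trained only on visited states

def rnd_bonus(s):
    guess = predictor.get(s, 0.0)
    error = abs(target[s] - guess)
    predictor[s] = guess + 0.9 * (target[s] - guess)  # fit predictor to target
    return error

first = rnd_bonus(7)
second = rnd_bonus(7)    # familiar state: the error has shrunk
novel = rnd_bonus(8)     # unvisited state: the error is large again
```

Because the predictor only ever sees visited states, its error is an implicit visit counter; no transition model is learned, which sidesteps ICM's noisy-TV failure mode.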

March 10, 2026 · 3 min · 628 words · codefrydev

Chapter 65: Count-Based Exploration

Learning objectives

- Implement count-based exploration for discrete state spaces using a hash table and a bonus such as \(1/\sqrt{N(s)}\).
- Implement pseudo-counts from a density model (e.g. PixelCNN or a simpler density estimator) for image-based states.
- Explain why pseudo-counts are needed when the state space is huge or continuous (e.g. Atari frames).
- Test count-based and pseudo-count exploration on a simple Atari-style or image-based task and compare exploration coverage.
- Relate count-based and pseudo-count methods to game AI and recommendation (e.g. diversity).

Concept and real-world RL ...
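The hash-table version, including the coarsening step that makes it work on continuous states, can be sketched directly. Rounding is an assumed stand-in here for a real hash such as SimHash or a learned density model:

```python
import math

# Count-based bonus via a hash table; continuous states are first coarsened
# (rounding stands in for SimHash / a density model's pseudo-counts).
counts = {}

def bonus(state):
    key = round(state, 1)                # coarse hash of a continuous state
    counts[key] = counts.get(key, 0) + 1
    return 1.0 / math.sqrt(counts[key])  # 1/sqrt(N(s))

b1 = bonus(0.5012)
b2 = bonus(0.4998)   # hashes to the same bucket: counted as a revisit
b3 = bonus(2.0)      # new bucket: full novelty bonus again
```

Without the coarsening, every raw Atari frame would be unique and every count would stay at 1, which is exactly why pseudo-counts from a density model are needed at scale.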

March 10, 2026 · 4 min · 643 words · codefrydev

Chapter 66: Go-Explore Algorithm

Learning objectives

- Implement a simplified Go-Explore: an archive of promising states and a strategy to return to them and explore further.
- Explain the two-phase idea: (1) archive states that lead to high rewards or novelty, (2) select from the archive, return to that state, then take exploratory actions.
- Compare Go-Explore with random exploration (e.g. episodes to reach goal, or maximum reward reached) on a deterministic maze.
- Identify why “return” (resetting to an archived state) helps in hard exploration compared to always starting from the initial state.
- Relate Go-Explore to game AI (e.g. Montezuma’s Revenge) and robot navigation with sparse goals.

Concept and real-world RL ...
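The two phases fit on a toy deterministic chain (an assumed stand-in for the maze): "return" by replaying the archived action sequence, then "explore" with random actions, preferring rarely selected cells:

```python
import random

random.seed(0)

BURST = 10             # exploratory steps taken after each return
archive = {0: []}      # cell -> action sequence that reaches it (deterministic env)
visits = {0: 0}

def replay(actions):
    """Deterministic env: replaying the actions returns exactly to the cell."""
    s = 0
    for a in actions:
        s = max(0, s + a)
    return s

for _ in range(400):
    cell = min(archive, key=lambda c: visits[c])  # prefer rarely selected cells
    visits[cell] += 1
    actions = list(archive[cell])
    s = replay(actions)                  # phase 1, "go": return to the cell
    for _ in range(BURST):               # phase 2, "explore": random actions
        a = random.choice([-1, 1])
        actions.append(a)
        s = max(0, s + a)
        if s not in archive or len(actions) < len(archive[s]):
            archive[s] = list(actions)   # archive new or better-reached cells
            visits.setdefault(s, 0)

# Baseline: same step budget, but every episode restarts from the initial state.
best_random = 0
for _ in range(400):
    s = 0
    for _ in range(BURST):
        s = max(0, s + random.choice([-1, 1]))
        best_random = max(best_random, s)
```

The baseline can never get further than one burst from the start, while the archive lets exploration resume from the frontier; that difference is the whole point of "return".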

March 10, 2026 · 4 min · 754 words · codefrydev