Chapter 67: Meta-Learning (Learning to Learn)

Learning objectives

- Define a distribution of tasks (e.g. different goal positions in a gridworld) and sample tasks for meta-training.
- Implement a meta-training loop: for each task, collect data or run a few steps of adaptation, then update the meta-policy or meta-parameters to improve few-shot performance.
- Explain the goal of meta-RL: learn an initialization or algorithm that adapts quickly to new tasks with few gradient steps or few episodes.
- Evaluate the meta-learned policy on held-out tasks with limited data and compare with training from scratch.
- Relate meta-RL to robot navigation (different goals or terrains) and game AI (different levels or opponents).

Concept and real-world RL ...
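The task-sampling and meta-training loop can be sketched with a tabular gridworld and a Reptile-style first-order meta-update (a simplification standing in for the chapter's meta-RL scheme; the grid size, goal distribution, and all step counts and learning rates below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5  # 5x5 gridworld; tasks differ only by goal position

def sample_task():
    # a task is a goal cell; reward 1 at the goal, 0 elsewhere
    return tuple(rng.integers(0, N, size=2))

def adapt(Q, goal, steps=200, alpha=0.5, gamma=0.9, eps=0.3):
    """Few steps of tabular Q-learning on one task, starting from the meta-init."""
    Q = Q.copy()
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    s = (0, 0)
    for _ in range(steps):
        a = rng.integers(4) if rng.random() < eps else int(np.argmax(Q[s]))
        ns = (min(max(s[0] + moves[a][0], 0), N - 1),
              min(max(s[1] + moves[a][1], 0), N - 1))
        r = 1.0 if ns == goal else 0.0
        Q[s][a] += alpha * (r + gamma * np.max(Q[ns]) - Q[s][a])
        s = (0, 0) if ns == goal else ns
    return Q

# Meta-training loop: sample a task, adapt on it, then pull the
# meta-initialization toward the post-adaptation parameters
meta_Q = np.zeros((N, N, 4))
for _ in range(30):
    goal = sample_task()
    adapted = adapt(meta_Q, goal)
    meta_Q += 0.1 * (adapted - meta_Q)  # Reptile-style meta-update
```

The meta-init ends up easy to adapt on any goal from the distribution; evaluating on held-out goals means calling `adapt` once from `meta_Q` versus from zeros.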

March 10, 2026 · 4 min · 714 words · codefrydev

Chapter 68: Model-Agnostic Meta-Learning (MAML) in RL

Learning objectives

- Implement MAML for a simple RL task: sample tasks (e.g. different target velocities), compute the inner update (one or a few gradient steps on the task loss), then meta-update using the post-adaptation loss.
- Compute the meta-gradient (gradient of the post-adaptation return or loss w.r.t. the initial parameters), using second-order derivatives or a first-order approximation.
- Explain why MAML learns an initialization that is “easy to fine-tune” with one or a few gradient steps.
- Train a policy that adapts in one gradient step to a new task and evaluate on held-out tasks.
- Relate MAML to robot navigation (e.g. different terrains or payloads) and game AI (different levels).

Concept and real-world RL ...
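A minimal first-order MAML (FOMAML) sketch, using a per-task quadratic loss as a stand-in for the RL objective — the targets play the role of per-task target velocities, and all step sizes are illustrative assumptions, not the chapter's actual hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
targets = rng.uniform(0.0, 2.0, size=50)  # hypothetical per-task "target velocities"

def task_grad(theta, c):
    # gradient of the task loss L_c(theta) = (theta - c)^2
    return 2.0 * (theta - c)

alpha, beta = 0.1, 0.05  # inner and meta step sizes (illustrative)
theta = 5.0              # the meta-parameter: the initialization being learned
for _ in range(2000):
    c = rng.choice(targets)                            # sample a task
    theta_prime = theta - alpha * task_grad(theta, c)  # inner update
    # first-order MAML: apply the post-adaptation gradient directly to the
    # init, dropping the second-order term d(theta_prime)/d(theta)
    theta = theta - beta * task_grad(theta_prime, c)
```

For this family of quadratic losses the learned init settles near the mean of the task targets — the point from which one inner step reaches any task fastest on average, which is exactly the "easy to fine-tune" property.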

March 10, 2026 · 3 min · 636 words · codefrydev

Chapter 69: RL² (Reinforcement Learning as an RNN)

Learning objectives

- Implement RL²: an RNN policy whose input at each step is (state, action, reward, done) from the previous step (and the current state), and whose output is the action.
- Explain how the RNN hidden state can encode the “learning algorithm” or belief about the task from the history of experience.
- Train the RNN on multiple POMDP tasks (or tasks with different dynamics/rewards) so that it learns to adapt its behavior from the history.
- Evaluate the trained policy on new POMDP tasks and compare with a non-recurrent policy.
- Relate RL² to dialogue (context-dependent response) and game AI (adapting to different levels or opponents).

Concept and real-world RL ...
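The per-step input construction and a recurrent policy might look like the sketch below; the dimensions and the hand-rolled tanh RNN are illustrative stand-ins for a trained GRU or LSTM policy:

```python
import numpy as np

rng = np.random.default_rng(0)

def rl2_input(state, prev_action, prev_reward, done, n_actions=2):
    """Concatenate (current state, one-hot previous action, previous reward,
    done flag) into the step input an RL^2 recurrent policy consumes."""
    a_onehot = np.zeros(n_actions)
    a_onehot[prev_action] = 1.0
    return np.concatenate([state, a_onehot, [prev_reward, float(done)]])

class TinyRNNPolicy:
    """Hand-rolled tanh RNN; the hidden state is what carries the task belief
    across steps (and, in RL^2, across episodes of the same task)."""
    def __init__(self, in_dim, hid_dim, n_actions):
        self.Wx = rng.normal(0, 0.1, (hid_dim, in_dim))
        self.Wh = rng.normal(0, 0.1, (hid_dim, hid_dim))
        self.Wo = rng.normal(0, 0.1, (n_actions, hid_dim))
        self.h = np.zeros(hid_dim)

    def step(self, x):
        self.h = np.tanh(self.Wx @ x + self.Wh @ self.h)
        return int(np.argmax(self.Wo @ self.h))

policy = TinyRNNPolicy(in_dim=3 + 2 + 2, hid_dim=16, n_actions=2)
x = rl2_input(state=np.zeros(3), prev_action=0, prev_reward=0.0, done=False)
action = policy.step(x)
```

The key design point is that the hidden state is *not* reset between episodes of the same task, so the recurrence itself can implement a fast inner "learning algorithm".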

March 10, 2026 · 4 min · 707 words · codefrydev

Chapter 70: Unsupervised Environment Design

Learning objectives

- Implement a simple PAIRED-style setup: an adversary that designs a maze (or environment) to minimize the agent’s return, and an agent that learns to solve mazes.
- Train both adversary and agent in a loop: the adversary proposes a maze, the agent attempts to solve it, then update the adversary to make the maze harder and the agent to improve.
- Explain how unsupervised environment design can produce a curriculum of tasks without hand-designed levels.
- Compare agent performance on adversary-generated mazes vs. fixed or random mazes.
- Relate PAIRED to game AI (procedural level generation) and robot navigation (training on diverse scenarios).

Concept and real-world RL ...
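A toy sketch of the adversarial loop, with maze design reduced to picking one of a few difficulty levels and the agent's competence modeled as a per-level success probability. Everything here is an illustrative abstraction: the full PAIRED algorithm scores the adversary by *regret* between two agents rather than raw return, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)
n_levels = 5                    # maze "designs" reduced to 5 difficulty levels
skill = np.zeros(n_levels)      # agent's success probability on each level
adv_value = np.zeros(n_levels)  # adversary's running estimate of agent return

for _ in range(500):
    # adversary proposes the level where the agent currently looks weakest
    if rng.random() < 0.1:                  # occasional exploration
        d = int(rng.integers(n_levels))
    else:
        d = int(np.argmin(adv_value))
    success = rng.random() < skill[d]       # agent attempts the maze
    skill[d] = min(1.0, skill[d] + 0.02)    # agent improves where it trains
    adv_value[d] += 0.1 * (float(success) - adv_value[d])  # adversary update
```

Because the adversary keeps steering training toward whichever level the agent fails on, the loop produces an automatic curriculum: competence spreads across all levels instead of concentrating on one.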

March 10, 2026 · 4 min · 734 words · codefrydev

Chapter 71: The Offline RL Problem

Learning objectives

- Collect a dataset of transitions (state, action, reward, next_state, done) from a random policy (or fixed behavior policy) in the Hopper environment.
- Train a standard SAC agent offline (no environment interaction) on this dataset and observe the overestimation of Q-values for out-of-distribution (OOD) actions.
- Explain why naive off-policy methods fail in offline RL: the policy is trained to maximize Q, but Q is only trained on in-distribution actions; for OOD actions Q can be overestimated.
- Identify the distributional shift between the behavior policy (that collected the data) and the learned policy.
- Relate the offline RL problem to recommendation and healthcare, where data comes from logs or historical trials.

Concept and real-world RL ...
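Dataset collection might look like the following, with a toy linear-dynamics environment standing in for Hopper (`ToyEnv`, its dimensions, and the episode length are assumptions for illustration; with Gymnasium the loop shape is the same):

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyEnv:
    """Stand-in for Hopper: 3-dim state, scalar action, episodes of length 50."""
    def reset(self):
        self.t = 0
        self.s = rng.normal(size=3)
        return self.s

    def step(self, a):
        self.t += 1
        self.s = 0.9 * self.s + 0.1 * a + 0.01 * rng.normal(size=3)
        reward = -float(np.sum(self.s ** 2))
        done = self.t >= 50
        return self.s, reward, done

env = ToyEnv()
buffer = {k: [] for k in ("state", "action", "reward", "next_state", "done")}
s = env.reset()
for _ in range(1000):
    a = rng.uniform(-1, 1)                 # random behavior policy
    ns, r, done = env.step(a)
    for k, v in zip(buffer, (s, a, r, ns, done)):
        buffer[k].append(v)
    s = env.reset() if done else ns

dataset = {k: np.array(v) for k, v in buffer.items()}
```

An offline learner sees only `dataset`; any action outside the behavior policy's range [-1, 1] is out-of-distribution, and nothing in the data constrains Q there — that is the overestimation trap the chapter demonstrates.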

March 10, 2026 · 4 min · 723 words · codefrydev

Chapter 72: Conservative Q-Learning (CQL)

Learning objectives

- Implement the CQL loss: add a term that penalizes Q-values for actions drawn from the current policy (or a uniform distribution) so that Q is lower for out-of-distribution actions.
- Apply CQL to the offline dataset from Chapter 71 and train an offline SAC (or similar) with the CQL regularizer.
- Compare the learned policy’s evaluation return and Q-values with naive SAC on the same dataset.
- Explain why penalizing Q for OOD actions helps avoid overestimation and improves offline performance.
- Relate CQL to recommendation and healthcare, where we must learn from fixed logs without overestimating unseen actions.

Concept and real-world RL ...
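For discrete actions, the CQL(H) regularizer is a logsumexp of Q over all actions minus the Q-value of the dataset action. The numpy sketch below shows only the penalty term on a toy batch, not the full SAC training loop; the batch shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def cql_penalty(q_all_actions, dataset_actions):
    """CQL(H) regularizer for discrete actions:
    E_s[ logsumexp_a Q(s, a) - Q(s, a_data) ].
    Pushes Q down on actions the policy might pick but the data never took,
    while leaving Q at dataset actions relatively untouched."""
    lse = np.log(np.sum(np.exp(q_all_actions), axis=1))  # soft maximum over actions
    q_data = q_all_actions[np.arange(len(dataset_actions)), dataset_actions]
    return np.mean(lse - q_data)

# toy batch: 4 states, 3 discrete actions
q = rng.normal(size=(4, 3))
a_data = rng.integers(0, 3, size=4)
penalty = cql_penalty(q, a_data)
# full loss would be: bellman_loss + cql_alpha * penalty
```

Since logsumexp upper-bounds the maximum Q-value, the penalty is always non-negative and is zero only when all the Q mass sits on the dataset actions — which is why minimizing it makes Q conservative on OOD actions.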

March 10, 2026 · 4 min · 684 words · codefrydev

Chapter 73: Decision Transformers

Learning objectives

- Implement a Decision Transformer: a transformer (or GPT-style) model that takes sequences of (returns-to-go, state, action) and predicts actions conditioned on desired return (and past states/actions).
- Explain the formulation: at each timestep, input (R_t, s_t, a_{t-1}, R_{t-1}, s_{t-1}, …) where R_t is the return from t onward; the model predicts a_t. Training is supervised on offline trajectories.
- Train the model on a simple environment’s offline dataset and test by conditioning on different returns-to-go (e.g. high return for “expert” behavior).
- Compare with offline RL (e.g. CQL) in terms of implementation and how the policy is extracted (conditioning vs. maximization).
- Relate Decision Transformers to recommendation (sequence of user-item-reward) and dialogue (conditioning on desired outcome).

Concept and real-world RL ...
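Computing returns-to-go and assembling the (returns-to-go, state, action) sequence can be sketched as below. The token layout is simplified for illustration: a real Decision Transformer embeds each modality separately and adds timestep encodings before feeding the GPT-style model.

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """R_t = sum of rewards from t onward (Decision Transformers use gamma=1)."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def build_tokens(states, actions, rewards):
    """Interleave (R_t, s_t, a_t) triples: the sequence a Decision Transformer
    is trained on, supervised to predict a_t from everything before it."""
    rtg = returns_to_go(rewards)
    return [(rtg[t], states[t], actions[t]) for t in range(len(rewards))]

tokens = build_tokens(states=[0, 1, 2], actions=[1, 0, 1], rewards=[1.0, 1.0, 1.0])
```

At test time the policy is extracted by *conditioning*: seed the sequence with a desired return-to-go (e.g. an expert-level return), let the model emit an action, then decrement the return-to-go by the observed reward — no Q-maximization step at all.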

March 10, 2026 · 4 min · 716 words · codefrydev

Chapter 74: Introduction to Imitation Learning

Learning objectives

- Collect expert demonstrations (state-action pairs or trajectories) from a trained PPO agent on LunarLander.
- Train a behavioral cloning (BC) agent: supervised learning to predict the expert’s action given the state.
- Evaluate the BC policy in the environment and compare its return and behavior to the expert.
- Explain the assumptions of behavioral cloning (i.i.d. states from the expert distribution) and when it works well.
- Relate imitation learning to robot navigation (learning from human demos) and dialogue (learning from human responses).

Concept and real-world RL ...
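Behavioral cloning reduces to supervised learning. This sketch fits a logistic-regression policy to synthetic expert labels — the linear "expert" and toy states are illustrative stand-ins for PPO demonstrations on LunarLander:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical expert demos: the "expert" picks action 1 iff w_true . s > 0
w_true = np.array([1.0, -2.0])
states = rng.normal(size=(500, 2))
actions = (states @ w_true > 0).astype(float)

# behavioral cloning = supervised learning on (state, action) pairs;
# here, logistic regression trained by gradient descent on cross-entropy
w = np.zeros(2)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(states @ w)))             # P(a=1 | s)
    w -= 0.1 * states.T @ (p - actions) / len(states)   # cross-entropy gradient

bc_actions = (states @ w > 0).astype(float)
accuracy = (bc_actions == actions).mean()
```

Note that `accuracy` is measured on states drawn from the expert's own distribution — exactly the i.i.d. assumption BC relies on; the next chapter shows what happens when the learner's visited states drift away from it.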

March 10, 2026 · 3 min · 626 words · codefrydev

Chapter 75: Limitations of Behavioral Cloning

Learning objectives

- Demonstrate the covariate shift problem: run the BC agent, record states it visits that were rare or absent in the expert data, and show that errors compound in those regions.
- Implement DAgger: collect new data by running the current BC policy (or a mix of expert and BC), query the expert for the correct action at those states, add to the dataset, and retrain BC.
- Explain why DAgger reduces covariate shift by adding on-policy (or mixed) states to the training set.
- Compare BC (trained only on expert data) with DAgger (iteratively aggregated) in terms of evaluation return and robustness.
- Relate covariate shift and DAgger to robot navigation and healthcare, where the learner’s distribution can drift from the expert’s.

Concept and real-world RL ...
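The DAgger loop can be sketched as follows, with a linear expert and a toy state drift standing in for a real environment — all dynamics, dataset sizes, and learning rates are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
w_expert = np.array([1.0, -1.0])           # hypothetical expert: a = 1 iff w . s > 0

def expert_action(s):
    return float(s @ w_expert > 0)

def fit_bc(S, A, iters=300, lr=0.3):
    """Logistic-regression behavioral cloning on the aggregated dataset."""
    w = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(S @ w)))
        w -= lr * S.T @ (p - A) / len(S)
    return w

def rollout(w, n=100):
    """States the *learner's* policy visits drift with its own actions --
    this is the covariate shift BC alone never sees."""
    s = np.zeros(2)
    visited = []
    for _ in range(n):
        a = float(s @ w > 0)
        s = 0.8 * s + np.array([a - 0.5, 0.5 - a]) + 0.1 * rng.normal(size=2)
        visited.append(s.copy())
    return np.array(visited)

# DAgger: roll out the current policy, label its visited states with the
# expert, aggregate into the dataset, retrain, repeat
S = rng.normal(size=(50, 2)); A = np.array([expert_action(s) for s in S])
sizes = []
for _ in range(5):
    w = fit_bc(S, A)
    new_S = rollout(w)
    new_A = np.array([expert_action(s) for s in new_S])
    S = np.vstack([S, new_S]); A = np.concatenate([A, new_A])
    sizes.append(len(S))
```

Each iteration adds expert labels exactly where the learner actually goes, so the training distribution tracks the learner's own state distribution instead of the expert's — which is the mechanism by which DAgger stops errors from compounding.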

March 10, 2026 · 4 min · 807 words · codefrydev

Chapter 76: Inverse Reinforcement Learning (IRL)

Learning objectives

- Implement maximum entropy IRL: given expert trajectories, learn a reward function such that the expert’s policy (approximately) maximizes expected return under that reward.
- Use a linear reward model (e.g. r(s, a) = w^T φ(s, a)) and forward RL (e.g. value iteration or policy gradient) to compute the optimal policy for the current reward.
- Iterate between updating the reward to make the expert look better than other policies and solving the forward RL problem.
- Explain why IRL can recover a reward that explains the expert behavior and then generalize (e.g. to new states) better than pure BC in some settings.
- Relate IRL to robot navigation (recovering intent from demonstrations) and healthcare (inferring treatment objectives).

Concept and real-world RL ...
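A feature-matching sketch of the MaxEnt IRL update with a linear reward r(s) = w·φ(s). The per-state softmax below is a deliberately crude stand-in for the full forward-RL step (value iteration over a real MDP); the features, expert visitation frequencies, and step size are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_features = 6, 3
phi = rng.normal(size=(n_states, n_features))   # hypothetical state features

def soft_policy_visits(w):
    """Stand-in for the forward-RL step: visitation frequencies of a
    soft-optimal policy under reward r(s) = w . phi(s) (a per-state
    softmax instead of full value iteration, for brevity)."""
    r = phi @ w
    p = np.exp(r - r.max())
    return p / p.sum()

expert_visits = np.array([0.4, 0.3, 0.1, 0.1, 0.05, 0.05])  # from expert trajs
mu_expert = expert_visits @ phi                  # expert feature expectations

w = np.zeros(n_features)
err0 = np.abs(mu_expert - soft_policy_visits(w) @ phi).max()
for _ in range(1000):
    mu_policy = soft_policy_visits(w) @ phi      # solve forward problem
    w += 0.1 * (mu_expert - mu_policy)           # make the expert look better
match_error = np.abs(mu_expert - soft_policy_visits(w) @ phi).max()
```

The update is the MaxEnt IRL gradient: raise the reward on features the expert visits more than the current soft-optimal policy does, and lower it elsewhere, until the two feature expectations match. The recovered `w` then scores states the expert never visited, which is where IRL can generalize beyond pure BC.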

March 10, 2026 · 4 min · 762 words · codefrydev