Chapter 77: Generative Adversarial Imitation Learning (GAIL)

Learning objectives Implement GAIL: train a discriminator D(s, a) to distinguish state-action pairs from the expert vs from the current policy; use the discriminator output (or log D) as reward for a policy gradient method. Train the policy to maximize the discriminator reward (i.e. to fool the discriminator) while the discriminator tries to tell expert from agent. Test on a simple task (e.g. CartPole or MuJoCo) and compare imitation quality with behavioral cloning. Explain the connection to GANs: the policy is the generator, the discriminator gives the learning signal. Relate GAIL to robot navigation and game AI where we have expert demos and want to match the expert distribution without hand-designed rewards. Concept and real-world RL ...
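The discriminator-as-reward loop can be sketched with a toy logistic discriminator; all data, dimensions, and learning rates below are made up for illustration, not part of any real environment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for (state, action) pairs: expert pairs cluster near +1,
# current-policy pairs near -1 (purely illustrative data).
expert_sa = rng.normal(loc=1.0, scale=0.3, size=(256, 2))
policy_sa = rng.normal(loc=-1.0, scale=0.3, size=(256, 2))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Logistic discriminator D(s, a) = sigmoid(w.x + b), trained by gradient
# ascent on the GAN objective E[log D(expert)] + E[log(1 - D(policy))].
w, b = np.zeros(2), 0.0
for _ in range(200):
    d_exp = sigmoid(expert_sa @ w + b)
    d_pol = sigmoid(policy_sa @ w + b)
    g_w = expert_sa.T @ (1 - d_exp) / len(expert_sa) - policy_sa.T @ d_pol / len(policy_sa)
    g_b = (1 - d_exp).mean() - d_pol.mean()
    w += 0.5 * g_w
    b += 0.5 * g_b

def gail_reward(sa):
    # Reward the policy for fooling the discriminator: higher when D thinks
    # the pair looks like the expert. log D(s, a) is a common alternative.
    return -np.log(1.0 - sigmoid(sa @ w + b) + 1e-8)
```

In a full GAIL loop this reward replaces the environment reward inside a policy-gradient update, and the policy's samples are regenerated every iteration so the discriminator keeps chasing the current policy.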

March 10, 2026 · 4 min · 704 words · codefrydev

Chapter 78: Adversarial Motion Priors (AMP)

Learning objectives Read the AMP paper and explain how it combines a task reward (e.g. velocity tracking, goal reaching) with an adversarial style reward (discriminator that scores motion similarity to reference data). Write the combined reward function: r = r_task + λ r_style, where r_style comes from a discriminator trained to distinguish agent motion from reference (e.g. motion capture) data. Identify why adding a style reward helps produce natural-looking and robust locomotion compared to task-only reward. Relate AMP to robot navigation and game AI (character animation) where we want both task success and natural motion. Concept and real-world RL ...
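The combined reward is a one-liner once the style term is defined; the bounded form below follows the least-squares shape used in the AMP paper, assuming a discriminator trained toward a score of 1 on reference motion and -1 on agent motion:

```python
def style_reward(d):
    """Bounded style reward from an LSGAN-style discriminator score d
    (target 1 for reference/mocap motion, -1 for agent motion)."""
    return max(0.0, 1.0 - 0.25 * (d - 1.0) ** 2)

def amp_reward(r_task, d, lam=0.5):
    # r = r_task + lambda * r_style: task progress plus motion naturalness.
    return r_task + lam * style_reward(d)
```

A score of 1 ("looks like the reference data") earns the full style bonus, a score of -1 earns none, and λ trades off task success against naturalness.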

March 10, 2026 · 4 min · 717 words · codefrydev

Chapter 79: Offline-to-Online Finetuning

Learning objectives Pretrain an SAC (or similar) agent offline on a fixed dataset (e.g. from a mix of policies or from Chapter 71). Finetune the agent online by continuing training with environment interaction. Compare the learning curve (return vs steps) of finetuning from offline pretraining vs training from scratch. Implement a Q-filter: when updating the policy, avoid or downweight updates that use actions for which Q is below a threshold (to avoid reinforcing “bad” actions that could destabilize the policy). Relate offline-to-online to recommendation (pretrain on logs, then A/B test) and healthcare (pretrain on historical data, then cautious online updates). Concept and real-world RL ...
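The Q-filter itself is tiny; the threshold, the soft variant, and the usage comment below are illustrative choices rather than a fixed recipe:

```python
import numpy as np

def q_filter(q_values, threshold, temp=None):
    """Per-sample weights for the policy update: downweight updates whose
    action has a Q-value below the threshold. temp=None gives a hard 0/1
    mask; a finite temp gives a soft sigmoid weighting instead."""
    q = np.asarray(q_values, dtype=float)
    if temp is None:
        return (q >= threshold).astype(float)
    return 1.0 / (1.0 + np.exp(-(q - threshold) / temp))

q = np.array([0.2, 1.5, -0.3, 0.9])
weights = q_filter(q, threshold=0.5)  # hard mask: [0., 1., 0., 1.]
# policy_loss = -(weights * log_probs * advantages).mean()   # sketch of usage
```

During online finetuning the mask suppresses gradient signal from actions the critic considers bad, which is one way to keep the pretrained policy from being destabilized by early noisy online data.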

March 10, 2026 · 4 min · 756 words · codefrydev

Chapter 80: RL from Human Feedback (RLHF) Basics

Learning objectives Implement a Bradley-Terry model to learn a reward function from pairwise comparisons of two trajectories (or segments): given (τ^w, τ^l) meaning “τ^w is preferred over τ^l,” fit r by maximizing the likelihood of the preferences under P(τ^w ≻ τ^l) = σ(r(τ^w) − r(τ^l)). Use the learned reward to train a policy with PPO (or another policy gradient method): maximize expected return under r. Explain the RLHF pipeline: collect preferences → train reward model → train policy on reward model. Test on a simple environment with simulated preferences (e.g. prefer longer/higher-return trajectories) and verify the policy improves. Relate RLHF to dialogue (prefer helpful/harmless responses) and recommendation (prefer engaging content). Concept and real-world RL ...
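The reward-model fit reduces to logistic regression on feature differences. A minimal sketch, with toy trajectory features and a hidden "true" reward standing in for the human labeler (all sizes and values are invented):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy trajectory features; the hidden reward exists only to simulate preferences.
feats = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -0.5, 2.0])

# Preference pairs (winner, loser) derived from the hidden reward.
idx = rng.integers(0, 100, size=(500, 2))
winner_first = feats[idx[:, 0]] @ true_theta >= feats[idx[:, 1]] @ true_theta
wins = np.where(winner_first, idx[:, 0], idx[:, 1])
loses = np.where(winner_first, idx[:, 1], idx[:, 0])
D = feats[wins] - feats[loses]  # feature difference, winner minus loser

# Bradley-Terry: P(w > l) = sigmoid(r(w) - r(l)) with linear r(tau) = theta.phi(tau);
# fit theta by gradient ascent on the preference log-likelihood.
theta = np.zeros(3)
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(D @ theta)))
    theta += 0.01 * D.T @ (1.0 - p) / len(D)

agreement = np.mean(D @ theta > 0)  # fraction of pairs ranked like the labeler
```

The learned θ then defines the reward that PPO maximizes; in the real pipeline the pairs come from human annotators rather than a hidden scoring rule.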

March 10, 2026 · 4 min · 708 words · codefrydev

Chapter 81: Multi-Agent Fundamentals

Learning objectives Model a two-player zero-sum game (e.g. Rock-Paper-Scissors) as a Dec-POMDP (Decentralized Partially Observable Markov Decision Process) or equivalent multi-agent framework. Define states, observations, actions, and rewards for each agent in the game. Explain the difference between centralized (one controller sees everything) and decentralized (each agent has its own observation and policy) formulations. Identify how the same game can be viewed as a normal-form game (payoff matrix) and as a sequential Dec-POMDP (if we add structure). Relate multi-agent modeling to game AI (opponents, teammates) and trading (multiple market participants). Concept and real-world RL ...
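In the normal-form view, Rock-Paper-Scissors is just a payoff matrix, and the uniform mixed strategy can be checked in a few lines (zero-sum, so player 2's payoff is the negation of player 1's):

```python
import numpy as np

# Payoff for player 1; rows = player 1's action (R, P, S), columns = player 2's.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

def expected_payoff(p1, p2):
    # Expected payoff to player 1 under mixed strategies p1 and p2.
    return p1 @ A @ p2

uniform = np.ones(3) / 3
value = expected_payoff(uniform, uniform)      # 0: the game's value at equilibrium
pure_vs_uniform = A @ uniform                  # each pure reply also earns 0
```

Because every pure strategy earns the same payoff against uniform play, no player can profitably deviate, which is exactly the mixed Nash equilibrium condition.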

March 10, 2026 · 4 min · 673 words · codefrydev

Chapter 82: Game Theory Basics for RL

Learning objectives Compute the Nash equilibrium of a simple 2×2 game (e.g. Prisoner’s Dilemma) from the payoff matrix. Explain why independent learning (each agent learns its best response without knowing the other’s policy) might converge to an outcome that is not a Nash equilibrium, or might not converge at all. Compare Nash equilibrium payoffs with the payoffs that result from independent Q-learning or gradient-based learning in the same game. Identify the difference between cooperative, competitive, and mixed settings in terms of payoff structure. Relate game theory to game AI (opponent modeling) and trading (market equilibrium). Concept and real-world RL ...
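Pure-strategy Nash equilibria of a 2×2 game can be found by brute-force best-response checks; the Prisoner's Dilemma payoffs below are one common convention (3 = mutual cooperation, 1 = mutual defection, 5/0 = temptation/sucker):

```python
import numpy as np
from itertools import product

# Row player's payoffs; actions: 0 = Cooperate, 1 = Defect. Symmetric game.
R1 = np.array([[3, 0],
               [5, 1]])
R2 = R1.T

def pure_nash(R1, R2):
    """Enumerate pure-strategy Nash equilibria of a 2x2 game."""
    eqs = []
    for a, b in product(range(2), range(2)):
        best_a = R1[a, b] >= R1[:, b].max()  # row player cannot improve alone
        best_b = R2[a, b] >= R2[a, :].max()  # column player cannot improve alone
        if best_a and best_b:
            eqs.append((a, b))
    return eqs

equilibria = pure_nash(R1, R2)  # [(1, 1)]: mutual defection
```

Note that (Cooperate, Cooperate) pays more to both players than the equilibrium, which is the standard illustration of why equilibrium and optimality can diverge, and why independent learners often end up at (Defect, Defect).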

March 10, 2026 · 4 min · 672 words · codefrydev

Chapter 83: Independent Q-Learning (IQL)

Learning objectives Implement independent Q-learning (IQL) in a simple cooperative game (e.g. two agents must “meet” in the same cell or coordinate to achieve a joint goal). Observe the non-stationarity problem: as one agent’s policy changes, the transition and reward from the other agent’s perspective change, so the environment appears non-stationary. Explain why IQL can still work in some cooperative settings despite non-stationarity, and when it fails or converges slowly. Compare IQL with a baseline (e.g. random or hand-coded coordination) on the meet-up or similar task. Relate IQL and non-stationarity to game AI (teammates) and dialogue (multiple agents). Concept and real-world RL ...
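A stateless "meet in the same cell" game is enough to run IQL; each agent's update ignores the other agent entirely, which is exactly what makes the problem non-stationary from either agent's viewpoint (all constants below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = [np.zeros(2), np.zeros(2)]  # one Q-table per agent over its two "cells"
alpha, eps = 0.1, 0.2

for _ in range(2000):
    # epsilon-greedy action for each agent, using only its OWN Q-table
    acts = [int(rng.integers(2)) if rng.random() < eps else int(np.argmax(q))
            for q in Q]
    r = 1.0 if acts[0] == acts[1] else 0.0  # shared team reward: meet up
    for i in (0, 1):
        # Independent update: the other agent is treated as part of the
        # environment, so this target shifts whenever the other's policy does.
        Q[i][acts[i]] += alpha * (r - Q[i][acts[i]])

greedy = [int(np.argmax(q)) for q in Q]  # the agents typically align on one cell
```

On this easy coordination task the agents usually lock onto the same cell despite non-stationarity; with more agents, more meeting points, or stochastic rewards, the same scheme can oscillate or converge to a poor convention.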

March 10, 2026 · 4 min · 715 words · codefrydev

Chapter 84: Centralized Training, Decentralized Execution (CTDE)

Learning objectives Explain the CTDE paradigm: during training, algorithms can use centralized information (e.g. global state, all agents’ actions) to learn better value functions or gradients; during execution, each agent uses only its local observation and policy (decentralized). Give a concrete example (e.g. QMIX, MADDPG, or a simple cooperative task) where the critic or value function uses global state and the actor uses only local observation. Explain why CTDE helps with non-stationarity: during training, the centralized critic sees the full state and other agents’ actions, so the environment from the critic’s perspective is “stationary” (we know the joint action); each agent’s policy can then be trained with this stable learning signal. Identify why decentralized execution is important for scalability and deployment (no need to communicate all observations at test time). Relate CTDE to game AI (team coordination) and robot navigation (multi-robot systems). Concept and real-world RL ...
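The CTDE split is visible in the input shapes alone. The linear maps below are untrained placeholders and every dimension is made up; the point is only what each component is allowed to see:

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, obs_dim, state_dim, act_dim = 2, 4, 8, 2

# Decentralized actors: each consumes only its LOCAL observation (execution time).
actor_W = [rng.normal(size=(obs_dim, act_dim)) * 0.1 for _ in range(n_agents)]

def act(i, obs_i):
    return np.tanh(obs_i @ actor_W[i])

# Centralized critic: consumes the GLOBAL state plus ALL agents' actions
# (training time only), so the joint action is never a hidden confounder.
critic_W = rng.normal(size=state_dim + n_agents * act_dim) * 0.1

def q_value(state, joint_action):
    return float(np.concatenate([state, joint_action.ravel()]) @ critic_W)

state = rng.normal(size=state_dim)
local_obs = [rng.normal(size=obs_dim) for _ in range(n_agents)]
joint_action = np.stack([act(i, o) for i, o in enumerate(local_obs)])
q = q_value(state, joint_action)
```

At deployment only `act` runs, per agent and per local observation; the critic, and the global state it needs, can be discarded entirely.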

March 10, 2026 · 4 min · 754 words · codefrydev

Chapter 85: Multi-Agent DDPG (MADDPG)

Learning objectives Implement MADDPG for the Multi-Agent Particle Environment (e.g. “simple spread”): each agent has a decentralized actor (policy π_i(o_i) or π_i(s_i)) and a centralized critic Q_i(s, a_1,…,a_n) that takes the full state and all actions. Train the critics with TD targets using (s, a_1,…,a_n) and the actors with the gradient of Q_i w.r.t. agent i’s action (DDPG-style). Explain why centralized critics help: each Q_i can use the full state and joint action, so the critic sees a stationary environment; the actor for agent i is updated to maximize Q_i(s, a_1,…,a_i,…,a_n) by changing a_i (with a_i = π_i(o_i) at execution). Run on “simple spread” (or similar) and report coordination behavior and return. Relate MADDPG to robot navigation (multi-robot) and game AI (cooperative or competitive). Concept and real-world RL ...
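The distinctive MADDPG step is the actor ascending the centralized critic's gradient with respect to its own action slot. With a linear toy critic that gradient can be read off directly; everything here is illustrative, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, act_dim = 6, 2

# Centralized critic for agent 1: Q_1(s, a_1, a_2), linear for the sketch.
w = rng.normal(size=state_dim + 2 * act_dim)

def Q_1(s, a1, a2):
    return np.concatenate([s, a1, a2]) @ w

s = rng.normal(size=state_dim)
a1, a2 = rng.normal(size=act_dim), rng.normal(size=act_dim)

# DDPG-style actor update for agent 1: step a_1 along dQ_1/da_1. For a linear
# critic that gradient is just the weights on a_1's slot; with a_1 = pi_1(o_1),
# the same gradient would be backpropagated into the actor's parameters.
grad_a1 = w[state_dim : state_dim + act_dim]
a1_new = a1 + 0.1 * grad_a1
```

Agent 2's actor is updated the same way through its own critic Q_2; only the actors (each reading its local observation) are kept at execution time.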

March 10, 2026 · 4 min · 652 words · codefrydev

Chapter 86: Value Decomposition Networks (VDN)

Learning objectives Implement VDN: for a cooperative game, define joint Q as the sum of individual Q-values: Q_tot(s, a_1,…,a_n) = Q_1(o_1, a_1) + … + Q_n(o_n, a_n). Train with a joint reward (e.g. team reward): use TD on Q_tot so that the sum of individual Qs approximates the joint return; backprop distributes the gradient to each Q_i. Compare VDN with IQL (each agent trains Q_i on local reward or team reward without factorization) in terms of learning speed and final return. Explain the limitation of VDN: additivity may not hold for all tasks (e.g. when there are strong synergies or redundancies between agents). Relate VDN to game AI (team games) and robot navigation (multi-robot coordination). Concept and real-world RL ...
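One TD step on the additive factorization shows both how the team reward's gradient reaches every agent and why the greedy joint action never requires a search over joint actions (sizes and constants below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_actions = 2, 3
# Per-agent utilities; a single dummy observation, so plain vectors suffice.
Q = [rng.normal(size=n_actions) * 0.01 for _ in range(n_agents)]

def q_tot(actions):
    # VDN factorization: the joint value is the SUM of individual utilities.
    return sum(Q[i][a] for i, a in enumerate(actions))

alpha, gamma, team_reward = 0.1, 0.9, 1.0
actions = (0, 2)
before = q_tot(actions)

# Because Q_tot is additive, max over joint actions = sum of per-agent maxes,
# so the bootstrap target is cheap even as the joint action space grows.
target = team_reward + gamma * sum(q.max() for q in Q)
td_err = target - before
for i, a in enumerate(actions):
    Q[i][a] += alpha * td_err  # dQ_tot/dQ_i[a_i] = 1: each agent sees the full TD error
```

The same additivity is VDN's limitation: if one agent's best action depends on what the other chose, no sum of per-agent utilities can represent the joint values, which motivates richer mixers like QMIX.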

March 10, 2026 · 4 min · 684 words · codefrydev