Chapter 87: QMIX Algorithm

Learning objectives:
- Implement QMIX: a mixing network that takes the agent Q-values (Q_1, …, Q_n) and the global state s and outputs a joint Q_tot, with the monotonicity constraint ∂Q_tot/∂Q_i ≥ 0 so that the argmax over the joint action decomposes into per-agent argmaxes.
- Enforce monotonicity by generating the mixing weights with hypernetworks that take s as input and output non-negative weights (e.g. the absolute value of the network outputs).
- Train with TD learning on Q_tot using the joint reward; backpropagate through the mixing network to update both the mixing weights and the individual Q_i.
- Test on a cooperative task and compare with VDN and IQL.
- Relate QMIX to game AI (StarCraft, team coordination) and robot navigation (multi-robot coordination).

Concept and real-world RL ...
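A minimal NumPy sketch of the monotone mixing step (toy sizes; linear hypernetworks standing in for the small MLPs a real QMIX implementation uses; all names and shapes here are illustrative assumptions): taking the absolute value of the hypernetwork outputs makes every mixing weight non-negative, so raising any agent's Q_i can never lower Q_tot.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, state_dim, hidden = 3, 4, 8  # toy sizes, chosen arbitrarily

# Hypernetwork parameters: linear maps from the global state s to the
# mixing network's weights and biases.
W1_hyper = rng.normal(size=(state_dim, n_agents * hidden))
b1_hyper = rng.normal(size=(state_dim, hidden))
W2_hyper = rng.normal(size=(state_dim, hidden))

def elu(x):
    # Monotonically increasing activation, as in the QMIX mixing network.
    return np.where(x > 0, x, np.exp(x) - 1.0)

def q_tot(agent_qs, s):
    """Mix per-agent Q-values into Q_tot; monotone in each Q_i."""
    # abs() forces every mixing weight to be non-negative, which together
    # with the monotone activation guarantees dQ_tot/dQ_i >= 0.
    w1 = np.abs(s @ W1_hyper).reshape(n_agents, hidden)
    b1 = s @ b1_hyper
    w2 = np.abs(s @ W2_hyper)
    h = elu(agent_qs @ w1 + b1)
    return float(h @ w2)

# Numerical check of monotonicity: raising any agent's Q never lowers Q_tot.
s = rng.normal(size=state_dim)
qs = rng.normal(size=n_agents)
base = q_tot(qs, s)
for i in range(n_agents):
    bumped = qs.copy()
    bumped[i] += 1.0
    assert q_tot(bumped, s) >= base
```

Because every weight entering Q_tot is non-negative and the activation is increasing, the joint argmax can be computed by letting each agent maximize its own Q_i independently.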

March 10, 2026 · 4 min · 664 words · codefrydev

Chapter 88: Multi-Agent PPO (MAPPO)

Learning objectives:
- Adapt a PPO implementation to the multi-agent setting with parameter sharing: all agents use the same policy network π(a_i | o_i) (and optionally the same value function), with the agent identity or the observation distinguishing them.
- Use a centralized value function V(s_global), optionally conditioned on the other agents' actions, to reduce variance and improve credit assignment; the policy remains decentralized, π_i(a_i | o_i).
- Train on a collaborative task (e.g. a particle env or a simple grid) and compare with IPPO (Independent PPO: each agent runs PPO with its own parameters and no centralized value).
- Explain the benefits of parameter sharing (sample efficiency, symmetry) and of a centralized value (a better baseline, more stable training).
- Relate MAPPO to game AI (team games) and robot navigation (homogeneous multi-robot teams).

Concept and real-world RL ...
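The parameter-sharing idea can be sketched in a few lines (pure Python; a toy linear policy with hypothetical sizes, not a full PPO loop): a single weight matrix serves every agent, and a one-hot agent ID appended to the observation lets the shared policy still behave differently per agent.

```python
import random

n_agents, obs_dim, n_actions = 2, 3, 4   # toy sizes
in_dim = obs_dim + n_agents              # observation + one-hot agent id

random.seed(0)
# One weight matrix shared by every agent (parameter sharing).
W = [[random.gauss(0.0, 0.5) for _ in range(n_actions)] for _ in range(in_dim)]

def shared_policy_logits(obs, agent_id):
    """Logits of the shared policy pi(a_i | o_i, id_i)."""
    one_hot = [1.0 if j == agent_id else 0.0 for j in range(n_agents)]
    x = list(obs) + one_hot
    return [sum(x[j] * W[j][a] for j in range(in_dim)) for a in range(n_actions)]

obs = [0.2, -0.1, 0.4]
# Same observation, different agent ids -> generally different logits.
logits_0 = shared_policy_logits(obs, 0)
logits_1 = shared_policy_logits(obs, 1)
```

Every gradient step updates the one shared `W`, so experience from all agents improves a single set of parameters, which is where the sample-efficiency benefit comes from.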

March 10, 2026 · 4 min · 671 words · codefrydev

Chapter 89: Self-Play and League Training

Learning objectives:
- Implement self-play in a simple game (e.g. Tic-Tac-Toe): two copies of the same agent (or two agents with shared or separate parameters) play against each other, and the policy is updated from the outcomes.
- Update both agents (or the single shared policy) so that they improve against the current opponent (themselves).
- Track an Elo rating (or the win rate against a fixed baseline) as training progresses to measure improvement.
- Explain why self-play can lead to stronger policies (the opponent is always at the current skill level) and its potential pitfalls (cycling, forgetting past strategies).
- Relate self-play to game AI (AlphaGo, Dota) and dialogue (negotiation, debate).

Concept and real-world RL ...

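The Elo tracking mentioned above boils down to one update rule (a standard formula; the K-factor of 32 is a common default, not something this chapter prescribes):

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """Update player A's Elo rating after one game against B.
    score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss."""
    # Expected score of A under the logistic Elo model.
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    return r_a + k * (score_a - expected_a)

# Two equally rated players: the winner gains exactly k/2 points.
new_rating = elo_update(1200.0, 1200.0, 1.0)  # -> 1216.0
```

During self-play training, rating each checkpoint against a fixed baseline (or against earlier checkpoints) turns raw win/loss records into a single curve that should trend upward if training is working.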

March 10, 2026 · 4 min · 741 words · codefrydev

Chapter 90: Communication in MARL

Learning objectives:
- Implement a simple communication protocol: each agent outputs a message (e.g. a vector) in addition to its action; the message is fed into the other agents’ policies (e.g. as part of their observation at the next step).
- Train agents to solve a task that requires coordination (e.g. two agents must swap positions or colors, or meet at a target) using this communication channel.
- Compare with the same task without communication (each agent sees only its local observation) and report the improvement in return or success rate.
- Explain how learned communication can encode information (e.g. “I am going left”) that helps coordination.
- Relate communication in MARL to dialogue (multi-turn interaction) and robot navigation (multi-robot signaling).

Concept and real-world RL ...
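The message-passing loop described above can be sketched as a single step function (pure Python; the `step` name and the dummy stand-in policy are illustrative, not a fixed API): each agent's input is its local observation concatenated with the messages the other agents emitted on the previous step.

```python
def step(policies, local_obs, prev_msgs):
    """One step with inter-agent messaging: agent i observes its local
    observation plus all messages sent by the *other* agents last step."""
    actions, msgs = [], []
    for i, policy in enumerate(policies):
        inbox = [x for j, m in enumerate(prev_msgs) if j != i for x in m]
        action, msg = policy(local_obs[i] + inbox)
        actions.append(action)
        msgs.append(msg)
    return actions, msgs

# Dummy policy standing in for a learned network: the 'action' is the
# observation length and the 'message' echoes the first observation entry.
dummy = lambda obs: (len(obs), obs[:1])
actions, msgs = step([dummy, dummy], [[1.0], [2.0]], [[0.5], [0.7]])
```

The no-communication baseline is the same loop with `inbox` left empty, which makes the ablation in the objectives a one-line change.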

March 10, 2026 · 4 min · 729 words · codefrydev

Chapter 91: RL in Robotics

Learning objectives:
- Train a policy in simulation (e.g. robotic-arm reaching or locomotion) using a standard RL algorithm (e.g. PPO or SAC).
- Apply domain randomization: vary the physics parameters (e.g. mass, friction, motor gains) during training so the policy sees a distribution of simulated environments.
- Attempt to deploy the policy in a real-world setting (or in a different sim with “real” parameters) and evaluate the sim-to-real gap (the drop in performance or the need for adaptation).
- Explain why domain randomization can improve transfer: the policy becomes robust to parameter variation and may therefore generalize to the real world.
- Relate sim-to-real transfer and domain randomization to robot navigation and healthcare (safety-critical deployment).

Concept and real-world RL ...
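The randomization itself is simple to sketch (the parameter names, ranges, and the commented `env.reset` call are all hypothetical): sample a fresh physics configuration at the start of every training episode.

```python
import random

# Hypothetical parameter ranges; real ranges come from the target robot.
RANGES = {"mass": (0.8, 1.2), "friction": (0.5, 1.5), "motor_gain": (0.9, 1.1)}

def sample_physics(rng):
    """Draw one randomized physics configuration for a training episode."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in RANGES.items()}

rng = random.Random(0)
for episode in range(3):
    params = sample_physics(rng)
    # env.reset(physics=params)  # hypothetical simulator API
    # ... run the episode and update the policy as usual ...
```

Because the policy never sees the same physics twice, it cannot overfit to one simulator instance, which is the mechanism behind the transfer claim in the objectives.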

March 10, 2026 · 4 min · 677 words · codefrydev

Chapter 92: Safe Reinforcement Learning

Learning objectives:
- Formulate a constrained MDP for a self-driving car (or a similar domain): maximize progress (or reward) while keeping collisions (or another cost) below a threshold.
- Implement a Lagrangian method: add a penalty term λ * (constraint violation) to the objective and update the penalty coefficient λ (e.g. increase λ when the constraint is violated) so that the policy comes to satisfy the constraint.
- Explain the trade-off: a higher λ pushes the policy to satisfy the constraint but may reduce task reward; tune λ or use dual ascent.
- Evaluate the policy: report the task return and the constraint cost (e.g. collisions per episode), and verify that the constraint is met in evaluation.
- Relate safe RL and constrained MDPs to healthcare (safety constraints) and trading (risk limits).

Concept and real-world RL ...
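The dual-ascent update on λ is a one-liner (step size and function name are illustrative choices): λ grows while the measured constraint cost exceeds the threshold and decays back toward zero once the constraint is satisfied.

```python
def lagrangian_update(lmbda, constraint_cost, threshold, lr=0.1):
    """Dual ascent on the penalty coefficient lambda.
    constraint_cost is the measured cost (e.g. collisions per episode);
    threshold is the allowed budget. lambda is clipped at zero because
    a negative penalty would reward constraint violations."""
    return max(0.0, lmbda + lr * (constraint_cost - threshold))

# Violated constraint (cost 2.0 > budget 1.0): lambda increases.
up = lagrangian_update(1.0, 2.0, 1.0)    # 1.0 + 0.1 * 1.0
# Satisfied constraint (cost 0.0 < budget 1.0): lambda shrinks, floor at 0.
down = lagrangian_update(0.05, 0.0, 1.0)
```

In the training loop, the policy maximizes `reward - lmbda * cost` while this update runs alongside the policy updates, so λ automatically settles at a level that enforces the threshold.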

March 10, 2026 · 4 min · 707 words · codefrydev

Chapter 93: RL for Algorithmic Trading

Learning objectives:
- Simulate a simple stock market with one asset (e.g. a price that follows a random walk or a simple mean-reverting process).
- Design an MDP: state = (price, position, cash, or derived features); actions = buy / sell / hold (possibly with a size component); reward = profit (or a risk-adjusted return).
- Train an agent (e.g. DQN or PPO) on this MDP and evaluate its Sharpe ratio (mean return / standard deviation of return, over episodes or over time).
- Discuss risk management: position limits, drawdown, and transaction costs, and how the reward and state design shape the agent's behavior.
- Relate the exercise to the trading and finance anchor scenarios (state = market + portfolio, action = trade, reward = profit or Sharpe).

Concept and real-world RL ...
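The market simulator and the evaluation metric can both be sketched in stdlib Python (the mean-reversion parameters are arbitrary toy values; the Sharpe here is unannualized and uses no risk-free rate):

```python
import random
import statistics

def simulate_prices(n, p0=100.0, mu=100.0, theta=0.05, sigma=1.0, seed=0):
    """Mean-reverting price path: each step pulls the price back toward
    the long-run mean mu at rate theta, plus Gaussian noise."""
    rng = random.Random(seed)
    prices, p = [p0], p0
    for _ in range(n - 1):
        p += theta * (mu - p) + rng.gauss(0.0, sigma)
        prices.append(p)
    return prices

def sharpe(returns):
    """Sharpe ratio: mean return over its (population) std deviation."""
    sd = statistics.pstdev(returns)
    return statistics.mean(returns) / sd if sd > 0 else 0.0

prices = simulate_prices(50)
```

Feeding per-episode agent returns into `sharpe` gives the risk-adjusted evaluation asked for above; transaction costs and position limits would be subtracted from or enforced on the reward inside the episode loop.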

March 10, 2026 · 4 min · 652 words · codefrydev

Chapter 94: RL in Recommender Systems

Learning objectives:
- Build a toy recommender: 100 items and a user model with changing preferences (e.g. a latent state that drifts or a context-dependent taste).
- Define the state (e.g. user history, current context), the action (which item to show), and the reward (e.g. a click, watch time, or an engagement score).
- Train an agent with a policy gradient method (e.g. REINFORCE or PPO) to maximize long-term engagement (e.g. cumulative clicks or cumulative reward over a session).
- Compare with a baseline (e.g. random recommendations or greedy with respect to the current preference) and report engagement over episodes.
- Relate the formulation to the recommendation anchor (state = user context, action = item, return = long-term satisfaction).

Concept and real-world RL ...
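The REINFORCE update over a softmax item policy can be sketched directly on the logits (pure Python; 3 items instead of 100 to keep it readable, and the function names are illustrative). The key fact used is that the gradient of log π(a) with respect to the logits is onehot(a) − π.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(logits, action, reward, lr=0.1, baseline=0.0):
    """One REINFORCE update on the logits of a softmax item policy:
    logits += lr * (reward - baseline) * (onehot(action) - probs)."""
    probs = softmax(logits)
    adv = reward - baseline
    return [l + lr * adv * ((1.0 if i == action else 0.0) - p)
            for i, (l, p) in enumerate(zip(logits, probs))]

# Showing item 0 and observing a click (reward 1) makes item 0 more likely.
logits = reinforce_step([0.0, 0.0, 0.0], action=0, reward=1.0)
```

Running this over whole sessions with the discounted return in place of the single-step reward is what turns it into the long-term-engagement objective described above.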

March 10, 2026 · 4 min · 698 words · codefrydev

Chapter 95: Training Large Language Models with PPO

Learning objectives:
- Implement a PPO loop to fine-tune a small language model (e.g. GPT-2 small or DistilGPT-2) for text generation with a simple reward (e.g. positive sentiment, or length).
- Include a KL penalty (or KL constraint) so that the updated policy does not deviate too far from the initial (reference) policy, preventing mode collapse and maintaining fluency.
- Generate sequences with the current policy, compute the reward for each sequence, and update the policy with PPO (clipping plus the KL term).
- Observe that without the KL penalty the policy may collapse (e.g. always emitting the same high-reward token), while with it the outputs stay diverse.
- Relate this to dialogue and RLHF: the same PPO+KL setup is used to align LMs with human preferences.

Concept and real-world RL ...
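The KL penalty is typically folded into the per-token reward before PPO ever sees it; a minimal sketch of that shaping (the function name and β=0.1 default are illustrative, and this uses the common per-token estimate log π − log π_ref rather than the full KL):

```python
def kl_shaped_reward(task_reward, logp_policy, logp_ref, beta=0.1):
    """Per-token reward with a KL penalty toward the reference model:
    r_t = r_task - beta * (log pi(a_t|s_t) - log pi_ref(a_t|s_t)).
    When the policy assigns a token more probability than the reference
    does, the penalty is positive and the shaped reward shrinks."""
    return task_reward - beta * (logp_policy - logp_ref)

# Policy agrees with the reference: no penalty at all.
same = kl_shaped_reward(1.0, -2.0, -2.0)
# Policy has drifted (higher log-prob than reference): reward is docked.
drifted = kl_shaped_reward(1.0, -1.0, -2.0)
```

With β = 0 the drifted token keeps its full reward, which is exactly the collapse mode the objectives warn about: the policy can concentrate all mass on one high-reward token without paying any cost.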

March 10, 2026 · 4 min · 730 words · codefrydev

Chapter 96: Implementing RLHF in NLP

Learning objectives:
- Collect (or simulate) human preference data: pairs of model responses to the same prompt, with a label indicating which response is preferred.
- Train a reward model using the Bradley-Terry loss: P(τ^w preferred over τ^l) = σ(r(τ^w) - r(τ^l)), where r is the reward model (e.g. an LM with a scalar output head, or a separate network).
- Fine-tune the language model with PPO, using the learned reward model as the reward (plus a KL penalty toward the initial LM).
- Evaluate on held-out prompts: generate with the fine-tuned LM and score with the reward model; optionally compare with the initial LM.
- Relate this to the dialogue anchor and to real RLHF pipelines (InstructGPT, etc.).

Concept and real-world RL ...
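The Bradley-Terry objective above reduces to a simple negative log-likelihood on the reward margin, since −log σ(m) = log(1 + e^(−m)); a stdlib sketch (the function name is illustrative):

```python
import math

def bradley_terry_loss(r_preferred, r_rejected):
    """Negative log-likelihood that the reward model ranks the preferred
    response above the rejected one:
    -log sigma(r_w - r_l) = log(1 + exp(-(r_w - r_l)))."""
    margin = r_preferred - r_rejected
    return math.log(1.0 + math.exp(-margin))

# Equal rewards: the model is indifferent, loss = log 2.
tied = bradley_terry_loss(0.0, 0.0)
# A larger positive margin means higher likelihood, hence lower loss.
confident = bradley_terry_loss(2.0, 0.0)
```

Minimizing this over the labeled pairs pushes r(τ^w) above r(τ^l) by a growing margin, which is all the reward model needs before the PPO fine-tuning stage.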

March 10, 2026 · 4 min · 705 words · codefrydev