Volume 8 Recap Quiz (5 questions)

Q1. What is behavioral cloning (BC) and what is its main failure mode?
BC treats imitation learning as supervised learning: train a policy π(a|s) to minimize cross-entropy loss against expert actions. Simple and effective when the dataset is large and diverse. Main failure mode: covariate shift (compounding errors). A small mistake moves the agent to a state not seen in the expert data; the policy never learned to recover there, so it makes further mistakes. Over a horizon of T steps, total error can grow as O(T²).
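A minimal BC sketch on synthetic data (the 1-D environment and linear policy are illustrative assumptions, not from any particular benchmark): a softmax policy trained by cross-entropy gradient descent against expert labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expert dataset (assumption): 1-D states; expert picks action 1 when state > 0.
states = rng.normal(size=(500, 1))
expert_actions = (states[:, 0] > 0).astype(int)

# Linear logits over 2 actions, trained with cross-entropy = behavioral cloning.
W = np.zeros((1, 2))
for _ in range(200):
    logits = states @ W
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Gradient of cross-entropy w.r.t. logits: probs - one_hot(expert_action)
    grad = probs.copy()
    grad[np.arange(len(states)), expert_actions] -= 1
    W -= 0.5 * (states.T @ grad) / len(states)

acc = ((states @ W).argmax(axis=1) == expert_actions).mean()
```

Note the failure mode this sketch cannot show: accuracy is measured on the expert's state distribution, not on the states the learned policy would actually visit at rollout time.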
Q2. How does DAgger fix the covariate shift problem?
DAgger (Dataset Aggregation) iteratively: (1) run the current policy to collect states the learner actually visits; (2) query the expert for correct actions at those states; (3) aggregate the new data into the training set; (4) retrain. By training on the distribution of states the learner visits (not just expert trajectories), DAgger achieves O(T) error instead of O(T²). It requires an interactive expert.
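The four DAgger steps can be sketched in a toy chain world (the environment, the scripted expert, and the tabular majority-vote policy are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def expert(s):            # interactive expert (assumption): move toward state 5
    return 1 if s < 5 else 0

def step(s, a):           # chain world on states 0..10
    return min(10, s + 1) if a == 1 else max(0, s - 1)

# Tabular policy: majority vote over expert labels seen at each state.
data = {}                 # aggregated dataset: state -> list of expert actions
def policy(s):
    labels = data.get(s)
    return max(set(labels), key=labels.count) if labels else int(rng.integers(2))

for it in range(10):      # DAgger iterations
    s = 0
    for _ in range(20):
        a_learner = policy(s)                      # (1) roll out current policy
        data.setdefault(s, []).append(expert(s))   # (2) query expert, (3) aggregate
        s = step(s, a_learner)
    # (4) "retrain" is implicit here: the policy reads the aggregated dataset
```

The key point is in step (2): the expert is queried at states the *learner* visited, so the training distribution matches the deployment distribution.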
Q3. What is Inverse Reinforcement Learning (IRL), and when is it better than BC?
IRL infers the reward function R(s,a) that explains expert behaviour, then solves the RL problem with that inferred reward. Better than BC when: (1) you want the agent to generalise to new environments (BC copies actions; IRL recovers goals); (2) the expert data is sparse or suboptimal; (3) you want to transfer the policy to a different body/dynamics. Drawback: IRL is ill-posed (many rewards explain the same behaviour) and computationally expensive.
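A heavily simplified IRL sketch in the maximum-entropy style (one-step setting, one-hot features, and synthetic expert visitation are all assumptions): fit a linear reward R(s) = w·φ(s) so that the induced soft policy matches the expert's feature expectations.

```python
import numpy as np

phi = np.eye(4)                        # one-hot features for 4 states (assumption)
mu_expert = phi[[3, 3, 3, 2]].mean(0)  # synthetic expert: mostly visits state 3

w = np.zeros(4)
for _ in range(200):
    p = np.exp(phi @ w)
    p /= p.sum()                       # soft "policy" visitation under current reward
    mu_policy = p @ phi                # expected features under that policy
    w += 0.5 * (mu_expert - mu_policy) # max-ent gradient: match feature expectations

recovered = int(np.argmax(w))          # state the inferred reward prefers most
```

The ill-posedness mentioned above is visible here: any constant shift of w yields the same soft policy, so the reward is only identified up to such transformations.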
Q4. Describe the RLHF pipeline for LLM alignment (3 stages).
  1. Supervised Fine-Tuning (SFT): fine-tune the base LLM on high-quality human-written demonstrations.
  2. Reward Model Training: collect human preference comparisons (response A vs B); train a reward model R(prompt, response) to predict human preference scores.
  3. RL Fine-Tuning with PPO: use PPO to optimise the LLM policy to maximise R, with a KL penalty against the SFT model to prevent reward hacking and mode collapse.
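Stage 2 above can be sketched with a Bradley-Terry preference loss (the feature-vector representation of responses and all data are synthetic assumptions; a real reward model would be a fine-tuned transformer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic preference data: each "response" is a 3-dim feature vector (assumption).
true_w = np.array([1.0, -2.0, 0.5])
feats = rng.normal(size=(200, 2, 3))                     # 200 pairs (A, B)
labels = (feats[:, 0] @ true_w > feats[:, 1] @ true_w)   # True if A preferred

w = np.zeros(3)                                          # linear reward model
for _ in range(300):
    chosen = np.where(labels[:, None], feats[:, 0], feats[:, 1])
    rejected = np.where(labels[:, None], feats[:, 1], feats[:, 0])
    margin = (chosen - rejected) @ w                     # R(chosen) - R(rejected)
    p = 1 / (1 + np.exp(-margin))                        # Bradley-Terry: P(chosen wins)
    grad = ((1 - p)[:, None] * (chosen - rejected)).mean(0)
    w += 0.3 * grad                                      # ascend preference log-likelihood

agreement = (p > 0.5).mean()    # fraction of pairs the model now ranks correctly
```

Stage 3 then maximises this learned R with PPO; the KL penalty matters because R is only reliable on the distribution of responses it was trained on.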
Q5. What is Conservative Q-Learning (CQL) and why is it needed for offline RL?
CQL adds a penalty to the standard Bellman loss that lowers Q-values for out-of-distribution (OOD) actions while raising them for in-distribution actions. Formally, it adds α · (E_{a∼μ(·|s)}[Q(s,a)] − E_{a∼β(·|s)}[Q(s,a)]) to the loss, where μ is a distribution that covers OOD actions (often the soft-maximising policy) and β is the behaviour policy that generated the dataset. This prevents the Q-function from overestimating the value of actions never seen in the dataset. BCQ achieves a similar goal by constraining the learned policy's actions to stay close to the behaviour policy.
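The penalty term can be computed directly on a toy Q-table (the Q-values and dataset actions below are illustrative; this uses the common variant where E_{a∼μ}[Q] is the log-sum-exp over actions, i.e. μ is the soft-maximising policy):

```python
import numpy as np

alpha = 1.0
Q = np.array([[1.0, 5.0, 0.0],      # toy Q(s, a): 2 states x 3 actions (assumption)
              [2.0, 1.0, 3.0]])
dataset_actions = np.array([0, 2])  # actions the behaviour policy beta actually took

# E_{a~mu}[Q]: soft maximum over actions, via log-sum-exp.
logsumexp = np.log(np.exp(Q).sum(axis=1))
q_data = Q[np.arange(len(Q)), dataset_actions]       # E_{a~beta}[Q(s,a)]
cql_penalty = alpha * (logsumexp - q_data).mean()    # added to the Bellman loss
```

Because log-sum-exp exceeds any individual Q-value, the penalty is always positive; minimising it pushes down the softly-maximal (potentially OOD) Q-values relative to the Q-values of dataset actions.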

What Changes in Volume 9

| | Volume 8 (Single-Agent) | Volume 9 (Multi-Agent) |
|---|---|---|
| Environment | One agent, stationary world | Multiple agents, each affecting others |
| Stationarity | Environment is fixed | Non-stationary — other agents are learning |
| Solution concept | Optimal policy | Nash equilibrium |
| Credit assignment | Straightforward | Hard — joint reward, individual actions |
| Key challenge | OOD actions, distribution shift | Non-stationarity, communication, emergent behaviour |

The big insight: With multiple agents, each agent’s optimal policy depends on what the others do — and they’re all changing simultaneously. From any single agent’s point of view, the environment is therefore no longer stationary, which breaks the assumptions single-agent RL relies on. Game theory (Nash equilibrium, zero-sum, cooperative) provides the right framework. CTDE (Centralised Training, Decentralised Execution) is the dominant paradigm: train with global info, deploy with local observations.


Bridge Exercise: The Prisoner’s Dilemma — Game Theory in Action

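A starting point for the exercise (the payoff values are the standard Prisoner’s Dilemma numbers; the brute-force Nash check is a sketch, sufficient only for tiny games): verify by best-response that mutual defection is the unique Nash equilibrium.

```python
import itertools

# Prisoner's Dilemma payoffs (negated prison years); 0 = Cooperate, 1 = Defect.
# payoff[(a1, a2)] = (reward to player 1, reward to player 2)
payoff = {
    (0, 0): (-1, -1),   # both cooperate
    (0, 1): (-3,  0),   # P1 cooperates, P2 defects
    (1, 0): ( 0, -3),   # P1 defects, P2 cooperates
    (1, 1): (-2, -2),   # both defect
}

def is_nash(a1, a2):
    # Nash equilibrium: no player gains by unilaterally deviating.
    u1, u2 = payoff[(a1, a2)]
    best1 = all(payoff[(d, a2)][0] <= u1 for d in (0, 1))
    best2 = all(payoff[(a1, d)][1] <= u2 for d in (0, 1))
    return best1 and best2

equilibria = [p for p in itertools.product((0, 1), repeat=2) if is_nash(*p)]
# The dilemma: (Defect, Defect) is the only equilibrium, yet both players
# would be strictly better off at (Cooperate, Cooperate).
```

Try changing the payoffs (e.g. reward mutual cooperation more heavily) and re-running the check to see when cooperation becomes an equilibrium.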

Next: Volume 9: Multi-Agent RL