Volume 8 Recap Quiz (5 questions)

Q1. What is behavioral cloning (BC) and what is its main failure mode?
BC treats imitation learning as supervised learning: train a policy π(a|s) to minimize cross-entropy loss against expert actions. Simple and effective when the dataset is large and diverse. Main failure mode: covariate shift (compounding errors). A small mistake moves the agent to a state not seen in the expert data; the policy never learned to recover there, so it makes further mistakes. Over a horizon of T steps, total error can grow as O(T²).
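A minimal BC sketch on synthetic data (the 1-D environment and linear policy are illustrative assumptions, not from any particular benchmark): a softmax policy trained by cross-entropy gradient descent against expert labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expert dataset (assumption): 1-D states; expert picks action 1 when state > 0.
states = rng.normal(size=(500, 1))
expert_actions = (states[:, 0] > 0).astype(int)

# Linear logits over 2 actions, trained with cross-entropy = behavioral cloning.
W = np.zeros((1, 2))
for _ in range(200):
    logits = states @ W
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Gradient of cross-entropy w.r.t. logits: probs - one_hot(expert_action)
    grad = probs.copy()
    grad[np.arange(len(states)), expert_actions] -= 1
    W -= 0.5 * (states.T @ grad) / len(states)

acc = ((states @ W).argmax(axis=1) == expert_actions).mean()
```

Note the failure mode this sketch cannot show: accuracy is measured on the expert's state distribution, not on the states the learned policy would actually visit at rollout time.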
Q2. How does DAgger fix the covariate shift problem?
DAgger (Dataset Aggregation) iteratively: (1) run the current policy to collect states the learner actually visits; (2) query the expert for correct actions at those states; (3) aggregate the new data into the training set; (4) retrain. By training on the distribution of states the learner visits (not just expert trajectories), DAgger achieves O(T) error instead of O(T²). It requires an interactive expert.
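The four DAgger steps can be sketched in a toy chain world (the environment, the scripted expert, and the tabular majority-vote policy are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def expert(s):            # interactive expert (assumption): move toward state 5
    return 1 if s < 5 else 0

def step(s, a):           # chain world on states 0..10
    return min(10, s + 1) if a == 1 else max(0, s - 1)

# Tabular policy: majority vote over expert labels seen at each state.
data = {}                 # aggregated dataset: state -> list of expert actions
def policy(s):
    labels = data.get(s)
    return max(set(labels), key=labels.count) if labels else int(rng.integers(2))

for it in range(10):      # DAgger iterations
    s = 0
    for _ in range(20):
        a_learner = policy(s)                      # (1) roll out current policy
        data.setdefault(s, []).append(expert(s))   # (2) query expert, (3) aggregate
        s = step(s, a_learner)
    # (4) "retrain" is implicit here: the policy reads the aggregated dataset
```

The key point is in step (2): the expert is queried at states the *learner* visited, so the training distribution matches the deployment distribution.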
Q3. What is Inverse Reinforcement Learning (IRL), and when is it better than BC?
IRL infers the reward function R(s,a) that explains expert behaviour, then solves the RL problem with that inferred reward. Better than BC when: (1) you want the agent to generalise to new environments (BC copies actions; IRL recovers goals); (2) the expert data is sparse or suboptimal; (3) you want to transfer the policy to a different body/dynamics. Drawback: IRL is ill-posed (many rewards explain the same behaviour) and computationally expensive.
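A heavily simplified IRL sketch in the maximum-entropy style (one-step setting, one-hot features, and synthetic expert visitation are all assumptions): fit a linear reward R(s) = w·φ(s) so that the induced soft policy matches the expert's feature expectations.

```python
import numpy as np

phi = np.eye(4)                        # one-hot features for 4 states (assumption)
mu_expert = phi[[3, 3, 3, 2]].mean(0)  # synthetic expert: mostly visits state 3

w = np.zeros(4)
for _ in range(200):
    p = np.exp(phi @ w)
    p /= p.sum()                       # soft "policy" visitation under current reward
    mu_policy = p @ phi                # expected features under that policy
    w += 0.5 * (mu_expert - mu_policy) # max-ent gradient: match feature expectations

recovered = int(np.argmax(w))          # state the inferred reward prefers most
```

The ill-posedness mentioned above is visible here: any constant shift of w yields the same soft policy, so the reward is only identified up to such transformations.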
Q4. Describe the RLHF pipeline for LLM alignment (3 stages).
  1. Supervised Fine-Tuning (SFT): fine-tune the base LLM on high-quality human-written demonstrations.
  2. Reward Model Training: collect human preference comparisons (response A vs B); train a reward model R(prompt, response) to predict human preference scores.
  3. RL Fine-Tuning with PPO: use PPO to optimise the LLM policy to maximise R, with a KL penalty against the SFT model to prevent reward hacking and mode collapse.
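Stage 2 above can be sketched with a Bradley-Terry preference loss (the feature-vector representation of responses and all data are synthetic assumptions; a real reward model would be a fine-tuned transformer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic preference data: each "response" is a 3-dim feature vector (assumption).
true_w = np.array([1.0, -2.0, 0.5])
feats = rng.normal(size=(200, 2, 3))                     # 200 pairs (A, B)
labels = (feats[:, 0] @ true_w > feats[:, 1] @ true_w)   # True if A preferred

w = np.zeros(3)                                          # linear reward model
for _ in range(300):
    chosen = np.where(labels[:, None], feats[:, 0], feats[:, 1])
    rejected = np.where(labels[:, None], feats[:, 1], feats[:, 0])
    margin = (chosen - rejected) @ w                     # R(chosen) - R(rejected)
    p = 1 / (1 + np.exp(-margin))                        # Bradley-Terry: P(chosen wins)
    grad = ((1 - p)[:, None] * (chosen - rejected)).mean(0)
    w += 0.3 * grad                                      # ascend preference log-likelihood

agreement = (p > 0.5).mean()    # fraction of pairs the model now ranks correctly
```

Stage 3 then maximises this learned R with PPO; the KL penalty matters because R is only reliable on the distribution of responses it was trained on.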
Q5. What is Conservative Q-Learning (CQL) and why is it needed for offline RL?
CQL adds a penalty to the standard Bellman loss that lowers Q-values for out-of-distribution (OOD) actions while raising them for in-distribution actions. Formally, it adds α · (E_{a∼μ(·|s)}[Q(s,a)] − E_{a∼β(·|s)}[Q(s,a)]) to the loss, where μ is a distribution that covers OOD actions (often the soft-maximising policy) and β is the behaviour policy that generated the dataset. This prevents the Q-function from overestimating the value of actions never seen in the dataset. BCQ achieves a similar goal by constraining the learned policy's actions to stay close to the behaviour policy.
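The penalty term can be computed directly on a toy Q-table (the Q-values and dataset actions below are illustrative; this uses the common variant where E_{a∼μ}[Q] is the log-sum-exp over actions, i.e. μ is the soft-maximising policy):

```python
import numpy as np

alpha = 1.0
Q = np.array([[1.0, 5.0, 0.0],      # toy Q(s, a): 2 states x 3 actions (assumption)
              [2.0, 1.0, 3.0]])
dataset_actions = np.array([0, 2])  # actions the behaviour policy beta actually took

# E_{a~mu}[Q]: soft maximum over actions, via log-sum-exp.
logsumexp = np.log(np.exp(Q).sum(axis=1))
q_data = Q[np.arange(len(Q)), dataset_actions]       # E_{a~beta}[Q(s,a)]
cql_penalty = alpha * (logsumexp - q_data).mean()    # added to the Bellman loss
```

Because log-sum-exp exceeds any individual Q-value, the penalty is always positive; minimising it pushes down the softly-maximal (potentially OOD) Q-values relative to the Q-values of dataset actions.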

What Changes in Volume 9

| | Volume 8 (Single-Agent) | Volume 9 (Multi-Agent) |
|---|---|---|
| Environment | One agent, stationary world | Multiple agents, each affecting others |
| Stationarity | Environment is fixed | Non-stationary — other agents are learning |
| Solution concept | Optimal policy | Nash equilibrium |
| Credit assignment | Straightforward | Hard — joint reward, individual actions |
| Key challenge | OOD actions, distribution shift | Non-stationarity, communication, emergent behaviour |

The big insight: With multiple agents, each agent’s optimal policy depends on what the others do — and they’re all changing simultaneously. From any single agent’s point of view, the environment is therefore no longer stationary, which breaks the assumptions single-agent RL relies on. Game theory (Nash equilibrium, zero-sum, cooperative) provides the right framework. CTDE (Centralised Training, Decentralised Execution) is the dominant paradigm: train with global info, deploy with local observations.


Bridge Exercise: The Prisoner’s Dilemma — Game Theory in Action

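A starting point for the exercise (the payoff values are the standard Prisoner’s Dilemma numbers; the brute-force Nash check is a sketch, sufficient only for tiny games): verify by best-response that mutual defection is the unique Nash equilibrium.

```python
import itertools

# Prisoner's Dilemma payoffs (negated prison years); 0 = Cooperate, 1 = Defect.
# payoff[(a1, a2)] = (reward to player 1, reward to player 2)
payoff = {
    (0, 0): (-1, -1),   # both cooperate
    (0, 1): (-3,  0),   # P1 cooperates, P2 defects
    (1, 0): ( 0, -3),   # P1 defects, P2 cooperates
    (1, 1): (-2, -2),   # both defect
}

def is_nash(a1, a2):
    # Nash equilibrium: no player gains by unilaterally deviating.
    u1, u2 = payoff[(a1, a2)]
    best1 = all(payoff[(d, a2)][0] <= u1 for d in (0, 1))
    best2 = all(payoff[(a1, d)][1] <= u2 for d in (0, 1))
    return best1 and best2

equilibria = [p for p in itertools.product((0, 1), repeat=2) if is_nash(*p)]
# The dilemma: (Defect, Defect) is the only equilibrium, yet both players
# would be strictly better off at (Cooperate, Cooperate).
```

Try changing the payoffs (e.g. reward mutual cooperation more heavily) and re-running the check to see when cooperation becomes an equilibrium.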

Next: Volume 9: Multi-Agent RL