Volume 9 Recap Quiz (5 questions)
Q1. What is a Nash Equilibrium, and why is it hard to find in multi-agent RL?
A Nash Equilibrium is a joint policy (π₁*, π₂*, …, πₙ*) where no single agent can improve its reward by unilaterally changing its policy (given the others are fixed). It’s hard to find in MARL because: (1) each agent’s environment is non-stationary (other agents are learning simultaneously); (2) there may be multiple Nash equilibria; (3) gradient-based methods may cycle or diverge in competitive settings; (4) the joint action space is exponential in the number of agents.
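The "no profitable unilateral deviation" condition can be checked directly in a tiny matrix game. A minimal sketch, using illustrative Prisoner's Dilemma payoffs (not from the text):

```python
# Verify pure-strategy Nash equilibria by checking unilateral deviations
# in a 2x2 matrix game (Prisoner's Dilemma; payoffs are illustrative).
import itertools

# payoffs[a1][a2] = (reward to agent 1, reward to agent 2); 0 = cooperate, 1 = defect
payoffs = [[(3, 3), (0, 5)],
           [(5, 0), (1, 1)]]

def is_nash(a1, a2):
    """A joint action is a NE if neither agent gains by deviating alone."""
    r1, r2 = payoffs[a1][a2]
    best1 = max(payoffs[d][a2][0] for d in (0, 1))  # agent 1 deviates, agent 2 fixed
    best2 = max(payoffs[a1][d][1] for d in (0, 1))  # agent 2 deviates, agent 1 fixed
    return r1 >= best1 and r2 >= best2

equilibria = [(a1, a2) for a1, a2 in itertools.product((0, 1), repeat=2) if is_nash(a1, a2)]
print(equilibria)  # [(1, 1)] — mutual defection is the only pure-strategy NE
```

Note how the check enumerates all joint actions: with n agents and k actions each, that is kⁿ joint actions, which is the exponential blow-up mentioned above.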
Q2. What is IQL (Independent Q-Learning) and why does it often work despite being theoretically flawed?
IQL treats each agent as an independent learner — each has its own Q-network and ignores other agents’ actions/policies. Theoretically, the environment is non-stationary from each agent’s perspective, so convergence guarantees break. In practice it often works because: (1) the other agents’ policies change slowly; (2) the team reward provides enough signal; (3) it’s simple and parallelisable. It fails when tight coordination is needed.
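A minimal tabular sketch of IQL on a toy two-agent coordination game (the game, seed, and hyperparameters are illustrative). Each agent updates its own Q-values from the shared reward alone, never seeing the other agent's action:

```python
# Independent Q-Learning sketch: each agent keeps its own Q-table and
# updates it as if the other agent were part of the environment.
import random
random.seed(0)

n_actions, alpha, eps = 2, 0.5, 0.1
Q = [[0.0] * n_actions for _ in range(2)]  # one Q-row per agent (single state)

def team_reward(a0, a1):
    return 1.0 if a0 == a1 else 0.0  # shared reward: coordinate on the same action

for _ in range(2000):
    # eps-greedy action selection, independently per agent
    acts = [random.randrange(n_actions) if random.random() < eps
            else max(range(n_actions), key=lambda a: Q[i][a])
            for i in range(2)]
    r = team_reward(*acts)
    for i in range(2):  # independent update: no access to the other agent's action
        Q[i][acts[i]] += alpha * (r - Q[i][acts[i]])

greedy = [max(range(n_actions), key=lambda a: Q[i][a]) for i in range(2)]
print(greedy)  # both agents converge to the same greedy action
```

From each agent's point of view the reward for a fixed action fluctuates as the other agent explores — that fluctuation is the non-stationarity that breaks the convergence guarantees.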
Q3. What is CTDE (Centralised Training, Decentralised Execution)?
During training: agents have access to global state, other agents’ actions/observations, and can share gradients — this enables coordination. During execution: each agent acts using only its own local observations (no communication or global state needed). This is practical for real-world deployment where agents are distributed. QMIX, MADDPG, and MAPPO all follow CTDE.
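The information asymmetry in CTDE is easiest to see in the function signatures. A structural sketch with illustrative stand-in functions (not any specific algorithm's implementation):

```python
# CTDE interface sketch: what each component is allowed to see.

def decentralised_actor(agent_id, local_obs):
    """Execution-time policy: local observation in, action out (no global info)."""
    return 0 if sum(local_obs) < 1.0 else 1  # stand-in for a learned policy

def centralised_critic(global_state, joint_actions):
    """Training-time value estimate: may see the global state and ALL actions."""
    return float(len(joint_actions)) - sum(joint_actions)  # stand-in for a learned Q

# Execution: each agent acts from its own observation alone.
local_obs = [[0.2, 0.3], [0.9, 0.8]]
joint_actions = [decentralised_actor(i, o) for i, o in enumerate(local_obs)]

# Training: the critic scores the joint behaviour using global information.
value = centralised_critic(global_state=sum(local_obs, []), joint_actions=joint_actions)
print(joint_actions, value)
```

The critic (and any gradient sharing) exists only during training; at deployment, only the actors ship.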
Q4. How does QMIX enforce monotonic mixing without losing expressiveness?
QMIX factorises the joint Q-function as Q_tot = f(Q₁(τ₁,a₁), …, Qₙ(τₙ,aₙ)) where f is a monotonically increasing function of each Qᵢ (enforced via non-negative weights generated by a hypernetwork conditioned on the global state). Monotonicity guarantees that argmax Q_tot = (argmax Q₁, …, argmax Qₙ) — decentralised greedy action selection recovers the joint greedy action. QMIX therefore never needs to enumerate the joint action space, which makes it scalable. The cost is expressiveness: payoff structures where one agent's best action depends on the others' simultaneous choices (non-monotonic tasks) cannot be represented exactly.
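The argmax-consistency guarantee can be checked numerically. A minimal sketch, assuming a fixed non-negative weight vector in place of QMIX's hypernetwork (the utilities and weights are illustrative):

```python
# With non-negative mixing weights, Q_tot is monotonically increasing in each
# Q_i, so the joint argmax decomposes into per-agent argmaxes.
import itertools

w = [0.7, 1.3]            # non-negative mixing weights (hypernetwork output in real QMIX)
Q = [[0.2, 0.9, 0.1],     # agent 1's utilities Q_1(tau_1, a_1)
     [0.5, 0.4, 0.8]]     # agent 2's utilities Q_2(tau_2, a_2)

def q_tot(joint):
    """Linear monotonic mixer: a stand-in for QMIX's monotonic mixing network."""
    return sum(wi * Qi[a] for wi, Qi, a in zip(w, Q, joint))

# Joint greedy: enumerate all joint actions (exponential; QMIX avoids this).
joint_best = max(itertools.product(range(3), repeat=2), key=q_tot)

# Decentralised greedy: each agent maximises its own Q_i independently.
decentralised = tuple(max(range(3), key=lambda a: Qi[a]) for Qi in Q)

print(joint_best, decentralised)  # both (1, 2): the argmaxes coincide
```

If any weight were negative, an agent maximising its own Qᵢ could *decrease* Q_tot, and the two argmaxes could disagree — that is exactly what the monotonicity constraint rules out.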
Q5. What is the difference between cooperative, competitive, and mixed multi-agent settings?
- Cooperative: all agents share the same reward. Goal: maximise joint return. Algorithms: QMIX, MAPPO, QPLEX.
- Competitive (zero-sum): one agent’s gain is another’s loss. Game tree search, self-play (AlphaZero), Nash Q-learning.
- Mixed (general-sum): agents have different reward functions, partially aligned. Most real-world settings. Requires general game-theoretic approaches; no single dominant paradigm.
What Changes in Volume 10
| | Volume 9 (Academic / Game Settings) | Volume 10 (Real-World Deployment) |
|---|---|---|
| Environment | Simulated game / benchmark | Physical world, production systems, LLMs |
| Safety | Not a concern (reset easily) | Critical — unsafe actions have real consequences |
| Reward | Well-defined (game score) | Ambiguous — requires human feedback or IRL |
| Distribution | Stationary at test time | Distributional shift, adversarial inputs |
| Interpretability | Optional | Often legally / ethically required |
| Scale | Hundreds of agents | Billions of parameters (LLM fine-tuning) |
The big insight: RL in the real world requires safety constraints, interpretable policies, robust behaviour under distribution shift, and alignment with human preferences. RLHF (Reinforcement Learning from Human Feedback) is the dominant method for aligning large language models — it applies PPO to token generation, using a reward model trained on human preference comparisons, with a KL penalty keeping the fine-tuned policy close to the reference model.
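The per-token RLHF reward can be sketched as the standard shaping r = reward-model score − β·KL(policy ∥ reference). A minimal sketch; the scores, log-probabilities, and β value are illustrative stand-ins, not from the text:

```python
# RLHF reward shaping sketch: preference-model score minus a KL penalty
# that keeps the policy close to the reference (pre-RLHF) model.
beta = 0.1  # KL coefficient (illustrative)

def rlhf_reward(rm_score, logp_policy, logp_ref):
    """Reward-model score minus beta times a per-token KL estimate."""
    kl_term = logp_policy - logp_ref  # log-ratio: per-token KL estimate
    return rm_score - beta * kl_term

# Policy has drifted toward higher-probability tokens than the reference:
r = rlhf_reward(rm_score=1.2, logp_policy=-0.5, logp_ref=-1.5)
print(round(r, 3))  # 1.2 - 0.1 * 1.0 = 1.1
```

The KL term is what prevents the policy from collapsing onto degenerate outputs that merely exploit the reward model — which is the reward-hacking failure mode the bridge exercise below explores.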
Bridge Exercise: Reward Hacking — The Alignment Problem in Miniature
Try it — edit and run (Shift+Enter)
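A starter sketch for the exercise (the gridworld, tile layout, and policies are illustrative, not from the text): the *proxy* reward pays +1 per step spent on a "score counter" tile, while the *true* objective is reaching the goal tile. A proxy-greedy agent parks on the counter and never finishes — reward hacking in miniature.

```python
# Reward hacking toy: maximising the proxy reward defeats the true objective.

def run(policy, steps=20):
    """Roll out a policy on a 6-tile corridor; return (proxy reward, reached goal)."""
    pos, proxy, reached_goal = 0, 0, False
    for _ in range(steps):
        pos = policy(pos)
        if pos == 2:           # "score counter" tile: +1 proxy reward every step
            proxy += 1
        if pos == 5:           # true goal tile
            reached_goal = True
            break
    return proxy, reached_goal

hacker = lambda pos: 2                  # exploit: sit on the counter forever
intended = lambda pos: min(pos + 1, 5)  # intended: walk straight to the goal

print(run(hacker))    # (20, False): high proxy reward, goal never reached
print(run(intended))  # (1, True): low proxy reward, goal reached
```

Try changing the proxy (e.g. penalise repeated visits to the same tile) and see whether the hacking policy still dominates — that is the alignment problem in miniature: the proxy, not the intent, is what gets optimised.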