Chapter 81: Multi-Agent Fundamentals

Learning objectives
- Model a two-player zero-sum game (e.g. Rock-Paper-Scissors) as a Dec-POMDP (Decentralized Partially Observable MDP) or an equivalent multi-agent framework.
- Define states, observations, actions, and rewards for each agent in the game.
- Explain the difference between centralized (one controller sees everything) and decentralized (each agent has its own observation and policy) formulations.
- Identify how the same game can be viewed both as a normal-form game (payoff matrix) and as a sequential Dec-POMDP (if we add structure).
- Relate multi-agent modeling to game AI (opponents, teammates) and trading (multiple market participants).

Concept and real-world RL ...
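The normal-form view of Rock-Paper-Scissors can be made concrete with its payoff matrix. The values below are the standard ones for the game; the function names and the uniform-strategy check are an illustrative sketch, not code from the chapter:

```python
# Rock-Paper-Scissors as a two-player zero-sum normal-form game.
ACTIONS = ["rock", "paper", "scissors"]

# PAYOFF_P1[i][j] = payoff to player 1 when P1 plays ACTIONS[i] and P2 plays ACTIONS[j].
PAYOFF_P1 = [
    [0, -1,  1],   # rock vs (rock, paper, scissors)
    [1,  0, -1],   # paper
    [-1, 1,  0],   # scissors
]

def payoffs(i, j):
    """Joint payoff; zero-sum means P2's payoff is the negation of P1's."""
    u1 = PAYOFF_P1[i][j]
    return u1, -u1

# Sanity check: payoffs always sum to zero for every joint action.
assert all(sum(payoffs(i, j)) == 0 for i in range(3) for j in range(3))

# The uniform mixed strategy is the Nash equilibrium of RPS; its expected payoff is 0.
uniform = [1 / 3] * 3
value = sum(uniform[i] * uniform[j] * PAYOFF_P1[i][j]
            for i in range(3) for j in range(3))
print(round(value, 10))
```

The same matrix becomes a (degenerate, one-step) Dec-POMDP by treating the joint action as the transition trigger and the payoff pair as the per-agent rewards.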

March 10, 2026 · 4 min · 673 words · codefrydev

Chapter 82: Game Theory Basics for RL

Learning objectives
- Compute the Nash equilibrium of a simple 2×2 game (e.g. the Prisoner’s Dilemma) from its payoff matrix.
- Explain why independent learning (each agent learns its best response without knowing the other’s policy) might converge to an outcome that is not a Nash equilibrium, or might not converge at all.
- Compare Nash equilibrium payoffs with the payoffs that result from independent Q-learning or gradient-based learning in the same game.
- Identify the difference between cooperative, competitive, and mixed settings in terms of payoff structure.
- Relate game theory to game AI (opponent modeling) and trading (market equilibrium).

Concept and real-world RL ...
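For a 2×2 game, a pure Nash equilibrium can be found by brute-force best-response checks. The Prisoner's Dilemma payoff values below are a common textbook choice, not taken from the chapter:

```python
# Prisoner's Dilemma payoffs (hypothetical standard values).
# Action 0 = cooperate, 1 = defect; U1[a1][a2] is player 1's payoff, U2 player 2's.
U1 = [[3, 0],
      [5, 1]]
U2 = [[3, 5],
      [0, 1]]

def is_nash(a1, a2):
    """A pure profile is Nash iff neither player gains by deviating unilaterally."""
    best1 = all(U1[a1][a2] >= U1[d][a2] for d in (0, 1))
    best2 = all(U2[a1][a2] >= U2[a1][d] for d in (0, 1))
    return best1 and best2

equilibria = [(a1, a2) for a1 in (0, 1) for a2 in (0, 1) if is_nash(a1, a2)]
print(equilibria)  # (defect, defect) is the unique pure equilibrium: [(1, 1)]
```

Note that mutual defection is the equilibrium even though mutual cooperation pays both players more, which is exactly the gap between equilibrium outcomes and jointly optimal outcomes that later cooperative algorithms try to close.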

March 10, 2026 · 4 min · 672 words · codefrydev

Chapter 83: Independent Q-Learning (IQL)

Learning objectives
- Implement independent Q-learning (IQL) in a simple cooperative game (e.g. two agents must “meet” in the same cell or coordinate to achieve a joint goal).
- Observe the non-stationarity problem: as one agent’s policy changes, the transitions and rewards seen from the other agent’s perspective change, so the environment appears non-stationary.
- Explain why IQL can still work in some cooperative settings despite non-stationarity, and when it fails or converges slowly.
- Compare IQL with a baseline (e.g. random or hand-coded coordination) on the meet-up or a similar task.
- Relate IQL and non-stationarity to game AI (teammates) and dialogue (multiple agents).

Concept and real-world RL ...

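A minimal version of the meet-up task reduces to a stateless two-action coordination game: reward 1 iff both agents pick the same cell. The sketch below runs two independent ε-greedy Q-learners on it; the hyperparameters and the bandit-style simplification are my own, not the chapter's:

```python
import random

random.seed(0)
N_ACTIONS = 2          # two cells to "meet" in; reward 1 iff both pick the same cell
q1 = [0.0] * N_ACTIONS # agent 1's independent Q-table (stateless bandit view)
q2 = [0.0] * N_ACTIONS
alpha, eps = 0.1, 0.1

def choose(q):
    """Epsilon-greedy action selection over a small Q-table."""
    if random.random() < eps:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: q[a])

for _ in range(5000):
    a1, a2 = choose(q1), choose(q2)
    r = 1.0 if a1 == a2 else 0.0  # shared team reward
    # Each agent updates as if the other were part of the environment —
    # this is the source of non-stationarity: the "environment" shifts as the
    # teammate's policy shifts.
    q1[a1] += alpha * (r - q1[a1])
    q2[a2] += alpha * (r - q2[a2])

print(q1, q2)  # the greedy actions typically coincide: coordination emerges
```

Here IQL succeeds because either meeting point is a stable equilibrium once both agents lean toward it; tasks that require passing through low-reward joint actions are where independent learners stall.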

March 10, 2026 · 4 min · 715 words · codefrydev

Chapter 85: Multi-Agent DDPG (MADDPG)

Learning objectives
- Implement MADDPG for the Multi-Agent Particle Environment (e.g. “simple spread”): each agent has a decentralized actor (policy π_i(o_i) or π_i(s_i)) and a centralized critic Q_i(s, a_1,…,a_n) that takes the full state and all actions.
- Train the critics with TD targets using (s, a_1,…,a_n), and the actors with the gradient of Q_i w.r.t. agent i’s action (DDPG-style).
- Explain why centralized critics help: each Q_i conditions on the full state and joint action, so the critic sees a stationary learning problem; the actor for agent i is updated to maximize Q_i(s, a_1,…,a_i,…,a_n) by changing a_i (with a_i = π_i(o_i) at execution).
- Run on “simple spread” (or similar) and report coordination behavior and return.
- Relate MADDPG to robot navigation (multi-robot) and game AI (cooperative or competitive).

Concept and real-world RL ...
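The actor-update idea — ascend the centralized critic's gradient with respect to your own action while the others' actions are held fixed — can be shown without neural networks. In this toy sketch the critic Q is given in closed form and finite differences stand in for backprop; in real MADDPG the critic is learned from TD targets and the problem comes from the particle environment. Everything here (the target, learning rate, quadratic critic) is a hypothetical stand-in:

```python
# MADDPG-flavored toy: two agents with scalar actions must meet at TARGET.
TARGET = 1.0

def Q(a1, a2):
    """Centralized team critic: high when both actions are near the target
    and near each other. Hand-written here; learned via TD in real MADDPG."""
    return -((a1 - TARGET) ** 2 + (a2 - TARGET) ** 2 + (a1 - a2) ** 2)

# Deterministic "policies": each actor outputs a constant action theta[i].
theta = [0.0, 2.0]
lr, h = 0.05, 1e-5

for _ in range(500):
    a1, a2 = theta
    # DDPG-style actor update: ascend dQ/da_i with the other agent's action
    # held fixed (finite differences replace autograd through the critic).
    g1 = (Q(a1 + h, a2) - Q(a1 - h, a2)) / (2 * h)
    g2 = (Q(a1, a2 + h) - Q(a1, a2 - h)) / (2 * h)
    theta[0] += lr * g1
    theta[1] += lr * g2

print(theta)  # both actions converge near the target 1.0
```

Because each gradient is taken through a critic that sees the joint action, neither agent's update treats the other as environment noise — the stationarity argument from the objectives in miniature.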

March 10, 2026 · 4 min · 652 words · codefrydev

Chapter 86: Value Decomposition Networks (VDN)

Learning objectives
- Implement VDN: for a cooperative game, define the joint Q as the sum of individual Q-values: Q_tot(s, a_1,…,a_n) = Q_1(o_1, a_1) + … + Q_n(o_n, a_n).
- Train with a joint reward (e.g. a team reward): run TD on Q_tot so that the sum of individual Qs approximates the joint return; backprop distributes the gradient to each Q_i.
- Compare VDN with IQL (each agent trains Q_i on a local or team reward without factorization) in terms of learning speed and final return.
- Explain the limitation of VDN: additivity may not hold for all tasks (e.g. when there are strong synergies or redundancies between agents).
- Relate VDN to game AI (team games) and robot navigation (multi-robot coordination).

Concept and real-world RL ...
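In the tabular case the "backprop distributes the gradient" step is especially simple: since ∂Q_tot/∂Q_i = 1, every component receives the same TD error. The one-step cooperative game and reward values below are hypothetical, chosen so the additive factorization happens to recover the optimal joint action:

```python
import random

random.seed(1)

# Cooperative one-step game: team reward for the joint action (hypothetical values).
R = [[5.0, 0.0],
     [0.0, 8.0]]  # R[a1][a2]; the best joint action is (1, 1)

q1 = [0.0, 0.0]  # individual utilities; Q_tot(a1, a2) = q1[a1] + q2[a2]
q2 = [0.0, 0.0]
alpha = 0.1

for _ in range(3000):
    a1, a2 = random.randrange(2), random.randrange(2)  # uniform exploration
    delta = R[a1][a2] - (q1[a1] + q2[a2])  # TD error on the factored joint Q
    # The sum has unit partial derivatives, so each component gets the same signal.
    q1[a1] += alpha * delta
    q2[a2] += alpha * delta

# Decentralized greedy execution: each agent argmaxes its own utility.
print(q1, q2)
```

The individual Qs converge to the best additive fit of R, and both greedy actions land on 1, matching the joint optimum. Swapping in a payoff matrix with strong cross-agent synergies (where no additive fit ranks the joint actions correctly) demonstrates the limitation listed above.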

March 10, 2026 · 4 min · 684 words · codefrydev

Chapter 87: QMIX Algorithm

Learning objectives
- Implement QMIX: a mixing network that takes the agent Q-values (Q_1,…,Q_n) and the global state s and outputs a joint Q_tot, with the monotonicity constraint ∂Q_tot/∂Q_i ≥ 0 so that the argmax over the joint action decomposes into per-agent argmaxes.
- Enforce monotonicity by generating the mixing weights with hypernetworks that take s and output positive weights (e.g. the absolute value of the network outputs).
- Train with TD on Q_tot using the joint reward; backprop through the mixing network to update both the mixing weights and the individual Q_i.
- Test on a cooperative task and compare with VDN and IQL.
- Relate QMIX to game AI (StarCraft, team coordination) and robot navigation (multi-robot).

Concept and real-world RL ...
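The monotonicity constraint is what licenses decentralized execution: if Q_tot is non-decreasing in every Q_i, the joint argmax factorizes into per-agent argmaxes. The check below hard-codes positive, state-dependent mixing weights where QMIX would use a hypernetwork; the utilities and weight values are made up for illustration:

```python
import itertools

# Hypothetical per-agent utilities over two actions each.
q1 = {0: 1.0, 1: 3.0}
q2 = {0: 2.0, 1: 0.5}

def q_tot(v1, v2, state):
    """Monotone state-conditioned mixer. In QMIX the weights come from
    hypernetworks fed the global state; here they are hand-set and positive."""
    w1, w2, b = 0.5 + state, 1.0, -0.2
    assert w1 >= 0 and w2 >= 0  # monotonicity: dQ_tot/dQ_i >= 0
    return w1 * v1 + w2 * v2 + b

state = 1.0
# Exhaustive argmax over the joint action space...
joint = max(itertools.product(q1, q2),
            key=lambda a: q_tot(q1[a[0]], q2[a[1]], state))
# ...equals the per-agent argmaxes, because mixing is monotone in each Q_i.
decentralized = (max(q1, key=q1.get), max(q2, key=q2.get))
print(joint, decentralized)  # both (1, 0)
```

VDN is the special case with fixed weights w_i = 1 and no state conditioning, which is why QMIX can represent strictly more joint value functions while keeping the same cheap decentralized argmax.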

March 10, 2026 · 4 min · 664 words · codefrydev