Chapter 81: Multi-Agent Fundamentals

Learning objectives

- Model a two-player zero-sum game (e.g. Rock-Paper-Scissors) as a Dec-POMDP (Decentralized Partially Observable MDP) or an equivalent multi-agent framework.
- Define states, observations, actions, and rewards for each agent in the game.
- Explain the difference between centralized (one controller sees everything) and decentralized (each agent has its own observation and policy) formulations.
- Identify how the same game can be viewed both as a normal-form game (payoff matrix) and as a sequential Dec-POMDP (if we add structure).
- Relate multi-agent modeling to game AI (opponents, teammates) and trading (multiple market participants).

Concept and real-world RL ...
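The decentralized view of Rock-Paper-Scissors described above can be sketched as a tiny two-agent environment. This is an illustrative assumption, not code from the chapter: the class name `RPSEnv`, the payoff encoding, and the choice to let each agent observe only the opponent's previous action are all ours.

```python
import numpy as np

# Rock-Paper-Scissors as a two-player zero-sum normal-form game.
# Rows/columns index actions: 0 = rock, 1 = paper, 2 = scissors.
# PAYOFF[a1, a2] is player 1's reward; player 2's reward is its negation.
PAYOFF = np.array([
    [ 0, -1,  1],   # rock   vs rock / paper / scissors
    [ 1,  0, -1],   # paper
    [-1,  1,  0],   # scissors
])

class RPSEnv:
    """Decentralized formulation: each agent submits its own action,
    receives its own reward, and observes only the opponent's
    previous action (its local observation)."""
    def __init__(self):
        self.last_actions = (None, None)

    def step(self, a1, a2):
        r1 = int(PAYOFF[a1, a2])
        self.last_actions = (a1, a2)
        obs1, obs2 = a2, a1  # each agent sees only the other's last move
        return (obs1, obs2), (r1, -r1)

env = RPSEnv()
(obs1, obs2), (r1, r2) = env.step(0, 1)  # rock vs paper
print(r1, r2)  # -1 1 (rock loses to paper)
```

The same `PAYOFF` matrix is also the normal-form view of the game; the sequential `step` interface is what turns it into a (degenerate, one-step) Dec-POMDP.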

March 10, 2026 · 4 min · 673 words · codefrydev

Chapter 82: Game Theory Basics for RL

Learning objectives

- Compute the Nash equilibrium of a simple 2×2 game (e.g. the Prisoner's Dilemma) from its payoff matrix.
- Explain why independent learning (each agent learns its best response without knowing the other's policy) might converge to an outcome that is not a Nash equilibrium, or might not converge at all.
- Compare Nash-equilibrium payoffs with the payoffs that result from independent Q-learning or gradient-based learning in the same game.
- Identify the difference between cooperative, competitive, and mixed settings in terms of payoff structure.
- Relate game theory to game AI (opponent modeling) and trading (market equilibrium).

Concept and real-world RL ...
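Computing the pure-strategy Nash equilibria of a 2×2 game amounts to checking, for every action profile, that each action is a best response to the other. A minimal sketch, assuming the standard Prisoner's Dilemma payoffs (the function name `pure_nash_equilibria` and the specific numbers are our choices):

```python
import numpy as np

# Prisoner's Dilemma payoffs for the row player.
# Actions: 0 = cooperate, 1 = defect.
R1 = np.array([[-1, -3],
               [ 0, -2]])
R2 = R1.T  # symmetric game: column player's payoffs are the transpose

def pure_nash_equilibria(R1, R2):
    """Return all pure action profiles (a1, a2) where neither player
    can gain by unilaterally deviating."""
    eqs = []
    for a1 in range(R1.shape[0]):
        for a2 in range(R1.shape[1]):
            best1 = R1[a1, a2] >= R1[:, a2].max()  # a1 best-responds to a2
            best2 = R2[a1, a2] >= R2[a1, :].max()  # a2 best-responds to a1
            if best1 and best2:
                eqs.append((a1, a2))
    return eqs

print(pure_nash_equilibria(R1, R2))  # [(1, 1)] — mutual defection
```

Note that (defect, defect) yields payoff -2 each, worse than the (cooperate, cooperate) payoff of -1: the equilibrium is not the socially optimal outcome, which is the tension the chapter contrasts with what independent learners actually converge to.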

March 10, 2026 · 4 min · 672 words · codefrydev

Chapter 84: Centralized Training, Decentralized Execution (CTDE)

Learning objectives

- Explain the CTDE paradigm: during training, algorithms can use centralized information (e.g. the global state, all agents' actions) to learn better value functions or gradients; during execution, each agent uses only its local observation and policy (decentralized).
- Give a concrete example (e.g. QMIX, MADDPG, or a simple cooperative task) where the critic or value function uses the global state and the actor uses only its local observation.
- Explain why CTDE helps with non-stationarity: during training, the centralized critic sees the full state and the other agents' actions, so the environment is "stationary" from the critic's perspective (the joint action is known); each agent's policy can then be trained with this stable learning signal.
- Identify why decentralized execution is important for scalability and deployment (no need to communicate all observations at test time).
- Relate CTDE to game AI (team coordination) and robot navigation (multi-robot systems).

Concept and real-world RL ...
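The split between centralized training and decentralized execution shows up directly in the function signatures: the critic conditions on the global state and the joint action, while each actor sees only its local observation. A minimal linear sketch, assuming made-up dimensions and random weights (names like `centralized_value` are illustrative, not from QMIX or MADDPG):

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, STATE_DIM, N_AGENTS, N_ACTIONS = 4, 8, 2, 3

# One actor per agent (local obs -> action scores), one shared critic.
actor_weights = [rng.normal(size=(OBS_DIM, N_ACTIONS)) for _ in range(N_AGENTS)]
critic_weights = rng.normal(size=(STATE_DIM + N_AGENTS,))  # state + joint action

def act(agent_id, local_obs):
    """Decentralized execution: only the agent's own observation is used."""
    return int(np.argmax(local_obs @ actor_weights[agent_id]))

def centralized_value(global_state, joint_action):
    """Centralized training: the critic sees the full state and every
    agent's action, so its learning target does not drift as the other
    agents' policies change."""
    x = np.concatenate([global_state, np.asarray(joint_action, dtype=float)])
    return float(x @ critic_weights)

obs = [rng.normal(size=OBS_DIM) for _ in range(N_AGENTS)]
state = rng.normal(size=STATE_DIM)
joint = [act(i, obs[i]) for i in range(N_AGENTS)]
q = centralized_value(state, joint)  # used only during training
```

At deployment, only `act` is needed, which is why no inter-agent communication of observations is required at test time.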

March 10, 2026 · 4 min · 754 words · codefrydev

Chapter 90: Communication in MARL

Learning objectives

- Implement a simple communication protocol: each agent outputs a message (e.g. a vector) in addition to its action; the message is fed into the other agents' policies (e.g. as part of their observation at the next step).
- Train agents to solve a task that requires coordination (e.g. two agents must swap positions or colors, or meet at a target) using this communication.
- Compare with the same task without communication (each agent sees only its local observation) and report the improvement in return or success rate.
- Explain how learned communication can encode information (e.g. "I am going left") that helps coordination.
- Relate communication in MARL to dialogue (multi-turn interaction) and robot navigation (multi-robot signaling).

Concept and real-world RL ...

March 10, 2026 · 4 min · 729 words · codefrydev
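The message-passing interface described in the first objective can be sketched as follows. This is a hedged illustration under our own assumptions (two agents, fixed random linear policies, a one-step message delay); training the weights is the part the chapter asks the reader to do:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each agent emits (action, message); at the next step, the other
# agent's message is concatenated onto its local observation.
OBS_DIM, MSG_DIM, N_ACTIONS = 3, 2, 4
IN_DIM = OBS_DIM + MSG_DIM  # local obs + incoming message

W_act = [rng.normal(size=(IN_DIM, N_ACTIONS)) for _ in range(2)]
W_msg = [rng.normal(size=(IN_DIM, MSG_DIM)) for _ in range(2)]

def policy(agent_id, local_obs, incoming_msg):
    """Map (local obs, incoming message) to an action and an outgoing
    message; in a trained system W_act/W_msg would be learned jointly."""
    x = np.concatenate([local_obs, incoming_msg])
    action = int(np.argmax(x @ W_act[agent_id]))
    message = np.tanh(x @ W_msg[agent_id])  # bounded outgoing message
    return action, message

msgs = [np.zeros(MSG_DIM), np.zeros(MSG_DIM)]  # no messages at t = 0
for t in range(3):
    out = [policy(i, rng.normal(size=OBS_DIM), msgs[1 - i]) for i in range(2)]
    actions = [a for a, _ in out]
    msgs = [m for _, m in out]  # delivered to the other agent next step
```

The no-communication baseline in the third objective is the same loop with `MSG_DIM = 0`, which makes the return comparison a clean ablation.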

March 10, 2026 · 4 min · 729 words · codefrydev