Chapter 82: Game Theory Basics for RL

Learning objectives

Compute the Nash equilibrium of a simple 2×2 game (e.g. Prisoner’s Dilemma) from the payoff matrix.
Explain why independent learning (each agent learns its best response without knowing the other’s policy) might converge to an outcome that is not a Nash equilibrium, or might not converge at all.
Compare Nash equilibrium payoffs with the payoffs that result from independent Q-learning or gradient-based learning in the same game.
Identify the difference between cooperative, competitive, and mixed settings in terms of payoff structure.
Relate game theory to game AI (opponent modeling) and trading (market equilibrium).

Concept and real-world RL

Game theory provides solution concepts (e.g. Nash equilibrium) for multi-agent settings: at a Nash equilibrium, no agent can improve its payoff by unilaterally changing its strategy. A 2×2 matrix game (two agents, two actions each) is fully described by a small payoff matrix; its Nash equilibria can be pure (each agent plays one action) or mixed (agents randomize over actions). In independent learning, each agent updates its policy from its own experience without explicitly modeling the other. From each agent’s point of view the environment is then non-stationary (the other agent’s policy keeps changing), so learning may cycle or settle on a non-Nash outcome. Even when learning does reach a Nash equilibrium, that equilibrium can be Pareto-suboptimal: mutual defection in Prisoner’s Dilemma is the Nash equilibrium, although both agents would do better under mutual cooperation. In game AI and trading, understanding Nash equilibria and learning dynamics helps in designing and analyzing multi-agent systems.

Where you see this in practice: Nash equilibrium in games and economics; independent learning and self-play; convergence issues in MARL.

Illustration (2×2 payoff): In Prisoner’s Dilemma, the Nash equilibrium is (Defect, Defect) even though (Cooperate, Cooperate) gives a higher payoff to both. With the typical payoffs used in this chapter, the row player receives −1 for (C,C), −3 for (C,D), 0 for (D,C), and −2 for (D,D).
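The claim that (Defect, Defect) is the only pure-strategy equilibrium can be checked by brute force. The following is a minimal sketch, assuming the typical Prisoner’s Dilemma payoffs (−1,−1), (−3,0), (0,−3), (−2,−2); the function name pure_nash_equilibria and the 0 = Cooperate, 1 = Defect encoding are choices made for this illustration.

```python
from itertools import product

# Payoffs indexed as [row_action][col_action]; 0 = Cooperate, 1 = Defect.
# Values are the typical Prisoner's Dilemma payoffs (assumed for illustration).
ROW_PAYOFF = [[-1, -3], [0, -2]]   # agent 1 (row player)
COL_PAYOFF = [[-1, 0], [-3, -2]]   # agent 2 (column player)

def pure_nash_equilibria(row_payoff, col_payoff):
    """Return every joint action (r, c) where neither player gains by deviating alone."""
    equilibria = []
    for r, c in product(range(2), range(2)):
        # Row player compares its payoff with the alternative row, column fixed.
        row_best = all(row_payoff[r][c] >= row_payoff[alt][c] for alt in range(2))
        # Column player compares its payoff with the alternative column, row fixed.
        col_best = all(col_payoff[r][c] >= col_payoff[r][alt] for alt in range(2))
        if row_best and col_best:
            equilibria.append((r, c))
    return equilibria

print(pure_nash_equilibria(ROW_PAYOFF, COL_PAYOFF))  # [(1, 1)], i.e. (Defect, Defect)
```

The same function works for any 2×2 game, so it can be reused to count equilibria after changing the payoff tables.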

Exercise: Compute the Nash equilibrium of a simple 2×2 payoff matrix (e.g., Prisoner’s Dilemma). Explain why independent learning might converge to a different outcome.

Professor’s hints

Prisoner’s Dilemma: Rows = agent 1 (Cooperate, Defect), columns = agent 2. Typical payoffs: (C,C)=(−1,−1), (C,D)=(−3,0), (D,C)=(0,−3), (D,D)=(−2,−2). The unique Nash equilibrium is (D,D): Defect is a dominant strategy, so it is each player’s best response whether the other plays C or D. But (C,C) is better for both, which is the “dilemma.”
Computing Nash: For pure strategies, check each joint action: is it a best response for both? For mixed strategies (2×2), set up indifference equations: choose agent 1’s mix (p, 1−p) so that agent 2 is indifferent between its two actions, and agent 2’s mix (q, 1−q) so that agent 1 is indifferent between its two actions; then solve the two equations for p and q.
Independent learning: If both agents run Q-learning or gradient ascent, they may both learn to Defect (converge to (D,D)) because each best-responds to the other’s current policy, and Defect is the best response to either action. Nothing in the independent updates coordinates them on (C,C). Explain this in 2–3 sentences.
Optionally run a simple simulation: two Q-learning agents in the matrix game; log their policies over time. Do they converge to Nash?
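The indifference equations from the hint on computing Nash have a closed form in the 2×2 case. The sketch below is one way to write it, assuming A[r][c] is the row player’s payoff, B[r][c] the column player’s, p the probability that the row player picks action 0, and q the probability that the column player picks action 0. The function name mixed_nash_2x2 is hypothetical, and the function only looks for a fully mixed equilibrium: it returns None when the equations have no unique solution strictly between 0 and 1.

```python
from fractions import Fraction

def mixed_nash_2x2(A, B):
    """Fully mixed equilibrium of a 2x2 game from the indifference conditions.

    Column player indifferent: p*B[0][0] + (1-p)*B[1][0] = p*B[0][1] + (1-p)*B[1][1]
    Row player indifferent:    q*A[0][0] + (1-q)*A[0][1] = q*A[1][0] + (1-q)*A[1][1]
    """
    denom_p = B[0][0] - B[0][1] - B[1][0] + B[1][1]
    denom_q = A[0][0] - A[0][1] - A[1][0] + A[1][1]
    if denom_p == 0 or denom_q == 0:
        return None  # indifference equations have no unique solution
    p = Fraction(B[1][1] - B[1][0], denom_p)
    q = Fraction(A[1][1] - A[0][1], denom_q)
    if not (0 < p < 1 and 0 < q < 1):
        return None  # solution is not a valid fully mixed strategy
    return p, q

# Matching Pennies: row player wins when actions match (integer payoffs assumed).
A = [[1, -1], [-1, 1]]
B = [[-1, 1], [1, -1]]
p, q = mixed_nash_2x2(A, B)
print(p, q)  # 1/2 1/2

# Verify by checking best responses: each player's two actions earn the same.
row_payoffs = [q * A[a][0] + (1 - q) * A[a][1] for a in range(2)]
col_payoffs = [p * B[0][a] + (1 - p) * B[1][a] for a in range(2)]
print(row_payoffs[0] == row_payoffs[1], col_payoffs[0] == col_payoffs[1])  # True True
```

Using Fraction keeps the arithmetic exact, so the equality check at the end is not affected by floating-point rounding. For the Prisoner’s Dilemma payoffs above, the function returns None, which is consistent with Defect being a dominant strategy.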

Common pitfalls

Payoff order: In a payoff matrix, (row, col) often means (agent1, agent2). Check the convention (who is row, who is column) and stick to it.
Multiple equilibria: Some games have more than one Nash equilibrium; mention which one(s) you found.
Independent learning ≠ Nash: Independent learners do not necessarily converge to Nash; they may cycle or converge to a different outcome. The exercise asks you to explain this possibility.
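To make the cycling case concrete, here is a minimal sketch using simultaneous best-response dynamics on Matching Pennies. This is a deliberately simplified stand-in for independent learners (no Q-values, no exploration, no learning rate): each player simply plays the best response to the other’s previous action. The payoff tables and the 0 = Heads, 1 = Tails encoding are assumptions made for this illustration.

```python
# Matching Pennies: the row player gets +1 when actions match, the column
# player gets +1 when they differ. Actions: 0 = Heads, 1 = Tails.
ROW = [[1, -1], [-1, 1]]
COL = [[-1, 1], [1, -1]]

def best_response_step(joint):
    """Both players simultaneously best-respond to the other's previous action."""
    r, c = joint
    new_r = max(range(2), key=lambda a: ROW[a][c])  # row reacts to old column action
    new_c = max(range(2), key=lambda a: COL[r][a])  # column reacts to old row action
    return new_r, new_c

def trajectory(start, steps):
    """Joint actions visited when starting at `start` and iterating `steps` times."""
    path = [start]
    for _ in range(steps):
        path.append(best_response_step(path[-1]))
    return path

print(trajectory((0, 0), 4))
# [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
```

The joint action returns to its starting point after four steps and never settles, because no pure joint action is a mutual best response in this game.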

Worked solution (warm-up: independent learners)

Key idea: Independent learners each run a single-agent algorithm (e.g. Q-learning) and treat others as part of the environment. The “environment” is non-stationary because other agents are learning too. So convergence guarantees for single-agent RL do not apply; we may get cycles (e.g. rock-paper-scissors) or convergence to a non-Nash outcome. Centralized training with decentralized execution (CTDE) or opponent modeling can help.
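The key idea can be tried out with a small simulation. The sketch below assumes two stateless, epsilon-greedy Q-learners in the Prisoner’s Dilemma with the typical payoffs from the hints. Because the game is a single step with no next state, the Q-learning update reduces to moving the value of the chosen action toward the received reward. The function name run_independent_q_learning and the values alpha=0.1, epsilon=0.1, seed=0 are assumptions for illustration, not tuned settings. Whether a given run settles on (D,D) can depend on these choices, so the script reports empirical defect frequencies instead of asserting an outcome.

```python
import random

# Payoff to "me" for (my_action, other_action); 0 = Cooperate, 1 = Defect.
# This Prisoner's Dilemma is symmetric, so both agents share the table.
PAYOFF = [[-1, -3], [0, -2]]

def run_independent_q_learning(episodes=5000, alpha=0.1, epsilon=0.1, seed=0):
    """Two stateless epsilon-greedy Q-learners; hyperparameters are assumptions."""
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]   # q[agent][action]
    history = []                   # joint action chosen in each episode

    def choose(agent):
        if rng.random() < epsilon:
            return rng.randrange(2)                      # explore uniformly
        return max(range(2), key=lambda a: q[agent][a])  # greedy; ties pick Cooperate

    for _ in range(episodes):
        a0, a1 = choose(0), choose(1)                    # simultaneous choice
        r0, r1 = PAYOFF[a0][a1], PAYOFF[a1][a0]
        # Each agent updates only its own value for the action it took and
        # never looks at the other agent's policy or Q-values.
        q[0][a0] += alpha * (r0 - q[0][a0])
        q[1][a1] += alpha * (r1 - q[1][a1])
        history.append((a0, a1))
    return q, history

q_values, history = run_independent_q_learning()
window = 500
for start in range(0, len(history), 2 * window):
    chunk = history[start:start + window]
    rates = [sum(step[i] for step in chunk) / len(chunk) for i in range(2)]
    print(f"episodes {start}-{start + len(chunk) - 1}: P(Defect) = {rates}")
print("final Q-values:", q_values)
```

Rerunning with different seeds, learning rates, or exploration rates is a quick way to see how sensitive the learned behavior is to the non-stationarity described above.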

Extra practice

Warm-up: In the Prisoner’s Dilemma, what is agent 1’s best response if agent 2 plays Cooperate? If agent 2 plays Defect? So what is the Nash equilibrium?
Coding: Implement two independent Q-learning agents in a 2×2 matrix game (Prisoner’s Dilemma). Run for 5000 episodes; each episode = one simultaneous action choice. Plot each agent’s probability of Defect over time. Do they converge to (D,D)?
Challenge: Find the mixed-strategy Nash equilibrium of a 2×2 game (e.g. Matching Pennies, with payoffs (1,−1) when the two actions match and (−1,1) when they differ). Compute it by hand and verify with a small script that checks best responses.
Variant: Change the Prisoner’s Dilemma payoffs to a Stag Hunt (cooperation is individually risky but jointly optimal). How many Nash equilibria exist, and which one do two independent Q-learners converge to? Does initialization affect which equilibrium is reached?
Debug: Two Q-learning agents playing Matching Pennies alternate between (H,H) and (T,T) but never converge. The learning rate α=0.5 is high enough to make the Q-values oscillate strongly; note also that Matching Pennies has no pure-strategy Nash equilibrium, so greedy play cannot settle on a single joint action. Describe how to use a decaying learning rate schedule and what convergence criterion you would use for a cyclic game.
Conceptual: A Nash equilibrium is a fixed point where no agent wants to deviate given the other’s strategy. Why is this a fixed point rather than an optimum? Describe a game where the Nash equilibrium is Pareto-suboptimal — both agents could be better off, but neither would unilaterally deviate.