Learning objectives
- Implement self-play in a simple game (e.g. Tic-Tac-Toe): two copies of the same agent (or two agents with shared or separate parameters) play against each other; update the policy from the outcomes.
- Update both agents (or the single policy) so that they improve against the current opponent (themselves).
- Track an ELO rating (or win rate vs a fixed baseline) as training progresses to measure improvement.
- Explain why self-play can lead to stronger policies (the opponent is always at the current level) and potential pitfalls (cycling, forgetting past strategies).
- Relate self-play to game AI (AlphaGo, Dota) and dialogue (negotiation, debate).
Concept and real-world RL
Self-play means training an agent by having it (or multiple copies of it) play against itself. Both sides are updated from the game outcome (e.g. a policy-gradient step in which each side is rewarded according to its own win or loss). This creates a natural curriculum: the opponent is always at the agent’s current level, so the agent is constantly challenged. ELO (or a similar rating) tracks strength over time from game results against a fixed set of checkpoints or a population. In game AI (AlphaGo, Dota, StarCraft), self-play and league training (maintaining a population of opponents) have been key to superhuman performance. In dialogue, self-play can train negotiation or debate agents.
Where you see this in practice: AlphaGo self-play; OpenAI Five; league training in multi-agent games.
Illustration (ELO over self-play): As training progresses, the agent’s ELO (or win rate vs random) typically increases.
[Chart: ELO (or win rate × 100) over self-play games.]
Exercise: Implement self-play in a simple game like Tic-Tac-Toe. Let two copies of an agent play against each other, and update both. Track the ELO rating as training progresses.
Professor’s hints
- Tic-Tac-Toe: 3×3 grid; two players alternate; state = board (e.g. 9 values: 0 empty, 1 P1, 2 P2). Actions = the 9 cells (only empty cells are legal). Reward: +1 win, -1 loss, 0 draw. The game is terminal when a player wins or the board is full (draw).
- Self-play: One policy π(a|s) for both players (symmetric). When it’s P1’s turn, use π(·|s); when P2’s turn, use π(·|s’) where s’ might be from P2’s view (e.g. swap 1 and 2 in the board so the policy sees “I am always player 1”). After a game, compute return for each player (e.g. +1 for winner, -1 for loser) and update π with policy gradient (REINFORCE or PPO).
- ELO: Maintain a rating R. After a game, update R using the result and the opponent’s rating (if you have a fixed opponent or a pool). Or track win rate vs a random agent or an older checkpoint every N games. Plot R or win rate vs games.
- Update both: If using one shared π, one gradient step uses both players’ trajectories (with opposite signs for reward). If using two separate policies, update both with their respective rewards.
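The hints above can be sketched as follows. This is a minimal illustration, not a reference solution: it assumes a tabular softmax policy (one row of 9 logits per board state) trained with plain REINFORCE without a baseline, and the names `canonical`, `TabularSoftmaxPolicy`, `self_play_game`, and `train`, as well as the learning rate, are hypothetical choices made for this example. The board uses the 0/1/2 encoding from the first hint, and `canonical` swaps 1 and 2 so the shared policy always sees itself as player 1.

```python
import math
import random
from collections import defaultdict

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 1 or 2 if that player has three in a row, else 0."""
    for a, b, c in WIN_LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def canonical(board, player):
    """Board from the mover's view: own marks become 1, opponent's become 2."""
    if player == 1:
        return tuple(board)
    return tuple({0: 0, 1: 2, 2: 1}[v] for v in board)

class TabularSoftmaxPolicy:
    """Assumed policy class: softmax over the legal cells of each state."""
    def __init__(self, lr=0.1):
        self.lr = lr  # learning rate (illustrative value)
        self.logits = defaultdict(lambda: [0.0] * 9)

    def probs(self, state):
        legal = [i for i in range(9) if state[i] == 0]
        logits = self.logits[state]
        m = max(logits[i] for i in legal)  # subtract max for stability
        exps = {i: math.exp(logits[i] - m) for i in legal}
        z = sum(exps.values())
        return {i: e / z for i, e in exps.items()}

    def act(self, state, rng):
        p = self.probs(state)
        actions = list(p)
        return rng.choices(actions, weights=[p[a] for a in actions])[0]

    def reinforce_update(self, trajectory, ret):
        # d/d logit_a of log pi(action|state) = 1[a == action] - pi(a|state)
        for state, action in trajectory:
            p = self.probs(state)
            for a, pa in p.items():
                grad = (1.0 if a == action else 0.0) - pa
                self.logits[state][a] += self.lr * ret * grad

def self_play_game(policy, rng):
    """Play one game with the shared policy on both sides."""
    board = [0] * 9
    trajectories = {1: [], 2: []}
    player = 1
    while True:
        state = canonical(board, player)
        action = policy.act(state, rng)
        trajectories[player].append((state, action))
        board[action] = player
        w = winner(board)
        if w != 0 or 0 not in board:
            return trajectories, w  # w == 0 means draw
        player = 3 - player

def train(policy, n_games, seed=0):
    rng = random.Random(seed)
    for _ in range(n_games):
        trajectories, w = self_play_game(policy, rng)
        for player in (1, 2):
            # +1 for the winner, -1 for the loser, 0 for both on a draw
            ret = 0.0 if w == 0 else (1.0 if w == player else -1.0)
            policy.reinforce_update(trajectories[player], ret)
```

Calling `train(TabularSoftmaxPolicy(), 5000)` runs self-play with one shared policy; both players’ trajectories feed the same table, with opposite-sign returns. A neural policy with PPO would replace the tabular class but keep the same game loop.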
Common pitfalls
- State representation: For symmetry, each player should see the board from its own perspective (e.g. “my pieces = 1, opponent = -1”, with the signs swapped when the other player is to move). Otherwise the policy may learn asymmetric play.
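- Draw handling: Tic-Tac-Toe has many draws once both sides play well. With a draw rewarded as 0, plain REINFORCE without a baseline gets no gradient from a drawn game, because the return multiplying the gradient is 0. If learning stalls, consider a small positive reward for a draw so the policy still learns to prefer a draw over a loss.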
- ELO implementation: Standard ELO: expected_score = 1/(1+10^((R_opp-R_self)/400)); update R_self += K * (actual_score - expected_score), where actual_score is 1 for a win, 0.5 for a draw, and 0 for a loss. For a single agent, you can track R vs a fixed random agent whose rating is held constant.
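The ELO formulas in the last pitfall translate directly into code. The sketch below is illustrative: `K = 32` and the starting rating of 1000 are assumed values (the text does not fix them), and `track_rating` is a hypothetical helper for the single-agent case where the opponent’s rating stays constant.

```python
def expected_score(r_self, r_opp):
    """Expected score of `self` against `opp` under the standard ELO model."""
    return 1.0 / (1.0 + 10.0 ** ((r_opp - r_self) / 400.0))

def elo_update(r_self, r_opp, actual_score, k=32.0):
    """New rating after one game; actual_score is 1 (win), 0.5 (draw), 0 (loss)."""
    return r_self + k * (actual_score - expected_score(r_self, r_opp))

def track_rating(scores, r_start=1000.0, r_opp=1000.0, k=32.0):
    """Rating history of one agent against an opponent with a fixed rating."""
    history = [r_start]
    for s in scores:
        history.append(elo_update(history[-1], r_opp, s, k))
    return history
```

For example, `track_rating([1, 1, 0.5])` returns the rating after each of three evaluation games against the fixed opponent; plotting that list against the game count gives the training curve.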
Worked solution (warm-up: ELO)
Key idea: ELO rates agents by pairwise comparison: the expected score of A vs B is \(1/(1+10^{(R_B - R_A)/400})\). After a game, update \(R_A \leftarrow R_A + K (\text{actual} - \text{expected})\), where actual is 1 for a win, 0.5 for a draw, and 0 for a loss. This gives a single number per agent that reflects strength, so we can rank many agents without a full round-robin. ELO is used in games and to track progress in self-play (e.g. AlphaStar). In multi-agent RL (MARL) we can report ELO vs a baseline pool.
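As a worked instance of this update, assume \(R_A = 1200\), \(R_B = 1000\), and \(K = 32\) (illustrative values, not taken from the exercise):

\[
\text{expected}_A = \frac{1}{1 + 10^{(1000 - 1200)/400}} = \frac{1}{1 + 10^{-0.5}} \approx 0.760
\]

\[
\text{A wins: } R_A \leftarrow 1200 + 32\,(1 - 0.760) \approx 1207.7
\qquad
\text{A loses: } R_A \leftarrow 1200 + 32\,(0 - 0.760) \approx 1175.7
\]

The favourite gains little from an expected win and loses much more from an upset, which is what lets ratings converge towards relative strength.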
Extra practice
- Warm-up: Why might an agent trained only against a random opponent fail against a copy of itself? Why does self-play help?
- Coding: Implement Tic-Tac-Toe and self-play with a single shared policy (REINFORCE or PPO). Train for 5k games. Every 500 games, evaluate: win rate vs random, vs previous checkpoint. Plot win rate vs random and vs self (previous) over training.
- Challenge: Implement league training: keep a pool of K past policies (checkpoints). Each training game, with probability p play against the current policy and with probability 1-p play against a random choice from the pool. Does this improve robustness and ELO compared to pure self-play?
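For the league-training challenge, the opponent-selection step might look like the sketch below. It covers only sampling and pool maintenance, not training; `sample_opponent` and `add_checkpoint` are hypothetical names, `max_size` plays the role of K from the challenge, and evicting the oldest checkpoint first is an assumption (the challenge does not say how the pool is pruned).

```python
import random

def sample_opponent(current, pool, p_current, rng):
    """With probability p_current return the current policy,
    otherwise a uniformly random checkpoint from the pool."""
    if not pool or rng.random() < p_current:
        return current  # an empty pool falls back to pure self-play
    return rng.choice(pool)

def add_checkpoint(pool, snapshot, max_size):
    """Append a frozen snapshot; drop the oldest once the pool exceeds max_size."""
    pool.append(snapshot)
    if len(pool) > max_size:
        pool.pop(0)
```

Snapshots should be frozen copies (e.g. made with `copy.deepcopy`) so that later updates to the current policy do not change the pool. Only the current policy is updated after each game; pool members act as fixed opponents.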