Chapter 89: Self-Play and League Training
Learning objectives

- Implement self-play in a simple game (e.g. Tic-Tac-Toe): two copies of the same agent (with shared or separate parameters) play against each other, and the policy is updated from the game outcomes.
- Update both agents (or the single shared policy) so that they improve against the current opponent, i.e. against themselves.
- Track an Elo rating (or win rate against a fixed baseline) as training progresses to measure improvement.
- Explain why self-play can produce stronger policies (the opponent is always exactly at the agent's current level) and its potential pitfalls (strategy cycling, forgetting how to beat past strategies).
- Relate self-play to game-playing AI (AlphaGo, OpenAI Five for Dota 2) and to dialogue tasks (negotiation, debate).

Concept and real-world RL ...
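The self-play objective above can be sketched concretely. Below is a minimal, self-contained example (not the chapter's reference implementation): a single tabular value function over Tic-Tac-Toe afterstates is shared by both players, games are played epsilon-greedily against itself, and every visited afterstate is nudged toward the final outcome (a simple Monte Carlo update). All names (`V`, `choose`, `self_play_episode`, `play_vs_random`) and the hyperparameters are illustrative choices.

```python
import random

# Winning lines on a 3x3 board indexed 0..8.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal_moves(board):
    return [i for i, cell in enumerate(board) if cell == " "]

V = {}  # afterstate values from X's perspective (+1 means X wins)

def choose(board, mark, eps):
    """Epsilon-greedy move selection over afterstate values."""
    moves = legal_moves(board)
    if random.random() < eps:
        return random.choice(moves)
    sign = 1 if mark == "X" else -1  # X maximizes the value, O minimizes it
    random.shuffle(moves)            # random tie-breaking among equal values
    best_v, best_m = None, None
    for m in moves:
        board[m] = mark
        v = sign * V.get("".join(board), 0.0)
        board[m] = " "
        if best_v is None or v > best_v:
            best_v, best_m = v, m
    return best_m

def self_play_episode(eps=0.1, alpha=0.2):
    """One game of the policy against itself, then a Monte Carlo update."""
    board, mark, visited = [" "] * 9, "X", []
    result = 0.0
    while True:
        board[choose(board, mark, eps)] = mark
        visited.append("".join(board))
        w = winner(board)
        if w:
            result = 1.0 if w == "X" else -1.0
            break
        if not legal_moves(board):
            break  # draw
        mark = "O" if mark == "X" else "X"
    for s in visited:  # pull every visited afterstate toward the outcome
        V[s] = V.get(s, 0.0) + alpha * (result - V.get(s, 0.0))
    return result

def play_vs_random(agent_mark="X"):
    """Evaluate the greedy policy against a uniform-random opponent."""
    board, mark = [" "] * 9, "X"
    while True:
        if mark == agent_mark:
            m = choose(board, mark, eps=0.0)
        else:
            m = random.choice(legal_moves(board))
        board[m] = mark
        w = winner(board)
        if w:
            return 1 if w == agent_mark else -1
        if not legal_moves(board):
            return 0
        mark = "O" if mark == "X" else "X"

random.seed(0)
for _ in range(20000):
    self_play_episode()
games = [play_vs_random() for _ in range(500)]
print("win rate vs random:", sum(g == 1 for g in games) / 500)
```

Because both players read and write the same table `V`, the opponent is always exactly as strong as the learner, which is the core property of self-play. The final evaluation against a fixed random baseline is one of the progress measures mentioned in the objectives.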
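For the rating-tracking objective, a standard Elo update can be written in a few lines. This is a generic sketch of the Elo formulas (expected score from the rating difference, then a K-factor update), not tied to any particular library; the K-factor of 32 is a common but arbitrary choice.

```python
def expected_score(r_a, r_b):
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated ratings after one game; score_a is 1, 0.5, or 0."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two equally rated players; A wins and gains k/2 = 16 points.
print(elo_update(1200.0, 1200.0, 1.0))  # (1216.0, 1184.0)
```

In a self-play setting one would typically rate periodic snapshots of the policy against each other (or against a fixed baseline) and plot the rating over training steps; the total rating in a closed pool is conserved by the symmetric update.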