Chapter 55: AlphaZero Architecture

Learning objectives

- Implement a simplified AlphaZero for tic-tac-toe: a neural network that outputs a policy (move probabilities) and a value (expected outcome).
- Use the network inside MCTS: the policy provides priors during expansion, and the value evaluates leaf states (replacing random rollouts).
- Train via self-play: generate games, train the network on (state, policy target, value target) tuples, and repeat.

Concept and real-world RL

AlphaZero combines MCTS with a neural network: the network provides a prior over moves and a value for leaf states, so MCTS does not need random rollouts. Training is self-play: the current network plays against itself, and the MCTS search policy and game outcome become the network's targets. In game AI (chess, Go, shogi), AlphaZero achieves superhuman play. The same idea (planning with a learned model/value) appears in robot planning and dialogue. ...

March 10, 2026 · 3 min · 460 words · codefrydev

Chapter 89: Self-Play and League Training

Learning objectives

- Implement self-play in a simple game (e.g. tic-tac-toe): two copies of the same agent (with shared or separate parameters) play against each other, and the policy is updated from the outcomes.
- Update both agents (or the single shared policy) so that they improve against the current opponent (themselves).
- Track an Elo rating (or win rate against a fixed baseline) as training progresses to measure improvement.
- Explain why self-play can lead to stronger policies (the opponent is always at the current skill level) and its potential pitfalls (cycling, forgetting past strategies).
- Relate self-play to game AI (AlphaGo, Dota) and dialogue (negotiation, debate).

Concept and real-world RL ...
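The Elo tracking mentioned above can be sketched with the standard logistic Elo update. The game loop here is a placeholder (a coin flip biased toward the learning agent) rather than real self-play games, and the names `expected_score` and `elo_update` are illustrative; the point is only that an agent genuinely winning more than half its games drifts above the baseline's rating.

```python
import random

def expected_score(r_a, r_b):
    # Expected score of player A against player B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=32):
    # score_a: 1 for an A win, 0.5 for a draw, 0 for a loss.
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

# Placeholder evaluation loop (assumption): the agent wins 75% of
# games against the baseline, so its rating should rise above it.
random.seed(0)
agent, baseline = 1000.0, 1000.0
for _ in range(200):
    result = 1.0 if random.random() < 0.75 else 0.0  # fake game outcome
    agent, baseline = elo_update(agent, baseline, result)
print(agent > baseline)  # prints True
```

In practice the baseline opponent is usually a frozen snapshot with a fixed rating, and `result` comes from actually playing out games; periodic evaluation against such fixed snapshots is also one guard against the cycling and forgetting pitfalls noted above.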

March 10, 2026 · 4 min · 741 words · codefrydev