Chapter 55: AlphaZero Architecture
Learning objectives

- Implement a simplified AlphaZero for tic-tac-toe: a neural network that outputs a policy (move probabilities) and a value (expected outcome).
- Use the network inside MCTS: the policy provides priors during expansion, and the value evaluates leaf states (replacing random rollouts).
- Train via self-play: generate games, train the network on (state, policy target, value target) tuples, and repeat.

Concept and real-world RL

AlphaZero combines MCTS with a neural network: the network supplies a prior over moves and a value estimate for leaf states, so MCTS needs no random rollouts. Training is pure self-play: the current network plays against itself, and the MCTS visit distribution and the game outcome become the policy and value targets for the next round of training. In game AI (chess, Go, shogi), AlphaZero achieves superhuman play, and the same idea of planning with a learned model and value function appears in robot planning and dialogue systems. ...
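The two pieces described above can be sketched in a few lines. Below is a minimal, untrained two-headed network for a 9-cell tic-tac-toe board, together with the PUCT selection score that MCTS uses to combine the network's prior with visit statistics. All names (`PolicyValueNet`, `puct_score`, the hidden size, and the `c_puct` constant) are illustrative choices, not part of any fixed API.

```python
import numpy as np

rng = np.random.default_rng(0)

class PolicyValueNet:
    """Tiny two-headed network for tic-tac-toe (illustrative, untrained).

    Input: flat 9-vector board (+1 = current player, -1 = opponent, 0 = empty).
    Outputs: policy = softmax over the 9 moves, value in [-1, 1].
    """
    def __init__(self, hidden=32):
        # Shared trunk, then separate policy and value heads.
        self.W1 = rng.normal(0.0, 0.1, (9, hidden))
        self.b1 = np.zeros(hidden)
        self.Wp = rng.normal(0.0, 0.1, (hidden, 9))  # policy head
        self.Wv = rng.normal(0.0, 0.1, (hidden, 1))  # value head

    def forward(self, board):
        h = np.tanh(board @ self.W1 + self.b1)
        logits = h @ self.Wp
        policy = np.exp(logits - logits.max())       # stable softmax
        policy /= policy.sum()
        value = float(np.tanh(h @ self.Wv))          # squashed to [-1, 1]
        return policy, value

def puct_score(q, prior, n_parent, n_child, c_puct=1.5):
    """AlphaZero's selection rule: exploit the mean value Q, and explore
    moves whose prior is high but whose visit count is still low."""
    return q + c_puct * prior * np.sqrt(n_parent) / (1 + n_child)
```

During MCTS selection, each child move is scored with `puct_score` and the argmax is followed; an unvisited move with a strong prior can outrank an already-visited move with a mediocre value, which is exactly how the network's policy steers the search.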