Chapter 54: Monte Carlo Tree Search (MCTS)

Learning objectives

- Implement MCTS for a small game (e.g. tic-tac-toe): selection (UCT), expansion, simulation (rollout), backpropagation.
- Use UCT (Upper Confidence bounds applied to Trees) for node selection: \(\frac{Q(s,a)}{N(s,a)} + c \sqrt{\frac{\log N(s)}{N(s,a)}}\).
- Evaluate win rate against a random opponent.

Concept and real-world RL

MCTS builds a search tree by repeatedly selecting a leaf (via UCT), expanding it, running a random rollout to the end of the game, and backpropagating the result. It does not require a learned value function (though it can use one, as in AlphaZero). In game AI (chess, Go, tic-tac-toe), MCTS is used for planning and action selection; it balances exploration (trying under-visited moves) and exploitation (favoring moves with good average outcomes). ...
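The four phases above can be sketched in a minimal tic-tac-toe MCTS. This is an illustrative sketch, not the chapter's implementation; all names (`Node`, `uct_select`, `mcts`, etc.) are assumptions.

```python
import math, random

WIN_LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] and board[a] == board[b] == board[c]:
            return board[a]
    return None

def moves(board):
    return [i for i, v in enumerate(board) if v is None]

class Node:
    def __init__(self, board, player, parent=None):
        self.board, self.player = board, player   # player to move at this node
        self.parent, self.children = parent, {}
        self.N, self.Q = 0, 0.0                   # visits, total reward (parent's view)

def uct_select(node, c=1.4):
    # UCT: Q/N + c * sqrt(log(N_parent) / N_child)
    return max(node.children.values(),
               key=lambda ch: ch.Q / ch.N + c * math.sqrt(math.log(node.N) / ch.N))

def rollout(board, player):
    # Play uniformly random moves until the game ends; return winner or None (draw).
    while True:
        w = winner(board)
        if w:
            return w
        ms = moves(board)
        if not ms:
            return None
        m = random.choice(ms)
        board = board[:m] + (player,) + board[m+1:]
        player = 'O' if player == 'X' else 'X'

def mcts(root, iters=2000):
    for _ in range(iters):
        node = root
        # 1. Selection: descend while the node is fully expanded.
        while node.children and len(node.children) == len(moves(node.board)):
            node = uct_select(node)
        # 2. Expansion: add one untried child if the node is non-terminal.
        if winner(node.board) is None and moves(node.board):
            untried = [m for m in moves(node.board) if m not in node.children]
            m = random.choice(untried)
            b = node.board[:m] + (node.player,) + node.board[m+1:]
            child = Node(b, 'O' if node.player == 'X' else 'X', parent=node)
            node.children[m] = child
            node = child
        # 3. Simulation: random rollout from the new node.
        w = rollout(node.board, node.player)
        # 4. Backpropagation: credit each node from the mover's perspective.
        while node:
            node.N += 1
            mover = 'O' if node.player == 'X' else 'X'
            node.Q += 1.0 if w == mover else 0.5 if w is None else 0.0
            node = node.parent
    # Robust-child rule: pick the most-visited move at the root.
    return max(root.children.items(), key=lambda kv: kv[1].N)[0]

random.seed(0)  # for reproducibility of this sketch
root = Node((None,) * 9, 'X')
best = mcts(root, 2000)  # with enough iterations, X's opening is usually the center
```

The final move choice uses the robust-child rule (most visits) rather than the UCT score, which is the common convention since visit counts are less noisy than mean values.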

March 10, 2026 · 3 min · 444 words · codefrydev

Chapter 55: AlphaZero Architecture

Learning objectives

- Implement a simplified AlphaZero for tic-tac-toe: a neural network that outputs a policy (move probabilities) and a value (expected outcome).
- Use the network inside MCTS: the policy provides priors at expansion, and the value evaluates leaves (replacing the random rollout).
- Train via self-play: generate games, train the network on (state, policy target, value target) tuples, repeat.

Concept and real-world RL

AlphaZero combines MCTS with a neural network: the network provides a prior over moves and a value for leaf states, so MCTS does not need random rollouts. Training is self-play: the current network plays against itself, and the MCTS visit distribution and the game outcome become the policy and value targets for the network. In game AI (chess, Go, shogi), AlphaZero achieves superhuman play. The same idea (planning with a learned model and value) appears in robot planning and dialogue systems. ...
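The way the network's policy enters node selection can be sketched with the PUCT rule that AlphaZero uses in place of plain UCT: the exploration term is scaled by the prior P(a|s). A minimal sketch, assuming illustrative names (`puct_score`, `select_action`) and hand-made priors in place of a real network:

```python
import math

def puct_score(q, n_child, prior, n_parent, c_puct=1.5):
    """PUCT: mean value Q plus an exploration bonus scaled by the network prior."""
    return q + c_puct * prior * math.sqrt(n_parent) / (1 + n_child)

def select_action(stats, priors, n_parent, c_puct=1.5):
    # stats: {action: (total_value, visit_count)}; priors: {action: P(a|s)}
    def score(a):
        w, n = stats.get(a, (0.0, 0))
        q = w / n if n else 0.0
        return puct_score(q, n, priors[a], n_parent, c_puct)
    return max(priors, key=score)

# Early in search, an unvisited move with a high prior beats a visited one:
priors = {0: 0.7, 1: 0.2, 2: 0.1}   # hypothetical network output
stats = {1: (0.9, 1)}               # action 1 visited once, value 0.9
a = select_action(stats, priors, n_parent=2)  # → 0 (high-prior unvisited move)
```

During self-play, the root visit counts (normalized) become the policy target and the final game outcome becomes the value target for every state in the game, which is what the training loop in the objectives list consumes.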

March 10, 2026 · 3 min · 460 words · codefrydev