Learning objectives

  • Implement MCTS for a small game (e.g. tic-tac-toe): selection (UCT), expansion, simulation (rollout), backpropagation.
  • Use UCT (Upper Confidence bound for Trees) for node selection: \(\frac{Q(s,a)}{N(s,a)} + c \sqrt{\frac{\log N(s)}{N(s,a)}}\).
  • Evaluate win rate against a random opponent.
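The UCT formula above can be packaged as a small helper. This is a sketch; the function name and signature are my own, and `c ≈ 1.4` is the conventional default, not a requirement:

```python
import math

def uct_score(q, n, n_parent, c=1.4):
    """UCT value for a child node: exploitation term Q/N plus exploration bonus.

    q:        total reward Q(s, a) accumulated through this child
    n:        visit count N(s, a) of the child
    n_parent: visit count N(s) of the parent
    """
    if n == 0:
        return float("inf")  # unvisited children are always tried first
    return q / n + c * math.sqrt(math.log(n_parent) / n)
```

Returning infinity for unvisited children guarantees every legal action is sampled at least once before the bonus term starts trading off exploration against exploitation.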

Concept and real-world RL

MCTS builds a search tree by repeatedly selecting a leaf (UCT), expanding it, doing a random rollout to the end, and backpropagating the result. It does not require a learned value function (though it can use one, as in AlphaZero). In game AI (chess, Go, tic-tac-toe), MCTS is used for planning and action selection; it balances exploration (trying undervisited moves) and exploitation (favoring good moves).
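The four-phase loop above can be summarized in pseudocode (`Node`, `expand`, and the rollout/backprop steps are placeholders for your own implementation):

```
repeat n_simulations times:
    node ← root
    while node has children:                  # 1. selection
        node ← child of node maximizing UCT
    if node is not terminal:                  # 2. expansion
        expand node (one child per legal action)
        node ← one of the new children
    result ← random rollout from node         # 3. simulation
    update visits and values from node        # 4. backpropagation
        back up to the root
play the most-visited child of the root
```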

Where you see this in practice: AlphaGo, AlphaZero, and many game-playing agents use MCTS (with or without a neural network).

Illustration (MCTS win rate): as the number of simulations per move increases, the MCTS win rate against a random player typically improves. (Chart: win rate vs. simulations per move.)


Exercise: Implement MCTS for a tic-tac-toe environment. Use UCT for node selection. Let the algorithm play against a random opponent. Evaluate its win rate.

Professor’s hints

  • Selection: from the root, follow the child that maximizes UCT until you reach a leaf (unexpanded node). Expand the leaf (add children for all legal actions), run one random rollout from a newly added child, and backpropagate the outcome (win/loss/draw) to every node on the path.
  • UCT: Q = total reward from that action, N = visit count. Use c ≈ 1.4 or tune. Run many simulations (e.g. 1000) per move, then pick the most visited child (or highest Q/N).
  • Win rate: play 100 games vs random; MCTS should win almost all (or draw). Report win/draw/loss.
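Pulling the hints together, here is a minimal end-to-end sketch. All names are my own; the board is a flat list of 9 ints (+1, -1, 0), and rollout rewards are scored 1 / 0.5 / 0 from the perspective of the player who just moved, which is equivalent to the ±1 scheme for UCT purposes:

```python
import math
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return +1 or -1 if that side has three in a row, else 0."""
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def legal_moves(board):
    return [i for i, v in enumerate(board) if v == 0]

class Node:
    def __init__(self, board, player, move=None, parent=None):
        self.board = board        # position at this node
        self.player = player      # side to move here (+1 or -1)
        self.move = move          # move that led to this node
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0          # total reward for the player who moved in

    def uct(self, c=1.4):
        if self.visits == 0:
            return float("inf")   # force each child to be tried once
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def rollout(board, player):
    """Play uniformly random moves to the end; return the winner (or 0)."""
    board = list(board)
    while winner(board) == 0 and legal_moves(board):
        board[random.choice(legal_moves(board))] = player
        player = -player
    return winner(board)

def mcts_move(board, player, sims=200):
    root = Node(list(board), player)
    for _ in range(sims):
        node = root
        while node.children:                                     # 1. selection
            node = max(node.children, key=Node.uct)
        if winner(node.board) == 0 and legal_moves(node.board):  # 2. expansion
            for m in legal_moves(node.board):
                nb = list(node.board)
                nb[m] = node.player
                node.children.append(Node(nb, -node.player, m, node))
            node = random.choice(node.children)
        result = rollout(node.board, node.player)                # 3. simulation
        while node is not None:                                  # 4. backprop
            node.visits += 1
            # credit the player who moved INTO this node (= -node.player)
            node.value += (1.0 if result == -node.player
                           else 0.5 if result == 0 else 0.0)
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).move

def play_vs_random(sims=200):
    """MCTS plays +1 and moves first; returns +1, -1, or 0 (draw)."""
    board, player = [0] * 9, 1
    while winner(board) == 0 and legal_moves(board):
        m = (mcts_move(board, player, sims) if player == 1
             else random.choice(legal_moves(board)))
        board[m] = player
        player = -player
    return winner(board)
```

Playing, say, 100 games with `play_vs_random` and tallying the +1/0/-1 results gives the requested win/draw/loss report; with around 1000 simulations per move MCTS should essentially never lose.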

Common pitfalls

  • Backpropagating the right value: In two-player games, negate the result when backpropagating (opponent’s win is our loss). For tic-tac-toe, use +1 win, -1 loss, 0 draw; alternate sign per level.
  • Terminal state: Do not expand a terminal node; backprop immediately with the game result.
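The sign alternation in the first pitfall can be handled by scoring each node from the perspective of the player who just moved into it. A sketch, assuming a hypothetical node layout with `visits`, `value`, `parent`, and `player` (the side to move at that node):

```python
def backpropagate(node, result):
    """Walk from a leaf to the root, updating statistics.

    `result` is +1 / -1 / 0 from player +1's point of view; each node is
    assumed to store `visits`, `value`, `parent`, and `player`.
    """
    while node is not None:
        node.visits += 1
        # the move into `node` was made by -node.player, so score the
        # result from that mover's perspective -- this is the sign flip
        node.value += result * (-node.player)
        node = node.parent
```

Because `player` alternates at each level of the tree, the update automatically negates the value at every step, so a position that is +1 for one side is -1 for the other.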

Worked solution (warm-up: MCTS)

Key idea: MCTS repeats four phases. (1) Select: traverse from the root by UCT until reaching a leaf. (2) Expand: add the leaf's children. (3) Rollout: play randomly to termination (or substitute a learned value network). (4) Backpropagate: update visit counts and values along the path. The value at the root estimates the expected outcome; we choose the action with the highest value (or visit count). AlphaZero uses this scheme for board games.

Extra practice

  1. Warm-up: What is the role of the exploration term \(c \sqrt{\log N(s)/N(s,a)}\) in UCT?
  2. Coding: Implement MCTS for tic-tac-toe. Plot win rate vs number of simulations per move (100, 500, 1000). Does win rate increase?
  3. Challenge: Implement MCTS for connect four (or another game). Compare win rate vs random with tic-tac-toe (harder game, need more simulations).