Chapter 17: Planning and Learning with Tabular Methods

Learning objectives

- Implement a simple model: store \((s,a) \rightarrow (r, s')\) from experience.
- Implement Dyna-Q: after each real environment step, perform \(k\) extra Q-updates using random \((s,a)\) pairs drawn from the model (simulated experience).
- Compare sample efficiency: Dyna-Q (planning + learning) vs. Q-learning (learning only).

Concept and real-world RL

Model-based methods use a learned or given model of the environment (transitions and rewards). Dyna-Q learns a tabular model from real experience: when you observe \((s,a,r,s')\), store it. Then, in addition to updating \(Q(s,a)\) from the real transition, you replay random \((s,a)\) pairs from the model, look up \((r,s')\), and perform a Q-learning update. This yields more learning per real step (planning). In real applications, learned models are used in model-based RL (e.g. world models, MuZero) to reduce sample complexity; the key idea is reusing past experience for extra updates. ...
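The Dyna-Q loop above can be sketched in a few dozen lines. The corridor environment, state/action encoding, and hyperparameters here are illustrative assumptions, not part of the chapter; the structure (real update, model store, \(k\) planning updates) follows the text:

```python
import random
from collections import defaultdict

# Hypothetical environment for illustration: a 1-D corridor of states 0..5,
# actions 0 (left) / 1 (right), reward 1.0 only on reaching the goal state.
N_STATES, GOAL = 6, 5
ACTIONS = [0, 1]

def step(s, a):
    """Deterministic corridor dynamics (an assumption for this sketch)."""
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return (1.0 if s2 == GOAL else 0.0), s2, s2 == GOAL

def dyna_q(episodes=50, k=10, alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)   # Q[(s, a)]
    model = {}               # model[(s, a)] = (r, s2, done) -- the tabular model

    def act(s):
        # epsilon-greedy with random tie-breaking
        if rng.random() < eps:
            return rng.choice(ACTIONS)
        qs = [Q[(s, a)] for a in ACTIONS]
        best = max(qs)
        return rng.choice([a for a in ACTIONS if qs[a] == best])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = act(s)
            r, s2, done = step(s, a)
            # (1) direct RL: Q-learning update from the real transition
            target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in ACTIONS))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (2) model learning: store the observed transition
            model[(s, a)] = (r, s2, done)
            # (3) planning: k extra updates from randomly replayed experience
            for _ in range(k):
                (ps, pa), (pr, ps2, pdone) = rng.choice(list(model.items()))
                ptarget = pr + (0.0 if pdone else gamma * max(Q[(ps2, b)] for b in ACTIONS))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q

Q = dyna_q()
```

Setting `k=0` recovers plain one-step Q-learning, so the sample-efficiency comparison in the objectives is a one-argument change.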

March 10, 2026 · 3 min · 583 words · codefrydev

Chapter 53: Planning with Known Models

Learning objectives

- Implement a planner using breadth-first search (BFS) for a gridworld with known deterministic dynamics.
- Recover the optimal policy (path to goal) and compare with dynamic programming (value iteration) in terms of computation and result.
- Relate BFS to shortest-path planning in robot navigation.

Concept and real-world RL

When the model is known and deterministic, we can plan without learning: BFS finds the shortest path from start to goal; value iteration computes optimal values for all states. In robot navigation (grid or graph), BFS is used for pathfinding; DP is used when we need values everywhere (e.g. for reward shaping). Both assume the model is correct; in RL we often learn the model or the value function from data. ...
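A minimal BFS planner along these lines might look as follows; the grid layout, start, and goal are assumptions made for the example (`#` marks a wall, `.` a free cell):

```python
from collections import deque

# Hypothetical 5x5 gridworld with known deterministic dynamics.
GRID = [
    "....#",
    ".##.#",
    ".#...",
    ".#.#.",
    "...#.",
]

def bfs_plan(grid, start, goal):
    """Return a shortest start->goal path as a list of (row, col) cells, or None."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}          # also serves as the visited set
    frontier = deque([start])
    while frontier:
        cell = frontier.popleft()
        if cell == goal:
            # reconstruct the path by following parent pointers back to start
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#" and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                frontier.append((nr, nc))
    return None  # goal unreachable

path = bfs_plan(GRID, (0, 0), (4, 4))
```

BFS touches each reachable state once and returns a single shortest path; value iteration sweeps all states repeatedly but yields optimal values everywhere, which is what you want when the values themselves are needed (e.g. for reward shaping).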

March 10, 2026 · 3 min · 443 words · codefrydev

Chapter 54: Monte Carlo Tree Search (MCTS)

Learning objectives

- Implement MCTS for a small game (e.g. tic-tac-toe): selection (UCT), expansion, simulation (rollout), backpropagation.
- Use UCT (Upper Confidence bounds applied to Trees) for node selection: \(\frac{Q(s,a)}{N(s,a)} + c \sqrt{\frac{\log N(s)}{N(s,a)}}\).
- Evaluate win rate against a random opponent.

Concept and real-world RL

MCTS builds a search tree by repeatedly selecting a leaf (UCT), expanding it, doing a random rollout to the end, and backpropagating the result. It does not require a learned value function (though it can use one, as in AlphaZero). In game AI (chess, Go, tic-tac-toe), MCTS is used for planning and action selection; it balances exploration (trying undervisited moves) and exploitation (favoring good moves). ...
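The four phases can be sketched compactly for tic-tac-toe. The board encoding (a 9-tuple of `'X'`/`'O'`/`None`), reward scheme (1 win, 0.5 draw, 0 loss), and exploration constant are illustrative assumptions:

```python
import math
import random

def legal_moves(board):
    return [i for i, c in enumerate(board) if c is None]

def winner(board):
    lines = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
    for a, b, c in lines:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def play(board, move, player):
    return board[:move] + (player,) + board[move + 1:]

class Node:
    def __init__(self, board, player, parent=None):
        self.board, self.player = board, player   # player is the side to move here
        self.parent, self.children = parent, {}   # children: move -> Node
        self.n, self.w = 0, 0.0                   # visit count, total reward

def uct_select(node, c=1.4):
    # UCT: maximise Q/N + c * sqrt(log N(parent) / N(child))
    return max(node.children.values(),
               key=lambda ch: ch.w / ch.n + c * math.sqrt(math.log(node.n) / ch.n))

def rollout(board, player, rng):
    # random playout to a terminal position
    while winner(board) is None and legal_moves(board):
        board = play(board, rng.choice(legal_moves(board)), player)
        player = 'O' if player == 'X' else 'X'
    return winner(board)  # 'X', 'O', or None for a draw

def mcts(board, player, iters=400, rng=None):
    rng = rng or random.Random(0)
    root = Node(board, player)
    for _ in range(iters):
        node = root
        # 1. selection: descend while the node is fully expanded and non-terminal
        while node.children and len(node.children) == len(legal_moves(node.board)):
            node = uct_select(node)
        # 2. expansion: add one untried child, unless the position is terminal
        if winner(node.board) is None and legal_moves(node.board):
            move = rng.choice([m for m in legal_moves(node.board) if m not in node.children])
            nxt = 'O' if node.player == 'X' else 'X'
            node.children[move] = Node(play(node.board, move, node.player), nxt, node)
            node = node.children[move]
        # 3. simulation: random rollout from the new leaf
        result = rollout(node.board, node.player, rng)
        # 4. backpropagation: credit each node from the viewpoint of the
        #    player whose move led into it
        while node is not None:
            node.n += 1
            mover = 'O' if node.player == 'X' else 'X'
            node.w += 1.0 if result == mover else (0.5 if result is None else 0.0)
            node = node.parent
    # final action choice: the most-visited root child (a common convention)
    return max(root.children.items(), key=lambda kv: kv[1].n)[0]
```

To estimate win rate against a random opponent, call `mcts` for one side and `rng.choice(legal_moves(board))` for the other over many games; more iterations per move trades computation for playing strength.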

March 10, 2026 · 3 min · 444 words · codefrydev