Chapter 11: Monte Carlo Methods

Learning objectives

- Implement first-visit Monte Carlo prediction: estimate \(V^\pi(s)\) by averaging returns from the first time \(s\) is visited in each episode.
- Use a Gym/Gymnasium blackjack environment and a fixed policy (stick on 20/21, else hit).
- Interpret value estimates for key states (e.g. usable ace, dealer showing 10).

Concept and real-world RL

Monte Carlo (MC) methods estimate value functions from experience: run episodes under a policy, compute the return from each state (or state-action pair), and average those returns. First-visit MC uses only the first time each state appears in an episode; every-visit MC uses every visit. No model (transition probabilities) is needed—only sample trajectories. In RL, MC is used when we can get full episodes (e.g. games, episodic tasks) and want simple, unbiased estimates. Game AI is a natural fit: blackjack has a small state space (player sum, dealer card, usable ace), stochastic transitions (card draws), and a clear “stick or hit” policy to evaluate. The same idea applies to evaluating a fixed strategy in any episodic game—we run many episodes and average the returns from each state. ...
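A minimal sketch of first-visit MC prediction. For self-containment it uses a simplified blackjack simulator (infinite deck, no natural-blackjack bonus) rather than the Gymnasium environment; the card rules, episode structure, and hyperparameters here are illustrative assumptions, but the fixed policy matches the one above (stick on 20/21, else hit):

```python
import random
from collections import defaultdict

def draw_card(rng):
    # Infinite deck: ace = 1, face cards count as 10.
    return min(rng.randint(1, 13), 10)

def hand_value(cards):
    # Count one ace as 11 if that does not bust the hand ("usable ace").
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False

def play_episode(rng):
    # Fixed policy: stick on 20 or 21, otherwise hit.
    player = [draw_card(rng), draw_card(rng)]
    dealer_show = draw_card(rng)
    dealer = [dealer_show, draw_card(rng)]
    states = []
    while True:
        total, usable = hand_value(player)
        if total > 21:
            return states, -1.0            # player busts
        states.append((total, dealer_show, usable))
        if total >= 20:
            break                          # stick
        player.append(draw_card(rng))
    while hand_value(dealer)[0] < 17:      # dealer hits until 17+
        dealer.append(draw_card(rng))
    p, d = hand_value(player)[0], hand_value(dealer)[0]
    if d > 21 or p > d:
        return states, 1.0
    return states, 0.0 if p == d else -1.0

def first_visit_mc(num_episodes=50_000, seed=0):
    rng = random.Random(seed)
    returns = defaultdict(list)
    for _ in range(num_episodes):
        states, reward = play_episode(rng)
        seen = set()
        for s in states:
            # First-visit rule: count each state at most once per episode.
            # Undiscounted, reward only at the end, so every return equals
            # the terminal reward.
            if s not in seen:
                seen.add(s)
                returns[s].append(reward)
    return {s: sum(g) / len(g) for g_s in () or returns.items() for s, g in [g_s]}

V = first_visit_mc()
print("V(player 20, dealer shows 10, no usable ace):", round(V[(20, 10, False)], 3))
```

Sticking on 20 should yield a clearly positive value against a dealer showing 10, since the dealer wins only by making 21; inspecting such key states is a quick sanity check on the estimates.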

March 10, 2026 · 4 min · 777 words · codefrydev

Chapter 12: Temporal Difference (TD) Learning

Learning objectives

- Implement TD(0) prediction: update \(V(s)\) using the TD target \(r + \gamma V(s')\) immediately after each transition.
- Compare TD(0) with Monte Carlo in terms of convergence speed and sample efficiency.
- Understand bootstrapping: TD uses current estimates instead of waiting for episode end.

Concept and real-world RL

Temporal Difference (TD) learning updates value estimates using the TD target \(r + \gamma V(s')\): \(V(s) \leftarrow V(s) + \alpha [r + \gamma V(s') - V(s)]\). Unlike Monte Carlo, TD does not need to wait for the episode to end; it bootstraps on the current estimate of \(V(s')\). TD(0) often converges faster per sample and works in continuing tasks. In practice, TD is the basis for SARSA, Q-learning, and many deep RL algorithms (e.g. DQN uses a TD-like target). Blackjack lets you compare TD(0) and MC on the same policy and state space. ...
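A minimal sketch of the TD(0) update. Instead of blackjack, it uses the classic 5-state random walk as a stand-in environment, since it is self-contained and has known true values (\(1/6, 2/6, \ldots, 5/6\)) to check against; the environment, step size, and episode count are illustrative assumptions:

```python
import random

def random_walk_episode(rng):
    # 5-state random walk (states 0..4), start in the middle (state 2).
    # Falling off the left end gives reward 0; off the right end, reward +1.
    s = 2
    while True:
        s_next = s + rng.choice((-1, 1))
        if s_next < 0:
            yield s, 0.0, None             # terminal transition
            return
        if s_next > 4:
            yield s, 1.0, None             # terminal transition
            return
        yield s, 0.0, s_next
        s = s_next

def td0(num_episodes=10_000, alpha=0.05, gamma=1.0, seed=0):
    rng = random.Random(seed)
    V = [0.5] * 5                          # initial estimates
    for _ in range(num_episodes):
        for s, r, s_next in random_walk_episode(rng):
            # Bootstrapped TD target: r + gamma * V(s'), applied after
            # every transition -- no need to wait for the episode to end.
            target = r + (gamma * V[s_next] if s_next is not None else 0.0)
            V[s] += alpha * (target - V[s])
    return V

V = td0()
print([round(v, 3) for v in V])  # should approach 1/6, 2/6, 3/6, 4/6, 5/6
```

Replacing the bootstrapped target with the full episode return turns this into every-visit MC, which is what makes the side-by-side comparison on a shared state space straightforward.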

March 10, 2026 · 3 min · 589 words · codefrydev