Monte Carlo and temporal-difference methods, SARSA and Q-learning, n-step bootstrapping, planning with tabular methods, custom Gym environments, and the limits of tabular methods. Chapters 11–20.
Chapter 11: Monte Carlo Methods
Learning objectives

- Implement first-visit Monte Carlo prediction: estimate \(V^\pi(s)\) by averaging the returns observed from the first time \(s\) is visited in each episode.
- Use a Gym/Gymnasium blackjack environment with a fixed policy (stick on 20 or 21, else hit).
- Interpret the value estimates for key states (e.g. usable ace, dealer showing 10).

Concept and real-world RL

Monte Carlo (MC) methods estimate value functions from experience: run episodes under a policy, compute the return from each state (or state-action pair), and average those returns. First-visit MC uses only the first time each state appears in an episode; every-visit MC uses every occurrence. No model (transition probabilities) is needed, only sample trajectories. In RL, MC is used when full episodes are available (e.g. games and other episodic tasks) and simple, unbiased estimates are wanted. Game AI is a natural fit: blackjack has a small state space (player sum, dealer's showing card, usable ace), stochastic transitions (card draws), and a clear "stick or hit" policy to evaluate. The same idea applies to evaluating a fixed strategy in any episodic game: run many episodes and average the returns from each state. ...
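The procedure above can be sketched in a few dozen lines. The snippet below is a minimal, self-contained illustration: instead of depending on Gymnasium's `Blackjack-v1`, it includes a tiny simplified blackjack simulator (infinite deck, dealer hits below 17, no special payout for naturals) so it runs with the standard library alone; the function and variable names are illustrative, not from any particular library.

```python
# First-visit Monte Carlo prediction for a fixed blackjack policy
# (stick on 20/21, else hit). A sketch standing in for Gymnasium's
# Blackjack-v1; rules are simplified (infinite deck, no natural bonus).
import random
from collections import defaultdict

def draw_card(rng):
    # Cards 1-10; J/Q/K count as 10, so 10 is drawn with probability 4/13.
    return min(rng.randint(1, 13), 10)

def hand_value(cards):
    # Returns (total, usable_ace): an ace counts as 11 if that doesn't bust.
    total, has_ace = sum(cards), 1 in cards
    if has_ace and total + 10 <= 21:
        return total + 10, True
    return total, False

def play_episode(rng):
    """One hand under the fixed policy; returns (visited states, return)."""
    player = [draw_card(rng), draw_card(rng)]
    dealer = [draw_card(rng), draw_card(rng)]
    showing = dealer[0]
    states = []
    while True:
        total, usable = hand_value(player)
        if total >= 12:  # sums below 12 are always hit; commonly excluded
            states.append((total, showing, usable))
        if total >= 20:  # fixed policy: stick on 20 or 21
            break
        player.append(draw_card(rng))
        if hand_value(player)[0] > 21:
            return states, -1.0  # player busts
    while hand_value(dealer)[0] < 17:  # dealer's fixed rule: hit below 17
        dealer.append(draw_card(rng))
    d_total, p_total = hand_value(dealer)[0], hand_value(player)[0]
    if d_total > 21 or p_total > d_total:
        return states, 1.0
    return (states, 0.0) if p_total == d_total else (states, -1.0)

def first_visit_mc(n_episodes=200_000, seed=0):
    """Estimate V(s) by averaging returns from the first visit to s."""
    rng = random.Random(seed)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(n_episodes):
        states, g = play_episode(rng)
        seen = set()
        for s in states:  # undiscounted episodic task: return is g throughout
            if s not in seen:
                seen.add(s)
                returns_sum[s] += g
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

V = first_visit_mc()
print(V[(20, 10, False)])  # sticking on 20 vs. a dealer 10: clearly positive
print(V[(12, 10, False)])  # hitting toward 20 from 12 vs. 10: clearly negative
```

With `Blackjack-v1`, the same estimator applies unchanged: only `play_episode` would be replaced by `env.reset()`/`env.step()` calls that record the observed `(player_sum, dealer_card, usable_ace)` states and the terminal reward.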