Chapter 11: Monte Carlo Methods

Learning objectives

- Implement first-visit Monte Carlo prediction: estimate \(V^\pi(s)\) by averaging returns from the first time \(s\) is visited in each episode.
- Use a Gym/Gymnasium blackjack environment and a fixed policy (stick on 20/21, else hit).
- Interpret value estimates for key states (e.g. usable ace, dealer showing 10).

Concept and real-world RL

Monte Carlo (MC) methods estimate value functions from experience: run episodes under a policy, compute the return from each state (or state-action pair), and average those returns. First-visit MC uses only the first time each state appears in an episode; every-visit MC uses every visit. No model (transition probabilities) is needed, only sample trajectories. In RL, MC is used when we can get full episodes (e.g. games, episodic tasks) and want simple, unbiased estimates. Game AI is a natural fit: blackjack has a small state space (player sum, dealer card, usable ace), stochastic transitions (card draws), and a clear “stick or hit” policy to evaluate. The same idea applies to evaluating a fixed strategy in any episodic game: run many episodes and average the returns observed from each state. ...
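As a minimal sketch of the idea (not the chapter's actual code): instead of Gymnasium's blackjack environment, a hand-rolled simplified blackjack stands in here — infinite deck of cards 2–10, no aces or usable-ace logic, dealer hits below 17 — evaluated under the fixed stick-on-20/21 policy. The function names and simplifications are assumptions for illustration.

```python
import random
from collections import defaultdict

def play_hand(rng):
    """Simplified blackjack: infinite deck of cards 2-10, no aces.
    Fixed policy: stick on 20 or 21, else hit. Dealer hits below 17.
    Returns (visited_states, outcome) with outcome in {+1, 0, -1}."""
    dealer_show = rng.randint(2, 10)
    player = rng.randint(2, 10) + rng.randint(2, 10)
    states = []
    while True:
        states.append((player, dealer_show))
        if player >= 20:          # fixed policy: stick on 20 or 21
            break
        player += rng.randint(2, 10)
        if player > 21:
            return states, -1.0   # player busts
    dealer = dealer_show + rng.randint(2, 10)
    while dealer < 17:
        dealer += rng.randint(2, 10)
    if dealer > 21 or player > dealer:
        return states, 1.0
    return states, -1.0 if player < dealer else 0.0

def first_visit_mc(num_episodes=50_000, seed=0):
    """Estimate V(s) by averaging the return from the first visit to s.
    With gamma = 1 and all intermediate rewards 0, G_t equals the outcome."""
    rng = random.Random(seed)
    total = defaultdict(float)
    count = defaultdict(int)
    for _ in range(num_episodes):
        states, outcome = play_hand(rng)
        for s in dict.fromkeys(states):   # first visits only, in order
            total[s] += outcome
            count[s] += 1
    return {s: total[s] / count[s] for s in total}

V = first_visit_mc()
# Sticking on 20 against a weak dealer card should have clearly positive value.
```

Even in this stripped-down game, the estimates show the expected pattern: states where the policy sticks on 20 are strongly positive, while low player sums against a dealer 10 are poor.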

March 10, 2026 · 4 min · 777 words · codefrydev

Monte Carlo in Code

Learning objectives

- Implement first-visit Monte Carlo policy evaluation in code (returns, averaging).
- Implement Monte Carlo control (estimate Q, improve the policy greedily).
- Implement MC control without exploring starts (e.g. an epsilon-greedy behavior policy).

Monte Carlo policy evaluation in code

Setup: you have an episodic environment (e.g. blackjack, gridworld) and a fixed policy \(\pi\). Goal: estimate \(V^\pi(s)\).

Algorithm:

- Run an episode: follow \(\pi\) and collect \((s_0, r_1, s_1, r_2, \ldots, r_T, s_T)\).
- For each step \(t\), compute the return from \(t\): \(G_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-1} r_T\) (or loop backward from the end).
- First-visit: for each state \(s\) that appears in the episode, find the first time \(t\) with \(s_t = s\); add \(G_t\) to a list (or running sum) for \(s\) and increment the count for \(s\).
- After many episodes: \(V(s) = \) (sum of returns from first visits to \(s\)) / (count of first visits to \(s\)).

Code sketch: use a dict returns[s] = [] or a (total, count) pair. In each episode, track which states have been seen; on the first visit to \(s\) at step \(t\), append \(G_t\) (or add it to the total and increment the count). After all episodes, \(V(s)\) is the mean of the returns recorded for \(s\). ...
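The code sketch above might look like the following, assuming episodes are already recorded as lists of (state, reward) pairs; `evaluate_policy` and the episode format are illustrative choices, not the article's actual code.

```python
from collections import defaultdict

def evaluate_policy(episodes, gamma=1.0):
    """First-visit Monte Carlo policy evaluation.

    `episodes` is a list of trajectories, each a list of (state, reward)
    pairs where reward is the reward received after leaving that state
    (i.e. r_{t+1}). Returns the estimated V(s) for every state seen."""
    total = defaultdict(float)
    count = defaultdict(int)
    for episode in episodes:
        # Compute returns backward: G_t = r_{t+1} + gamma * G_{t+1}.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()  # back to forward (time) order
        # First-visit: keep only the first occurrence of each state.
        seen = set()
        for state, G in returns:
            if state not in seen:
                seen.add(state)
                total[state] += G
                count[state] += 1
    return {s: total[s] / count[s] for s in total}

# Two hand-written episodes over states A and B, gamma = 1:
eps = [
    [("A", 0.0), ("B", 1.0)],   # G_A = 1, G_B = 1
    [("A", 0.0), ("B", 3.0)],   # G_A = 3, G_B = 3
]
V = evaluate_policy(eps)
# V["A"] = (1 + 3) / 2 = 2.0, and likewise V["B"] = 2.0
```

The backward loop avoids recomputing each suffix sum from scratch: \(G_t = r_{t+1} + \gamma G_{t+1}\) gives every return in one pass.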

March 10, 2026 · 3 min · 464 words · codefrydev

Tabular Methods

This page covers the tabular methods you need for the preliminary assessment: policy iteration and value iteration, the difference between Monte Carlo and TD, on-policy vs. off-policy learning, and the Q-learning update rule.

Why this matters for RL

When the state and action spaces are small enough, we can store one value per state (or state-action pair) and update those values from experience or from the model. Dynamic programming does this when the model is known; Monte Carlo and TD do it from samples. Q-learning is the canonical off-policy TD method and is the basis of many deep RL algorithms (e.g. DQN). You need to know how these methods differ and how to write the Q-learning update. ...
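As a hedged sketch of the Q-learning update on a toy problem — a 1-D chain where only the rightmost state is rewarding; the function name and hyperparameters here are illustrative assumptions:

```python
import random
from collections import defaultdict

def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
               epsilon=0.1, seed=0):
    """Tabular Q-learning on a 1-D chain: states 0..n_states-1, actions
    0 (left) and 1 (right); reward +1 for stepping onto the rightmost
    state (terminal), 0 otherwise. The behavior policy is epsilon-greedy,
    but the update is off-policy: it bootstraps from max over next-state
    actions regardless of the action actually taken next."""
    rng = random.Random(seed)
    Q = defaultdict(lambda: [0.0, 0.0])
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        while s != goal:
            # Epsilon-greedy action selection from Q (ties go right).
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0
            # Q-learning update:
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            target = r + gamma * max(Q[s_next])  # max is 0 at the terminal state
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

Q = q_learning()
# After training, the greedy policy moves right in every non-terminal state.
```

The single line computing `target` and the increment below it are the whole Q-learning rule; everything else is environment and exploration scaffolding.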

March 10, 2026 · 6 min · 1277 words · codefrydev