Learning objectives
- Implement first-visit Monte Carlo policy evaluation in code (returns, averaging).
- Implement Monte Carlo control (estimate Q, improve policy greedily).
- Implement MC control without exploring starts (e.g. with an epsilon-greedy behavior policy).
Monte Carlo policy evaluation in code
Setup: You have an episodic environment (e.g. blackjack, gridworld) and a fixed policy \(\pi\). Goal: estimate \(V^\pi(s)\).
Algorithm:
- Run an episode: follow \(\pi\), collect \((s_0, r_1, s_1, r_2, \ldots, r_T, s_T)\).
- For each step \(t\), compute the return from \(t\): \(G_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-1} r_T\). In practice, compute all returns in one backward pass using the recursion \(G_t = r_{t+1} + \gamma G_{t+1}\), starting from \(G_T = 0\).
- First-visit: For each state \(s\) that appears in the episode, find the first time \(t\) with \(s_t = s\). Add \(G_t\) to a list (or running sum) for state \(s\); increment the count for \(s\).
- After many episodes: \(V(s) = \) (sum of returns from first visits to \(s\)) / (count of first visits to \(s\)).
Code sketch: Use a dict returns[s] = [] or (total, count). In each episode, track which states have been seen; on first visit to \(s\) at step \(t\), append \(G_t\) (or add to total and increment count). At the end of all episodes, \(V(s) = \) mean of returns for \(s\).
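The sketch above can be written out as a short, self-contained function. This is a minimal illustration, not a fixed API: the `sample_episode` callable is a hypothetical stand-in for "run one episode under \(\pi\)" and is assumed to return the episode as a list of \((s_t, r_{t+1})\) pairs, matching the trajectory notation above.

```python
from collections import defaultdict

def first_visit_mc_evaluation(sample_episode, gamma, num_episodes):
    """First-visit Monte Carlo policy evaluation.

    sample_episode() returns one episode under pi as a list of
    (s_t, r_{t+1}) pairs for t = 0, ..., T-1.
    """
    returns_sum = defaultdict(float)   # running sum of first-visit returns per state
    returns_count = defaultdict(int)   # number of first visits per state
    for _ in range(num_episodes):
        episode = sample_episode()
        # Backward pass: G_t = r_{t+1} + gamma * G_{t+1}, with G_T = 0.
        G, rets = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            rets[t] = G
        # First-visit: record a return only the first time a state appears.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns_sum[s] += rets[t]
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# Toy check on a hypothetical two-step chain: s0 -(r=0)-> s1 -(r=1)-> terminal.
V = first_visit_mc_evaluation(lambda: [(0, 0.0), (1, 1.0)], gamma=1.0, num_episodes=100)
# With gamma = 1 and this deterministic episode, V[0] = V[1] = 1.0.
```

Using a running `(sum, count)` per state, as here, avoids storing every return; keeping lists of returns works too and gives the same averages.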
Monte Carlo control in code
Goal: Learn \(Q^*(s,a)\) and an optimal policy.
MC control with exploring starts:
- Initialize \(Q(s,a)\) arbitrarily and a policy \(\pi\), e.g. greedy w.r.t. \(Q\).
- Loop: pick a random \((s,a)\) pair to start an episode (exploring start). Take action \(a\) in state \(s\), then follow \(\pi\) for the rest of the episode.
- For each (s,a) visited, compute return from that (s,a) to end of episode. First-visit: for each (s,a) pair, use only the return from its first occurrence in the episode.
- Update \(Q(s,a) = \) average of those returns (over episodes).
- Improve policy: \(\pi(s) = \arg\max_a Q(s,a)\).
- Repeat until convergence.
Code: Store returns per (s,a) as a running sum and count, e.g. returns_sum[(s,a)] and returns_count[(s,a)]. After each episode, for each first-visit (s,a) add the return from that step; then set \(Q(s,a) = \) sum/count. Set \(\pi(s) = \arg\max_a Q(s,a)\).
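The loop above can be sketched on a tiny invented MDP. Everything environment-specific here (the `TRANSITIONS` table, the two states and two actions, the reward values) is hypothetical, chosen only so the optimal policy is easy to verify by hand: from state 0, action 0 leads to state 1 where action 1 pays 2, so with \(\gamma = 0.9\) the delayed path (return \(0.9 \times 2 = 1.8\)) beats the immediate reward of 1.

```python
import random
from collections import defaultdict

# Hypothetical deterministic MDP: (state, action) -> (reward, next state or None).
TRANSITIONS = {
    (0, 0): (0.0, 1), (0, 1): (1.0, None),
    (1, 0): (0.0, None), (1, 1): (2.0, None),
}
STATES, ACTIONS = [0, 1], [0, 1]

def mc_control_exploring_starts(num_episodes=5000, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    policy = {s: rng.choice(ACTIONS) for s in STATES}
    for _ in range(num_episodes):
        # Exploring start: random (s, a); afterwards follow the current policy.
        s, a = rng.choice(STATES), rng.choice(ACTIONS)
        episode = []
        while s is not None:
            r, s_next = TRANSITIONS[(s, a)]
            episode.append((s, a, r))
            s = s_next
            if s is not None:
                a = policy[s]
        # Backward pass for returns: G_t = r_{t+1} + gamma * G_{t+1}.
        G, rets = 0.0, []
        for (_, _, r) in reversed(episode):
            G = r + gamma * G
            rets.append(G)
        rets.reverse()
        # First-visit updates, then greedy policy improvement.
        seen = set()
        for t, (s_t, a_t, _) in enumerate(episode):
            if (s_t, a_t) not in seen:
                seen.add((s_t, a_t))
                returns_sum[(s_t, a_t)] += rets[t]
                returns_count[(s_t, a_t)] += 1
                Q[(s_t, a_t)] = returns_sum[(s_t, a_t)] / returns_count[(s_t, a_t)]
                policy[s_t] = max(ACTIONS, key=lambda a: Q[(s_t, a)])
    return Q, policy
```

Note that this variant improves the policy after every episode rather than waiting for \(Q\) to converge first; both are standard, and the per-episode version is what most implementations use.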
Monte Carlo control without exploring starts in code
Exploring starts are often unrealistic: in many tasks we cannot choose the starting state-action pair. Alternative: use an epsilon-greedy policy: with probability \(\epsilon\) choose a random action, with probability \(1-\epsilon\) choose \(\arg\max_a Q(s,a)\). Run episodes from a normal start state (e.g. environment reset). Use first-visit MC to update \(Q(s,a)\) from the returns, and keep \(\pi\) epsilon-greedy w.r.t. \(Q\) (optionally decaying \(\epsilon\) over time). This is an on-policy method: we evaluate and improve the same epsilon-greedy policy that generates the episodes.
Code: Same as above but no random (s,a) start—start from env.reset(). Use epsilon-greedy to select actions during the episode. Update \(Q\) from first-visit (s,a) returns; keep \(\epsilon\) fixed or decay it.
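A sketch of the on-policy variant, reusing the same hypothetical two-state MDP as in the exploring-starts example (the `TRANSITIONS` table and reward values are invented for illustration). The only structural changes are the fixed start state, standing in for `env.reset()`, and epsilon-greedy action selection during the episode.

```python
import random
from collections import defaultdict

# Hypothetical deterministic MDP: (state, action) -> (reward, next state or None).
TRANSITIONS = {
    (0, 0): (0.0, 1), (0, 1): (1.0, None),
    (1, 0): (0.0, None), (1, 1): (2.0, None),
}
ACTIONS = [0, 1]

def epsilon_greedy(Q, s, epsilon, rng):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def mc_control_epsilon_greedy(num_episodes=5000, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(num_episodes):
        s = 0  # normal start state: the env.reset() analogue, no exploring start
        episode = []
        while s is not None:
            a = epsilon_greedy(Q, s, epsilon, rng)  # on-policy action selection
            r, s_next = TRANSITIONS[(s, a)]
            episode.append((s, a, r))
            s = s_next
        # Backward pass for returns, then first-visit updates to Q.
        G, rets = 0.0, []
        for (_, _, r) in reversed(episode):
            G = r + gamma * G
            rets.append(G)
        rets.reverse()
        seen = set()
        for t, (s_t, a_t, _) in enumerate(episode):
            if (s_t, a_t) not in seen:
                seen.add((s_t, a_t))
                returns_sum[(s_t, a_t)] += rets[t]
                returns_count[(s_t, a_t)] += 1
                Q[(s_t, a_t)] = returns_sum[(s_t, a_t)] / returns_count[(s_t, a_t)]
    greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in (0, 1)}
    return Q, greedy
```

Here \(\epsilon\) is held fixed; decaying it toward 0 over episodes (e.g. \(\epsilon_k = 1/k\)) moves the learned epsilon-greedy policy toward the greedy one while preserving exploration early on.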
See Chapter 11: Monte Carlo Methods for theory and the blackjack exercise, and TD, SARSA, Q-Learning in Code for temporal-difference algorithms.