Chapter 11: Monte Carlo Methods

Learning objectives

- Implement first-visit Monte Carlo prediction: estimate \(V^\pi(s)\) by averaging returns from the first time \(s\) is visited in each episode.
- Use a Gym/Gymnasium blackjack environment and a fixed policy (stick on 20/21, else hit).
- Interpret value estimates for key states (e.g. usable ace, dealer showing 10).

Concept and real-world RL

Monte Carlo (MC) methods estimate value functions from experience: run episodes under a policy, compute the return from each state (or state-action pair), and average those returns. First-visit MC uses only the first time each state appears in an episode; every-visit MC uses every visit. No model (transition probabilities) is needed—only sample trajectories. In RL, MC is used when we can get full episodes (e.g. games, episodic tasks) and want simple, unbiased estimates. Game AI is a natural fit: blackjack has a small state space (player sum, dealer card, usable ace), stochastic transitions (card draws), and a clear “stick or hit” policy to evaluate. The same idea applies to evaluating a fixed strategy in any episodic game—we run many episodes and average the returns from each state. ...
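The first-visit averaging described above can be sketched in a few lines. This is a minimal stand-alone version (the name `first_visit_mc` and the episode format are my own, not the chapter's code) that takes pre-collected episodes of (state, reward) pairs instead of a live blackjack env:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit MC prediction. Each episode is a list of (state, reward)
    pairs, where reward is the reward received after leaving that state.
    Returns V(s) = average return from the first visit to s."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        # Compute returns backward: G_t = r_{t+1} + gamma * G_{t+1}.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G in returns:
            if state not in seen:  # count only the first visit to each state
                seen.add(state)
                totals[state] += G
                counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}
```

With a Gymnasium blackjack env you would generate the episodes by following the fixed stick-on-20/21 policy and feed them to the same function.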

March 10, 2026 · 4 min · 777 words · codefrydev

Chapter 12: Temporal Difference (TD) Learning

Learning objectives

- Implement TD(0) prediction: update \(V(s)\) using the TD target \(r + \gamma V(s')\) immediately after each transition.
- Compare TD(0) with Monte Carlo in terms of convergence speed and sample efficiency.
- Understand bootstrapping: TD uses current estimates instead of waiting for the episode to end.

Concept and real-world RL

Temporal Difference (TD) learning updates value estimates using the TD target \(r + \gamma V(s')\): \(V(s) \leftarrow V(s) + \alpha [r + \gamma V(s') - V(s)]\). Unlike Monte Carlo, TD does not need to wait for the episode to end; it bootstraps on the current estimate of \(V(s')\). TD(0) often converges faster per sample and works in continuing tasks. In practice, TD is the basis for SARSA, Q-learning, and many deep RL algorithms (e.g. DQN uses a TD-like target). Blackjack lets you compare TD(0) and MC on the same policy and state space. ...
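To see bootstrapping in isolation, here is a hedged sketch of TD(0) prediction on a hypothetical two-state chain (A → B → terminal) rather than blackjack; the env and the name `td0_predict` are illustrative assumptions, but the update rule is the one given above:

```python
import random

def td0_predict(num_episodes=1000, alpha=0.1, gamma=1.0, seed=0):
    """TD(0) on a toy chain: A -> B (reward 0), then B -> terminal
    (reward +1 with probability 0.5, else 0). True values are ~0.5."""
    rng = random.Random(seed)
    V = {"A": 0.0, "B": 0.0}
    for _ in range(num_episodes):
        # Transition A -> B: update V(A) immediately, bootstrapping on V(B).
        V["A"] += alpha * (0.0 + gamma * V["B"] - V["A"])
        # Transition B -> terminal: V(terminal) = 0 in the target.
        r = 1.0 if rng.random() < 0.5 else 0.0
        V["B"] += alpha * (r + 0.0 - V["B"])
    return V
```

Note that V(A) starts improving as soon as V(B) does, without ever waiting for a full return; a Monte Carlo estimator would only update after each episode ends.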

March 10, 2026 · 3 min · 589 words · codefrydev

Monte Carlo in Code

Learning objectives

- Implement first-visit Monte Carlo policy evaluation in code (returns, averaging).
- Implement Monte Carlo control (estimate Q, improve the policy greedily).
- Implement MC control without exploring starts (e.g. an epsilon-greedy behavior policy).

Monte Carlo policy evaluation in code

Setup: You have an episodic environment (e.g. blackjack, gridworld) and a fixed policy \(\pi\). Goal: estimate \(V^\pi(s)\).

Algorithm:

- Run an episode: follow \(\pi\), collect \((s_0, r_1, s_1, r_2, \ldots, r_T, s_T)\).
- For each step \(t\), compute the return from \(t\): \(G_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-1} r_T\) (or loop backward from the end).
- First-visit: for each state \(s\) that appears in the episode, find the first time \(t\) with \(s_t = s\). Add \(G_t\) to a list (or running sum) for state \(s\); increment the count for \(s\).
- After many episodes: \(V(s) =\) (sum of returns from first visits to \(s\)) / (count of first visits to \(s\)).

Code sketch: Use a dict returns[s] = [] or (total, count). In each episode, track which states have been seen; on the first visit to \(s\) at step \(t\), append \(G_t\) (or add to the total and increment the count). At the end of all episodes, \(V(s) =\) mean of the returns for \(s\). ...
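For the control side (estimate Q, improve greedily, no exploring starts), one possible shape is the pair of helpers below. The names `epsilon_greedy` and `mc_control_episode`, and the (total, count) incremental mean, are my own sketch of the ideas above, not the chapter's exact code:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Behavior policy for MC control without exploring starts."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def mc_control_episode(Q, counts, episode, gamma=1.0):
    """First-visit MC control update from one episode of (s, a, r) triples,
    where Q and counts are defaultdicts keyed by (state, action)."""
    G = 0.0
    returns = []
    for s, a, r in reversed(episode):  # backward pass builds each G_t
        G = r + gamma * G
        returns.append((s, a, G))
    returns.reverse()
    seen = set()
    for s, a, G in returns:
        if (s, a) not in seen:  # first visit to this state-action pair
            seen.add((s, a))
            counts[(s, a)] += 1
            # Incremental mean: Q <- Q + (G - Q) / N, same as averaging a list.
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
```

The incremental mean avoids storing returns[s] lists while computing the same average.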

March 10, 2026 · 3 min · 464 words · codefrydev

Chapter 13: SARSA (On-Policy TD Control)

Learning objectives

- Implement SARSA: update \(Q(s,a)\) using the transition \((s,a,r,s',a')\) with target \(r + \gamma Q(s',a')\).
- Use \(\epsilon\)-greedy exploration for behavior and learn the same policy you follow (on-policy).
- Interpret learning curves (sum of rewards per episode) on Cliff Walking.

Concept and real-world RL

SARSA is an on-policy TD control method: it updates \(Q(s,a)\) using the actual next action \(a'\) chosen by the current policy, so it learns the value of the behavior policy (the one you are following). The update is \(Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma Q(s',a') - Q(s,a)]\). Because \(a'\) can be exploratory, SARSA accounts for the risk of exploration (e.g. stepping off the cliff by accident) and often learns a safer policy than Q-learning on Cliff Walking. In real applications, on-policy methods are used when you want to optimize the same policy you use for data collection (e.g. safe robotics). ...
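The whole update fits in one small function; this is a minimal sketch (the name `sarsa_update` and the dict-of-tuples layout for Q are illustrative choices, not the chapter's code). The key point is that `a_next` is the action the behavior policy actually chose, not a max:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, terminal=False):
    """On-policy TD control: target uses the actually-chosen next action a'.
    Q is a dict keyed by (state, action); Q(s', a') is 0 at terminal states."""
    q_next = 0.0 if terminal else Q.get((s_next, a_next), 0.0)
    td_error = r + gamma * q_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
```

In a Cliff Walking loop you would pick `a_next` with the same \(\epsilon\)-greedy policy before calling this, so exploratory steps near the cliff feed back into the learned values.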

March 10, 2026 · 3 min · 541 words · codefrydev

TD, SARSA, and Q-Learning in Code

Learning objectives

- Implement TD(0) prediction in code: update \(V(s)\) after each transition.
- Implement SARSA (on-policy TD control): update \(Q(s,a)\) using the next action from the behavior policy.
- Implement Q-learning (off-policy TD control): update \(Q(s,a)\) using the max over next actions.

TD(0) prediction in code

Goal: Estimate \(V^\pi\) for a fixed policy \(\pi\). Update after each transition \((s, r, s')\):

\[ V(s) \leftarrow V(s) + \alpha \bigl[ r + \gamma V(s') - V(s) \bigr] \]

Use \(V(s') = 0\) if \(s'\) is terminal. ...
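A direct translation of that update, including the terminal-state convention, might look like this (the function name and dict representation of \(V\) are my own sketch):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """One TD(0) step: V(s) <- V(s) + alpha * [r + gamma*V(s') - V(s)],
    with V(s') taken as 0 when s' is terminal. V is a dict; unseen
    states default to 0."""
    v_next = 0.0 if terminal else V.get(s_next, 0.0)
    V[s] = V.get(s, 0.0) + alpha * (r + gamma * v_next - V.get(s, 0.0))
    return V[s]
```

Calling this once per transition, inside the episode loop, is the entire prediction algorithm.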

March 10, 2026 · 2 min · 351 words · codefrydev

Chapter 14: Q-Learning (Off-Policy TD Control)

Learning objectives

- Implement Q-learning: update \(Q(s,a)\) using the target \(r + \gamma \max_{a'} Q(s',a')\) (off-policy).
- Compare Q-learning and SARSA on Cliff Walking: paths and reward curves.
- Explain why Q-learning can learn a riskier policy (along the cliff edge) than SARSA.

Concept and real-world RL

Q-learning is off-policy: it updates \(Q(s,a)\) using the greedy next action (\(\max_{a'} Q(s',a')\)), so it learns the value of the optimal policy while you can behave with \(\epsilon\)-greedy (or any exploration). The update is \(Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]\). On Cliff Walking, Q-learning often converges to the shortest path along the cliff (high reward when there is no exploration, but dangerous if you occasionally take a random step). SARSA learns the actual policy including exploration and tends to stay away from the cliff. In practice, Q-learning is simple and widely used (e.g. DQN); when safety matters, on-policy or conservative methods may be preferred. ...
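The only difference from SARSA is the target: a max over next actions instead of the sampled \(a'\). A hedged sketch (the name `q_learning_update` and the defaultdict-based Q-table are illustrative, not the chapter's code):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma, terminal=False):
    """Off-policy TD control: target uses the greedy next action,
    max over a' of Q(s', a'). Q should be a defaultdict(float) keyed
    by (state, action)."""
    best_next = 0.0 if terminal else max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

Because the target ignores which action the behavior policy will actually take, cliff-edge cells keep high values even when \(\epsilon\)-greedy exploration occasionally walks off the edge, which is exactly the riskier-path behavior described above.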

March 10, 2026 · 3 min · 589 words · codefrydev

Chapter 15: Expected SARSA

Learning objectives

- Implement Expected SARSA: use \(\sum_{a'} \pi(a'|s') Q(s',a')\) as the target instead of \(\max_{a'} Q(s',a')\) or \(Q(s',a')\).
- Relate Expected SARSA to SARSA (on-policy) and Q-learning (max); it can be used on- or off-policy depending on \(\pi\).
- Compare update variance and learning curves with Q-learning.

Concept and real-world RL

Expected SARSA uses the expected next-action value under a policy \(\pi\): target = \(r + \gamma \sum_{a'} \pi(a'|s') Q(s',a')\). For an \(\epsilon\)-greedy \(\pi\), this is \(r + \gamma \bigl[(1-\epsilon) \max_{a'} Q(s',a') + \frac{\epsilon}{|A|} \sum_{a'} Q(s',a')\bigr]\). It reduces the variance of the update (compared to SARSA, which uses a single sampled \(Q(s',a')\)) and can be more stable. When \(\pi\) is greedy, Expected SARSA becomes Q-learning. In practice, it is a middle ground between SARSA and Q-learning and is used in some deep RL variants. ...
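The \(\epsilon\)-greedy expectation above can be computed directly; here is a minimal sketch of the target (the function name and argument layout are my own assumptions). Note that setting \(\epsilon = 0\) recovers the Q-learning target, as the text says:

```python
from collections import defaultdict

def expected_sarsa_target(Q, s_next, actions, r, gamma, epsilon):
    """Expected SARSA target for an epsilon-greedy policy: each action has
    probability epsilon/|A|, and the greedy action gets an extra (1-epsilon).
    Q is a defaultdict(float) keyed by (state, action)."""
    q_vals = [Q[(s_next, a)] for a in actions]
    expected = (epsilon / len(actions)) * sum(q_vals) + (1 - epsilon) * max(q_vals)
    return r + gamma * expected
```

The update itself is then the usual `Q[(s, a)] += alpha * (target - Q[(s, a)])`; only the target changes relative to SARSA and Q-learning.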

March 10, 2026 · 3 min · 618 words · codefrydev

Chapter 16: N-Step Bootstrapping

Learning objectives

- Implement n-step SARSA: accumulate \(n\) steps of experience, then update \(Q(s_0,a_0)\) using the n-step return \(r_1 + \gamma r_2 + \cdots + \gamma^{n-1} r_n + \gamma^n Q(s_n,a_n)\).
- Compare n-step (\(n=4\)) with one-step SARSA on Cliff Walking (learning speed, stability).
- Understand the trade-off: n-step uses more information per update but delays the update.

Concept and real-world RL

N-step bootstrapping uses a return over \(n\) steps: \(G_{t:t+n} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})\) (or \(Q(s_{t+n},a_{t+n})\) for SARSA). \(n=1\) is TD(0); \(n=\infty\) (until terminal) is Monte Carlo. Intermediate \(n\) balances bias and variance. In practice, n-step methods (e.g. n-step SARSA, A3C’s n-step returns) can learn faster than one-step methods when \(n\) is chosen well; too large an \(n\) delays updates and can hurt in non-stationary settings or long episodes. ...
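The n-step return itself is just a discounted sum plus a bootstrapped tail; a small helper (the name `n_step_return` is my own, not the chapter's code) makes the \(n=1\)/\(n=\infty\) endpoints easy to see:

```python
def n_step_return(rewards, gamma, bootstrap_value):
    """G_{t:t+n} = r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n}
    + gamma^n * bootstrap_value, where bootstrap_value is V(s_{t+n})
    or Q(s_{t+n}, a_{t+n}); pass 0 if the episode ended within n steps."""
    G = 0.0
    for i, r in enumerate(rewards):
        G += (gamma ** i) * r
    G += (gamma ** len(rewards)) * bootstrap_value
    return G
```

With one reward and a Q bootstrap this is the one-step SARSA target; with all rewards to the end of the episode and bootstrap 0, it is the Monte Carlo return.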

March 10, 2026 · 3 min · 557 words · codefrydev

Chapter 17: Planning and Learning with Tabular Methods

Learning objectives

- Implement a simple model: store \((s,a) \rightarrow (r, s')\) from experience.
- Implement Dyna-Q: after each real env step, do \(k\) extra Q-updates using random \((s,a)\) pairs from the model (simulated experience).
- Compare sample efficiency: Dyna-Q (planning + learning) vs Q-learning (learning only).

Concept and real-world RL

Model-based methods use a learned or given model of the environment (transitions and rewards). Dyna-Q learns a tabular model from real experience: when you observe \((s,a,r,s')\), store it. Then, in addition to updating \(Q(s,a)\) from the real transition, you replay random \((s,a)\) pairs from the model, look up \((r,s')\), and do a Q-learning update. This gives more learning per real step (planning). In real applications, learned models are used in model-based RL (e.g. world models, MuZero) to reduce sample complexity; the key idea is reusing past experience for extra updates. ...
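The learn-store-replay cycle above can be sketched as one function per real env step. This is an illustrative sketch under simplifying assumptions (deterministic environment, so the model keeps only the latest \((r, s')\) per \((s,a)\); the name `dyna_q_step` is my own):

```python
import random
from collections import defaultdict

def dyna_q_step(Q, model, s, a, r, s_next, actions, alpha, gamma, k, rng=random):
    """One Dyna-Q step: Q-learning update from the real transition, store it
    in the tabular model, then k planning updates replayed from the model.
    Q is a defaultdict(float); model maps (s, a) -> (r, s_next)."""
    def update(s, a, r, s_next):
        best = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

    update(s, a, r, s_next)       # learning: real experience
    model[(s, a)] = (r, s_next)   # model update: remember the transition
    for _ in range(k):            # planning: k updates from simulated experience
        ps, pa = rng.choice(list(model))
        pr, ps_next = model[(ps, pa)]
        update(ps, pa, pr, ps_next)
```

With `k=0` this reduces to plain Q-learning; increasing `k` buys extra value-propagation per real step at the cost of compute, which is the sample-efficiency comparison the chapter runs.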

March 10, 2026 · 3 min · 583 words · codefrydev

Chapter 18: Custom Gym Environments (Part 1)

Learning objectives

- Create a custom Gymnasium (or Gym) environment: inherit from gym.Env, implement reset, step, and an optional render.
- Define observation_space and action_space (e.g. Discrete(4) for up/down/left/right).
- Implement a text-based render (e.g. print a grid with the agent and goal).

Concept and real-world RL

Real RL often requires custom environments: simulators for robotics, games, or domain-specific tasks. The Gym API (reset, step, observation_space, action_space) is the standard. Implementing a small maze teaches you how to encode state (e.g. the agent's position), handle boundaries and obstacles, and return (obs, reward, terminated, truncated, info). In practice, you will wrap or write envs for your own problem and reuse the same agents (e.g. Q-learning, DQN) trained on standard envs. ...
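The shape of such an environment can be sketched without any dependencies. The hypothetical `MiniMazeEnv` below follows the Gymnasium calling convention (reset returns (obs, info); step returns (obs, reward, terminated, truncated, info)) but, to stay self-contained here, does not actually inherit from gymnasium.Env or declare the space objects; in a real project you would add both:

```python
class MiniMazeEnv:
    """Toy 4x4 gridworld in the Gymnasium API shape (illustrative sketch).
    Actions 0..3 = up/down/left/right; reach GOAL for +1, small step cost."""
    SIZE, GOAL = 4, (3, 3)
    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

    def reset(self, seed=None):
        self.pos = (0, 0)
        return self.pos, {}  # (observation, info)

    def step(self, action):
        dr, dc = self.MOVES[action]
        r = min(max(self.pos[0] + dr, 0), self.SIZE - 1)  # clip at walls
        c = min(max(self.pos[1] + dc, 0), self.SIZE - 1)
        self.pos = (r, c)
        terminated = self.pos == self.GOAL
        reward = 1.0 if terminated else -0.01
        return self.pos, reward, terminated, False, {}  # truncated=False

    def render(self):
        # Text render: A = agent, G = goal, . = empty cell.
        for r in range(self.SIZE):
            print("".join("A" if (r, c) == self.pos
                          else "G" if (r, c) == self.GOAL
                          else "." for c in range(self.SIZE)))
```

Because the interface matches, the same tabular agents from earlier chapters (SARSA, Q-learning, Dyna-Q) can train on this env unchanged.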

March 10, 2026 · 3 min · 556 words · codefrydev