Chapter 12: Temporal Difference (TD) Learning

Learning objectives

- Implement TD(0) prediction: update \(V(s)\) using the TD target \(r + \gamma V(s')\) immediately after each transition.
- Compare TD(0) with Monte Carlo in terms of convergence speed and sample efficiency.
- Understand bootstrapping: TD uses current estimates instead of waiting for the episode to end.

Concept and real-world RL

Temporal Difference (TD) learning updates value estimates using the TD target \(r + \gamma V(s')\):

\(V(s) \leftarrow V(s) + \alpha [r + \gamma V(s') - V(s)]\)

Unlike Monte Carlo, TD does not need to wait for the episode to end; it bootstraps on the current estimate of \(V(s')\). TD(0) often converges faster per sample and works in continuing tasks. In practice, TD is the basis for SARSA, Q-learning, and many deep RL algorithms (e.g. DQN uses a TD-like target). Blackjack lets you compare TD(0) and MC on the same policy and state space. ...
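The TD(0) update above can be sketched in a few lines of tabular Python. The environment here is a hypothetical four-state deterministic chain (not from the post): states 0-3, each step moves right, and reaching the terminal state 3 yields reward 1.

```python
# TD(0) prediction on a small hypothetical chain MDP.
# States 0..3; state 3 is terminal. Each transition moves right,
# with reward 1.0 on entering the terminal state and 0.0 otherwise.

def td0_chain(alpha=0.1, gamma=1.0, episodes=200):
    V = [0.0] * 4  # tabular value estimates; V[3] stays 0.0 (terminal)
    for _ in range(episodes):
        s = 0
        while s != 3:
            s_next = s + 1
            r = 1.0 if s_next == 3 else 0.0
            # TD(0) update: bootstrap on the current estimate V[s_next]
            # instead of waiting for the episode return (contrast with MC)
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V
```

With \(\gamma = 1\) the true value of every non-terminal state is 1.0, and the estimates approach it after a few hundred episodes; note each update happens immediately after the transition, which is what distinguishes TD(0) from Monte Carlo.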

March 10, 2026 · 3 min · 589 words · codefrydev

Tabular Methods

This page covers the tabular methods you need for the preliminary assessment: policy iteration and value iteration, the difference between Monte Carlo and TD, on-policy vs off-policy learning, and the Q-learning update rule.

Why this matters for RL

When the state and action spaces are small enough, we can store one value per state (or state-action pair) and update them from experience or from the model. Dynamic programming does this when we know the model; Monte Carlo and TD do it from samples. Q-learning is the canonical off-policy TD method and is the basis of many deep RL algorithms (e.g. DQN). You need to know how these methods differ and how to write the Q-learning update. ...
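As a concrete illustration of the Q-learning update, here is a minimal sketch on a hypothetical five-state corridor (this environment is an assumption, not from the post): states 0-4, actions left/right, reward 1.0 for reaching state 4. The behaviour policy is epsilon-greedy while the update bootstraps on the greedy target, which is what makes Q-learning off-policy.

```python
import random

def q_learning_chain(alpha=0.5, gamma=0.9, epsilon=0.1, episodes=500, seed=0):
    # Hypothetical corridor: states 0..4, actions 0=left, 1=right.
    # Entering state 4 gives reward 1.0 and ends the episode.
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(5)]  # tabular Q(s, a)
    for _ in range(episodes):
        s = 0
        while s != 4:
            # Epsilon-greedy behaviour policy (exploration)
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s_next == 4 else 0.0
            # Off-policy TD target: bootstrap on max over next actions,
            # regardless of which action the behaviour policy will take.
            best_next = 0.0 if s_next == 4 else max(Q[s_next])
            Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
            s = s_next
    return Q
```

After training, the greedy action in every state is "right", and Q(3, right) approaches 1.0 while Q(2, right) approaches \(\gamma \cdot 1.0 = 0.9\), matching the discounted optimal values.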

March 10, 2026 · 6 min · 1277 words · codefrydev