Chapter 12: Temporal Difference (TD) Learning

Learning objectives

- Implement TD(0) prediction: update \(V(s)\) using the TD target \(r + \gamma V(s')\) immediately after each transition.
- Compare TD(0) with Monte Carlo in terms of convergence speed and sample efficiency.
- Understand bootstrapping: TD uses current estimates instead of waiting for the episode to end.

Concept and real-world RL

Temporal Difference (TD) learning updates value estimates using the TD target \(r + \gamma V(s')\):

\[ V(s) \leftarrow V(s) + \alpha \bigl[ r + \gamma V(s') - V(s) \bigr] \]

Unlike Monte Carlo, TD does not need to wait for the episode to end; it bootstraps on the current estimate of \(V(s')\). TD(0) often converges faster per sample and works in continuing tasks. In practice, TD is the basis for SARSA, Q-learning, and many deep RL algorithms (e.g., DQN uses a TD-like target). Blackjack lets you compare TD(0) and MC on the same policy and state space. ...
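The TD(0) update above can be sketched in a few lines of Python. This is a minimal illustration, not code from the post: the random-walk chain environment, the function name, and all parameter values are assumptions chosen so the true values are known in closed form (for a symmetric 5-state walk, \(V(i) = (i+1)/6\)).

```python
import random

def td0_random_walk(n_states=5, alpha=0.05, gamma=1.0, episodes=5000, seed=0):
    """TD(0) prediction on a symmetric random-walk chain (illustrative example).

    States 0..n_states-1; each step moves left or right with equal probability.
    Stepping off the right end gives reward +1, off the left end reward 0;
    both ends are terminal. True values are V(i) = (i + 1) / (n_states + 1).
    """
    rng = random.Random(seed)
    V = [0.0] * n_states                  # V at terminal states is implicitly 0
    for _ in range(episodes):
        s = n_states // 2                 # every episode starts in the middle
        while True:
            s_next = s + rng.choice((-1, 1))
            if s_next < 0:                        # absorbed on the left
                r, v_next, done = 0.0, 0.0, True
            elif s_next >= n_states:              # absorbed on the right
                r, v_next, done = 1.0, 0.0, True
            else:
                r, v_next, done = 0.0, V[s_next], False
            # TD(0) update: V(s) <- V(s) + alpha [r + gamma V(s') - V(s)]
            V[s] += alpha * (r + gamma * v_next - V[s])
            if done:
                break
            s = s_next
    return V
```

Note that the update happens inside the step loop, immediately after each transition, with \(V(s') = 0\) at terminal states; a Monte Carlo version would instead wait for the episode return before touching any estimate.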

March 10, 2026 · 3 min · 589 words · codefrydev

TD, SARSA, and Q-Learning in Code

Learning objectives

- Implement TD(0) prediction in code: update \(V(s)\) after each transition.
- Implement SARSA (on-policy TD control): update \(Q(s,a)\) using the next action from the behavior policy.
- Implement Q-learning (off-policy TD control): update \(Q(s,a)\) using the max over next actions.

TD(0) prediction in code

Goal: estimate \(V^\pi\) for a fixed policy \(\pi\). Update after each transition \((s, r, s')\):

\[ V(s) \leftarrow V(s) + \alpha \bigl[ r + \gamma V(s') - V(s) \bigr] \]

Use \(V(s') = 0\) if \(s'\) is terminal. ...
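The off-policy control objective above can be sketched as Q-learning on a tiny chain MDP. This is an illustrative assumption, not the post's own code: the deterministic chain environment, the function name, and the hyperparameters are all made up for the example; the SARSA variant would differ only in the target, as noted in the comments.

```python
import random

def q_learning_chain(n=5, alpha=0.5, gamma=0.9, eps=0.2, episodes=500, seed=0):
    """Q-learning on a deterministic chain (illustrative example).

    States 0..n-1, actions 0 = left, 1 = right. Reaching state n-1 gives
    reward +1 and ends the episode; every other transition gives 0.
    """
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n)]

    def step(s, a):
        s2 = max(0, s - 1) if a == 0 else s + 1
        if s2 == n - 1:
            return s2, 1.0, True              # goal reached: reward, terminal
        return s2, 0.0, False

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy behavior policy
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda x: Q[s][x])
            s2, r, done = step(s, a)
            # Q-learning (off-policy): bootstrap on the max over next actions.
            # SARSA (on-policy) would instead use Q[s2][a2] for the action a2
            # actually selected by the behavior policy in s2.
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q
```

On this chain the greedy policy learned from \(Q\) moves right in every state, and \(Q(s, \text{right})\) approaches \(\gamma^{\,n-2-s}\), the discounted value of the shortest path to the goal.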

March 10, 2026 · 2 min · 351 words · codefrydev