TD, SARSA, and Q-Learning in Code
## Learning objectives

- Implement TD(0) prediction in code: update \(V(s)\) after each transition.
- Implement SARSA (on-policy TD control): update \(Q(s,a)\) using the next action chosen by the behavior policy.
- Implement Q-learning (off-policy TD control): update \(Q(s,a)\) using the max over next actions.

## TD(0) prediction in code

Goal: estimate \(V^\pi\) for a fixed policy \(\pi\).

Update: after each transition \((s, r, s')\):

\[ V(s) \leftarrow V(s) + \alpha \bigl[ r + \gamma V(s') - V(s) \bigr] \]

Use \(V(s') = 0\) if \(s'\) is terminal. ...
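The TD(0) update above can be sketched as a short tabular implementation. The `ChainEnv` environment and the uniform-random policy here are hypothetical stand-ins for illustration; the update rule itself follows the equation above, with terminal next states contributing \(V(s') = 0\).

```python
import random
from collections import defaultdict

class ChainEnv:
    """Hypothetical 5-state chain for illustration: start in state 2,
    action 0 moves left, action 1 moves right; episodes end at state 0
    (reward 0) or state 4 (reward 1)."""
    def reset(self):
        self.s = 2
        return self.s

    def step(self, a):
        self.s += 1 if a == 1 else -1
        done = self.s in (0, 4)
        reward = 1.0 if self.s == 4 else 0.0
        return self.s, reward, done

def td0_prediction(env, policy, num_episodes=5000, alpha=0.05, gamma=1.0):
    """Tabular TD(0) prediction for a fixed policy."""
    V = defaultdict(float)  # unseen states (including terminals) default to 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # TD(0) update: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)],
            # using V(s') = 0 when s' is terminal
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])
            s = s_next
    return V

random.seed(0)
env = ChainEnv()
uniform_random = lambda s: random.choice([0, 1])
V = td0_prediction(env, uniform_random)
```

Under the uniform-random policy this chain is a simple random walk, so the true values are the absorption probabilities \(V(1) = 0.25\), \(V(2) = 0.5\), \(V(3) = 0.75\); with a small constant \(\alpha\) the TD(0) estimates hover near these values.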