Chapter 13: SARSA (On-Policy TD Control)

Learning objectives Implement SARSA: update \(Q(s,a)\) using the transition \((s,a,r,s',a')\) with target \(r + \gamma Q(s',a')\). Use \(\epsilon\)-greedy exploration for behavior and learn the same policy you follow (on-policy). Interpret learning curves (sum of rewards per episode) on Cliff Walking. Concept and real-world RL SARSA is an on-policy TD control method: it updates \(Q(s,a)\) using the actual next action \(a'\) chosen by the current policy, so it learns the value of the behavior policy (the one you are following). The update is \(Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma Q(s',a') - Q(s,a)]\). Because \(a'\) can be exploratory, SARSA accounts for the risk of exploration (e.g. stepping off the cliff by accident) and often learns a safer policy than Q-learning on Cliff Walking. In real applications, on-policy methods are used when you want to optimize the same policy you use for data collection (e.g. safe robotics). ...
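The SARSA update and \(\epsilon\)-greedy behavior described above can be sketched as follows; this is a minimal illustration assuming a tabular \(Q\) stored in a dict keyed by `(state, action)` pairs (the function names and defaults are for illustration, not from the chapter's own code):

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Behavior policy: explore with probability epsilon, else act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.99, terminal=False):
    """One on-policy TD step: the target uses a', the action actually
    chosen next by the same epsilon-greedy policy being followed."""
    target = r if terminal else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

Because the target contains \(Q(s',a')\) for the possibly exploratory \(a'\), cliff-adjacent states inherit some of the cost of accidental exploration, which is what pushes SARSA toward the safer path on Cliff Walking.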

March 10, 2026 · 3 min · 541 words · codefrydev

TD, SARSA, and Q-Learning in Code

Learning objectives Implement TD(0) prediction in code: update \(V(s)\) after each transition. Implement SARSA (on-policy TD control): update \(Q(s,a)\) using the next action from the behavior policy. Implement Q-learning (off-policy TD control): update \(Q(s,a)\) using the max over next actions. TD(0) prediction in code Goal: Estimate \(V^\pi\) for a fixed policy \(\pi\). Update: After each transition \((s, r, s')\): \(V(s) \leftarrow V(s) + \alpha \bigl[ r + \gamma V(s') - V(s) \bigr]\). Use \(V(s') = 0\) if \(s'\) is terminal. ...
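The two remaining updates named above, TD(0) prediction and the off-policy Q-learning target, can be sketched side by side. This is a hedged sketch assuming tabular values in defaultdicts, not the post's exact implementation:

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, terminal=False):
    """TD(0) prediction: bootstrap from V(s'), treating V(terminal) as 0."""
    target = r if terminal else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])

def q_learning_update(Q, actions, s, a, r, s_next,
                      alpha=0.1, gamma=0.99, terminal=False):
    """Q-learning: the target maximizes over next actions, regardless of
    which action the behavior policy will actually take (off-policy)."""
    target = r if terminal else r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

The only structural difference from SARSA is the `max` in the target: Q-learning evaluates the greedy policy while following an exploratory one, whereas SARSA evaluates the policy it follows.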

March 10, 2026 · 2 min · 351 words · codefrydev

Chapter 16: N-Step Bootstrapping

Learning objectives Implement n-step SARSA: accumulate \(n\) steps of experience, then update \(Q(s_0,a_0)\) using the n-step return \(r_1 + \gamma r_2 + \cdots + \gamma^{n-1} r_n + \gamma^n Q(s_n,a_n)\). Compare n-step (\(n=4\)) with one-step SARSA on Cliff Walking (learning speed, stability). Understand the trade-off: n-step uses more information per update but delays the update. Concept and real-world RL N-step bootstrapping uses a return over \(n\) steps: \(G_{t:t+n} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})\) (or \(Q(s_{t+n},a_{t+n})\) for SARSA). \(n=1\) is TD(0); \(n=\infty\) (until terminal) is Monte Carlo. Intermediate \(n\) balances bias and variance. In practice, n-step methods (e.g. n-step SARSA, A3C’s n-step returns) can learn faster than one-step when \(n\) is chosen well; too large \(n\) delays updates and can hurt in non-stationary or long episodes. ...
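The n-step return defined above is easy to compute by accumulating backwards from the bootstrap value; a minimal sketch (the function name and signature are illustrative, not from the chapter):

```python
def n_step_return(rewards, gamma, bootstrap=0.0):
    """Compute r_1 + gamma*r_2 + ... + gamma^(n-1)*r_n + gamma^n * bootstrap,
    where `rewards` is [r_1, ..., r_n] and `bootstrap` is V(s_n) or Q(s_n, a_n).
    Folding from the back avoids tracking powers of gamma explicitly."""
    G = bootstrap
    for r in reversed(rewards):
        G = r + gamma * G
    return G
```

With an empty reward list this reduces to the bootstrap value, and with `bootstrap=0.0` over a full episode it reduces to the Monte Carlo return, matching the \(n=1\) to \(n=\infty\) spectrum described above.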

March 10, 2026 · 3 min · 557 words · codefrydev

Phase 3 Foundations Quiz

Use this quiz after completing Volume 1 and Volume 2 (or the Phase 3 mini-project). If you can answer at least 12 of 15 correctly, you are ready for Phase 4 and Volume 3. 1. RL framework Q: Name the four main components of an RL system (agent, environment, and two more). What is a state? Answer Agent, environment, action, reward. State: a representation of the current situation the agent uses to choose actions. 2. Return Q: For rewards [0, 0, 1] and \(\gamma = 0.9\), compute the discounted return \(G_0\) from step 0. ...
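For quiz questions like the return computation in item 2, the discounted return \(G_0 = \sum_k \gamma^k r_{k+1}\) can be checked with a one-liner (a hypothetical helper, not part of the quiz itself):

```python
def discounted_return(rewards, gamma):
    """G_0 = r_1 + gamma*r_2 + gamma^2*r_3 + ... for rewards [r_1, r_2, ...]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# For rewards [0, 0, 1] and gamma = 0.9: G_0 = 0.9^2 * 1, i.e. about 0.81.
```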

March 10, 2026 · 5 min · 876 words · codefrydev