Chapter 16: N-Step Bootstrapping
Learning objectives

- Implement n-step SARSA: accumulate \(n\) steps of experience, then update \(Q(s_0,a_0)\) using the n-step return \(r_1 + \gamma r_2 + \cdots + \gamma^{n-1} r_n + \gamma^n Q(s_n,a_n)\).
- Compare n-step SARSA (\(n=4\)) with one-step SARSA on Cliff Walking (learning speed, stability).
- Understand the trade-off: n-step uses more information per update, but each update is delayed by \(n\) steps.

Concept and real-world RL

N-step bootstrapping uses a return over \(n\) steps: \(G_{t:t+n} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})\) (or \(Q(s_{t+n},a_{t+n})\) for n-step SARSA). With \(n=1\) this reduces to TD(0); with \(n=\infty\) (bootstrapping only at the terminal state) it becomes Monte Carlo. Intermediate values of \(n\) balance bias and variance. In practice, n-step methods (e.g. n-step SARSA, or the n-step returns used in A3C) can learn faster than one-step methods when \(n\) is chosen well; too large an \(n\) delays updates and can hurt in non-stationary settings or long episodes. ...
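To make the update rule concrete, here is a minimal tabular n-step SARSA sketch. The environment interface (`env_reset`, `env_step`) and the corridor task at the bottom are illustrative assumptions, not from the text; the update itself follows the n-step return defined above, bootstrapping from \(Q(s_{\tau+n}, a_{\tau+n})\) whenever the episode has not yet terminated by step \(\tau+n\).

```python
import random

def n_step_sarsa(env_step, env_reset, n_states, n_actions,
                 n=4, alpha=0.5, gamma=1.0, epsilon=0.1,
                 episodes=300, seed=0):
    """Tabular n-step SARSA.

    env_reset() -> state; env_step(state, action) -> (next_state, reward, done).
    (Hypothetical interface chosen for this sketch.)
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def policy(s):
        # epsilon-greedy behavior policy
        if rng.random() < epsilon:
            return rng.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[s][a])

    for _ in range(episodes):
        # rewards[0] is a dummy so rewards[t] = r_t matches the text's indexing
        states, actions, rewards = [env_reset()], [], [0.0]
        actions.append(policy(states[0]))
        T, t = float('inf'), 0
        while True:
            if t < T:
                s2, r, done = env_step(states[t], actions[t])
                states.append(s2)
                rewards.append(r)
                if done:
                    T = t + 1
                else:
                    actions.append(policy(s2))
            tau = t - n + 1  # time step whose Q-value is updated now
            if tau >= 0:
                # n-step return: discounted rewards, plus a bootstrap term
                # gamma^n * Q(s_{tau+n}, a_{tau+n}) if the episode is still going
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[states[tau + n]][actions[tau + n]]
                s_tau, a_tau = states[tau], actions[tau]
                Q[s_tau][a_tau] += alpha * (G - Q[s_tau][a_tau])
            if tau == T - 1:
                break
            t += 1
    return Q

# Toy corridor (assumed for illustration): states 0..4, start at 0,
# action 1 moves right, action 0 moves left, reward -1 per step,
# episode ends at state 4.
N = 5
def reset():
    return 0
def step(s, a):
    s2 = min(s + 1, N - 1) if a == 1 else max(s - 1, 0)
    return s2, -1.0, s2 == N - 1

Q = n_step_sarsa(step, reset, N, 2, n=4)
```

Under the undiscounted corridor, the learned values should approach the optimal costs-to-go (e.g. roughly \(-1\) for moving right from the state next to the goal), with a small epsilon-greedy penalty.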