Chapter 13: SARSA (On-Policy TD Control)
Learning objectives

- Implement SARSA: update \(Q(s,a)\) using the transition \((s,a,r,s',a')\) with target \(r + \gamma Q(s',a')\).
- Use \(\epsilon\)-greedy exploration for behavior and learn the same policy you follow (on-policy).
- Interpret learning curves (sum of rewards per episode) on Cliff Walking.

Concept and real-world RL

SARSA is an on-policy TD control method: it updates \(Q(s,a)\) using the actual next action \(a'\) chosen by the current policy, so it learns the value of the behavior policy (the one you are following). The update is

\(Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma Q(s',a') - Q(s,a)]\).

Because \(a'\) can be exploratory, SARSA accounts for the risk of exploration (e.g., stepping off the cliff by accident) and often learns a safer policy than Q-learning on Cliff Walking. In real applications, on-policy methods are used when you want to optimize the same policy you use for data collection (e.g., safe robotics).

...
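The update rule and \(\epsilon\)-greedy behavior described above can be sketched as follows. This is a minimal illustrative implementation, not the Gymnasium `CliffWalking-v0` API: it assumes a 4x12 grid with start at (3,0), goal at (3,11), a cliff along the bottom row that gives reward -100 and resets the agent to the start, and -1 per step otherwise. The helper names (`step`, `eps_greedy`, `sarsa`) and hyperparameter values are illustrative choices, not from the text.

```python
import numpy as np

# Minimal 4x12 Cliff Walking grid (illustrative assumptions, see lead-in).
ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, a):
    """Apply action a; return (next_state, reward, done)."""
    r, c = state
    dr, dc = ACTIONS[a]
    r = min(max(r + dr, 0), ROWS - 1)
    c = min(max(c + dc, 0), COLS - 1)
    if r == 3 and 1 <= c <= 10:          # fell off the cliff
        return START, -100.0, False       # large penalty, back to start
    return (r, c), -1.0, (r, c) == GOAL   # -1 per step, done at goal

def eps_greedy(Q, s, eps, rng):
    """Behavior policy: random action with prob eps, else greedy w.r.t. Q."""
    if rng.random() < eps:
        return int(rng.integers(len(ACTIONS)))
    return int(np.argmax(Q[s]))

def sarsa(episodes=500, alpha=0.5, gamma=1.0, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((ROWS, COLS, len(ACTIONS)))
    returns = []                          # sum of rewards per episode
    for _ in range(episodes):
        s = START
        a = eps_greedy(Q, s, eps, rng)    # choose a from the same policy we learn
        total, done = 0.0, False
        while not done:
            s2, r, done = step(s, a)
            a2 = eps_greedy(Q, s2, eps, rng)
            # On-policy TD target uses a2, the action actually taken next:
            # r + gamma * Q(s', a'), with no bootstrap past a terminal state.
            target = r + (0.0 if done else gamma * Q[s2][a2])
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s2, a2
            total += r
        returns.append(total)
    return Q, returns
```

Plotting `returns` against episode index gives the learning curve mentioned above; because the exploratory \(a'\) enters the target, the learned greedy path tends to stay away from the cliff edge.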