Chapter 15: Expected SARSA
Learning objectives

- Implement Expected SARSA: use \(\sum_{a'} \pi(a'|s') Q(s',a')\) as the target instead of \(\max_{a'} Q(s',a')\) (Q-learning) or the single-sample \(Q(s',a')\) (SARSA).
- Relate Expected SARSA to SARSA (on-policy) and Q-learning (greedy max); it can be used on- or off-policy depending on the choice of \(\pi\).
- Compare update variance and learning curves with Q-learning.

Concept and real-world RL

Expected SARSA replaces the sampled next-action value with its expectation under a policy \(\pi\): the target is \(r + \gamma \sum_{a'} \pi(a'|s') Q(s',a')\). For an \(\epsilon\)-greedy \(\pi\) over \(|\mathcal{A}|\) actions, this expands to \(r + \gamma \left[(1-\epsilon) \max_{a'} Q(s',a') + \frac{\epsilon}{|\mathcal{A}|} \sum_{a'} Q(s',a')\right]\). Because the target averages over all next actions rather than sampling a single \(Q(s',a')\) as SARSA does, it reduces the variance of the update and can be more stable. When \(\pi\) is greedy, Expected SARSA reduces to Q-learning. In practice, it is a middle ground between SARSA and Q-learning and appears in some deep RL variants. ...
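The target above can be sketched in a few lines. This is a minimal illustration, not a full agent: the tabular `Q` array and the function names are assumptions made for this example, and the policy is the \(\epsilon\)-greedy one described in the text.

```python
import numpy as np

def expected_sarsa_target(Q, s_next, r, gamma, epsilon):
    """Expected SARSA target under an epsilon-greedy policy.

    Q: tabular action values, shape (n_states, n_actions) (assumed layout).
    Returns r + gamma * E_{a' ~ pi}[Q(s', a')].
    """
    q_next = Q[s_next]
    # epsilon-greedy expectation:
    # (1 - epsilon) weight on the greedy action, epsilon spread uniformly,
    # so the uniform part is epsilon * mean over all actions.
    expected_q = (1 - epsilon) * np.max(q_next) + epsilon * np.mean(q_next)
    return r + gamma * expected_q

def expected_sarsa_update(Q, s, a, r, s_next, alpha, gamma, epsilon):
    """One in-place TD update toward the Expected SARSA target."""
    target = expected_sarsa_target(Q, s_next, r, gamma, epsilon)
    Q[s, a] += alpha * (target - Q[s, a])
```

Setting `epsilon=0` makes the expectation collapse onto the greedy action, recovering the Q-learning target, which mirrors the reduction noted in the text.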