Chapter 14: Q-Learning (Off-Policy TD Control)

Learning objectives

- Implement Q-learning: update \(Q(s,a)\) using the target \(r + \gamma \max_{a'} Q(s',a')\) (off-policy).
- Compare Q-learning and SARSA on Cliff Walking: paths and reward curves.
- Explain why Q-learning can learn a riskier policy (along the cliff edge) than SARSA.

Concept and real-world RL

Q-learning is off-policy: it updates \(Q(s,a)\) using the greedy next action (\(\max_{a'} Q(s',a')\)), so it learns the value of the optimal policy while behaving \(\epsilon\)-greedily (or with any other exploration strategy). The update is \(Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]\). On Cliff Walking, Q-learning often converges to the shortest path along the cliff edge (high reward when there is no exploration, but dangerous if an occasional random step is taken). SARSA learns the value of the behaviour policy actually followed, exploration included, and tends to stay away from the cliff. In practice, Q-learning is simple and widely used (e.g. as the basis of DQN); when safety matters, on-policy or conservative methods may be preferred. ...
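The update rule above can be sketched as tabular Q-learning on a minimal Cliff Walking grid. This is a self-contained illustration, not the chapter's reference implementation: the 4x12 layout, reward values (-1 per step, -100 for falling off the cliff), and hyperparameters are standard-benchmark assumptions.

```python
import random

# Minimal 4x12 Cliff Walking grid (an assumed layout matching the common
# benchmark): start bottom-left, goal bottom-right, cliff cells in between.
ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, a):
    """Apply an action; stepping into the cliff gives -100 and a reset to START."""
    r, c = state
    dr, dc = ACTIONS[a]
    r = min(max(r + dr, 0), ROWS - 1)
    c = min(max(c + dc, 0), COLS - 1)
    if r == 3 and 0 < c < 11:              # cliff cells
        return START, -100.0, False
    return (r, c), -1.0, (r, c) == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=1.0, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(r, c): [0.0] * 4 for r in range(ROWS) for c in range(COLS)}
    for _ in range(episodes):
        s, done = START, False
        while not done:
            # behave epsilon-greedily (exploration)...
            if rng.random() < eps:
                a = rng.randrange(4)
            else:
                a = max(range(4), key=lambda i: Q[s][i])
            s2, r, done = step(s, a)
            # ...but update toward the greedy next action: max_a' Q(s', a')
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

def greedy_path(Q, limit=100):
    """Follow the greedy policy from START; returns the visited states."""
    s, path, done = START, [START], False
    while not done and len(path) < limit:
        a = max(range(4), key=lambda i: Q[s][i])
        s, _, done = step(s, a)
        path.append(s)
    return path

Q = q_learning()
print(greedy_path(Q))  # typically hugs the cliff edge: the risky shortest route
```

Swapping the target for the value of the action actually taken next (SARSA) makes the agent account for its own exploration noise, which is why SARSA's learned path tends to detour away from the cliff.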

March 10, 2026 · 3 min · 589 words · codefrydev