Chapter 13: SARSA (On-Policy TD Control)

Learning objectives

- Implement SARSA: update \(Q(s,a)\) using the transition \((s,a,r,s',a')\) with target \(r + \gamma Q(s',a')\).
- Use \(\epsilon\)-greedy exploration for behavior and learn the same policy you follow (on-policy).
- Interpret learning curves (sum of rewards per episode) on Cliff Walking.

Concept and real-world RL

SARSA is an on-policy TD control method: it updates \(Q(s,a)\) using the actual next action \(a'\) chosen by the current policy, so it learns the value of the behavior policy (the one you are following). The update is \(Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma Q(s',a') - Q(s,a)]\). Because \(a'\) can be exploratory, SARSA accounts for the risk of exploration (e.g. stepping off the cliff by accident) and often learns a safer policy than Q-learning on Cliff Walking. In real applications, on-policy methods are used when you want to optimize the same policy you use for data collection (e.g. safe robotics). ...
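The update rule above can be sketched in a few lines of Python. This is a minimal sketch, not the chapter's code: the dict-based tabular \(Q\), the integer state/action encoding, and the hyperparameter values are illustrative assumptions.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    # Behavior policy: random action with probability epsilon, else greedy.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy TD target: bootstraps from the next action a_next that the
    # behavior policy actually chose:
    #   Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# One illustrative transition (s, a, r, s', a') applied to an empty table:
Q = defaultdict(float)
sarsa_update(Q, s=0, a=1, r=-1.0, s_next=1, a_next=0, alpha=0.5, gamma=1.0)
# Q[(0, 1)] moves half of the way toward the target -1.0, i.e. to -0.5.
```

Because the target contains \(Q(s',a')\) for the sampled \(a'\), an exploratory (bad) next action drags the estimate down, which is exactly why SARSA avoids the cliff edge.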

March 10, 2026 · 3 min · 541 words · codefrydev

Chapter 14: Q-Learning (Off-Policy TD Control)

Learning objectives

- Implement Q-learning: update \(Q(s,a)\) using the target \(r + \gamma \max_{a'} Q(s',a')\) (off-policy).
- Compare Q-learning and SARSA on Cliff Walking: paths and reward curves.
- Explain why Q-learning can learn a riskier policy (along the cliff edge) than SARSA.

Concept and real-world RL

Q-learning is off-policy: it updates \(Q(s,a)\) using the greedy next action (\(\max_{a'} Q(s',a')\)), so it learns the value of the optimal policy while behaving with \(\epsilon\)-greedy (or any other exploration scheme). The update is \(Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]\). On Cliff Walking, Q-learning often converges to the shortest path along the cliff (high reward when there is no exploration, but dangerous if an occasional random step is taken). SARSA learns the value of the actual policy, exploration included, and tends to stay away from the cliff. In practice, Q-learning is simple and widely used (e.g. DQN); when safety matters, on-policy or conservative methods may be preferred. ...
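The off-policy target can be sketched as follows. Again a minimal sketch under illustrative assumptions (dict-based table, integer states/actions, made-up values); the point is that the target maximizes over next actions regardless of which action the behavior policy will actually take.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    # Off-policy TD target: greedy over next actions, independent of the
    # (possibly exploratory) action the behavior policy takes next:
    #   Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Illustrative transition: the greedy next action is worth +2, so the
# target is r + gamma * 2 even though an exploratory action worth -5 exists
# (SARSA would bootstrap from -5 if exploration happened to pick it).
Q = defaultdict(float)
Q[(1, 0)] = 2.0
Q[(1, 1)] = -5.0
q_learning_update(Q, s=0, a=0, r=-1.0, s_next=1, actions=[0, 1], alpha=0.5, gamma=1.0)
# target = -1 + 2 = 1, so Q[(0, 0)] moves from 0 to 0.5
```

Contrasting the two targets on the same transition makes the Cliff Walking difference concrete: Q-learning's max ignores the cost of exploratory missteps, SARSA's sampled \(a'\) does not.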

March 10, 2026 · 3 min · 589 words · codefrydev

Chapter 16: N-Step Bootstrapping

Learning objectives

- Implement n-step SARSA: accumulate \(n\) steps of experience, then update \(Q(s_0,a_0)\) using the n-step return \(r_1 + \gamma r_2 + \cdots + \gamma^{n-1} r_n + \gamma^n Q(s_n,a_n)\).
- Compare n-step (\(n=4\)) with one-step SARSA on Cliff Walking (learning speed, stability).
- Understand the trade-off: n-step uses more information per update but delays the update.

Concept and real-world RL

N-step bootstrapping uses a return over \(n\) steps: \(G_{t:t+n} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})\) (or \(Q(s_{t+n},a_{t+n})\) for SARSA). \(n=1\) is TD(0); \(n=\infty\) (running until a terminal state) is Monte Carlo. Intermediate values of \(n\) balance bias and variance. In practice, n-step methods (e.g. n-step SARSA, A3C's n-step returns) can learn faster than one-step methods when \(n\) is chosen well; too large an \(n\) delays updates and can hurt in non-stationary settings or long episodes. ...
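Computing the n-step return itself is a small piece of the algorithm and can be sketched directly from the formula. A minimal sketch: the function name and the example reward values are illustrative, and `bootstrap` stands in for \(Q(s_n, a_n)\) (or \(V(s_n)\)).

```python
def n_step_return(rewards, gamma, bootstrap):
    # G = r_1 + gamma*r_2 + ... + gamma^(n-1)*r_n + gamma^n * bootstrap,
    # where bootstrap is Q(s_n, a_n) for n-step SARSA (or V(s_n) for n-step TD).
    G = 0.0
    for i, r in enumerate(rewards):
        G += gamma ** i * r
    return G + gamma ** len(rewards) * bootstrap

# n = 4 steps of reward, then bootstrap from the value estimate at step n:
G = n_step_return([1.0, 1.0, 1.0, 1.0], gamma=0.5, bootstrap=2.0)
# 1 + 0.5 + 0.25 + 0.125 + 0.5**4 * 2 = 2.0
```

With a single-element reward list this reduces to the TD(0) target \(r + \gamma Q(s',a')\), matching the claim that \(n=1\) recovers one-step SARSA.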

March 10, 2026 · 3 min · 557 words · codefrydev

Chapter 19: Hyperparameter Tuning in Tabular RL

Learning objectives

- Run a grid search over the learning rate \(\alpha\) and exploration rate \(\epsilon\) for Q-learning.
- Aggregate results over multiple trials (e.g. mean reward per episode) and visualize them with a heatmap.
- Interpret which hyperparameter combinations work best and why.

Concept and real-world RL

Hyperparameters (e.g. \(\alpha\), \(\epsilon\), \(\gamma\)) strongly affect learning speed and final performance. Grid search tries every combination in a predefined set; it is simple but costly when there are many parameters. In practice, RL tuning often uses grid search for 2–3 key parameters, or Bayesian optimization / bandit-based tuning for larger spaces. Reporting the mean and standard deviation over multiple seeds is essential because RL is noisy. Heatmaps (e.g. \(\alpha\) vs. \(\epsilon\) with color = mean reward) make good and bad regions visible at a glance. ...
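The grid-search-plus-seeds loop can be sketched generically. This is a minimal sketch: the `train_fn(alpha, epsilon, seed)` signature is an assumption about how the training code is wrapped, and the stand-in scoring function below replaces an actual Q-learning run so the example is self-contained.

```python
import itertools
import statistics

def grid_search(train_fn, alphas, epsilons, n_seeds=5):
    # train_fn(alpha, epsilon, seed) -> scalar score, e.g. mean reward per
    # episode. Aggregate (mean, std) over seeds for each (alpha, epsilon) cell.
    results = {}
    for alpha, eps in itertools.product(alphas, epsilons):
        scores = [train_fn(alpha, eps, seed) for seed in range(n_seeds)]
        results[(alpha, eps)] = (statistics.mean(scores), statistics.pstdev(scores))
    return results

# Stand-in "training" function (hypothetical, noiseless) so the sketch runs
# end to end; its score peaks at alpha = 0.5, epsilon = 0.1.
fake_train = lambda alpha, eps, seed: -abs(alpha - 0.5) - abs(eps - 0.1)
results = grid_search(fake_train, alphas=[0.1, 0.5, 0.9], epsilons=[0.01, 0.1, 0.3])
best = max(results, key=lambda k: results[k][0])
# best == (0.5, 0.1)
```

The `results` dict maps each \((\alpha, \epsilon)\) cell to (mean, std), which is exactly the shape a heatmap plot consumes: one color per cell, mean reward as the color value.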

March 10, 2026 · 3 min · 608 words · codefrydev