TD, SARSA, and Q-Learning in Code

Learning objectives:
- Implement TD(0) prediction in code: update \(V(s)\) after each transition.
- Implement SARSA (on-policy TD control): update \(Q(s,a)\) using the next action from the behavior policy.
- Implement Q-learning (off-policy TD control): update \(Q(s,a)\) using the max over next actions.

TD(0) prediction in code. Goal: estimate \(V^\pi\) for a fixed policy \(\pi\). Update: after each transition \((s, r, s')\): \[ V(s) \leftarrow V(s) + \alpha \bigl[ r + \gamma V(s') - V(s) \bigr] \] Use \(V(s') = 0\) if \(s'\) is terminal. ...
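The TD(0) update above can be sketched end to end in a few lines. The following is a minimal, illustrative loop on a 5-state random walk; the environment, episode count, and step size are assumptions for the example, not taken from the post:

```python
import random

def td0_prediction(episodes=5000, alpha=0.1, gamma=1.0, seed=0):
    """TD(0) prediction on a 5-state random walk.

    States 1..5 are non-terminal; 0 and 6 are terminal. Reaching
    state 6 gives reward +1, everything else gives 0. The fixed
    policy moves left or right with equal probability, so the true
    values are V(i) = i / 6.
    """
    rng = random.Random(seed)
    V = {s: 0.0 for s in range(7)}  # terminal values stay at 0
    for _ in range(episodes):
        s = 3  # start in the middle
        while s not in (0, 6):
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == 6 else 0.0
            # TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V
```

With a constant step size the estimates keep fluctuating around the true values; a decaying \(\alpha\) would let them settle.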

March 10, 2026 · 2 min · 351 words · codefrydev

Chapter 14: Q-Learning (Off-Policy TD Control)

Learning objectives:
- Implement Q-learning: update \(Q(s,a)\) using the target \(r + \gamma \max_{a'} Q(s',a')\) (off-policy).
- Compare Q-learning and SARSA on Cliff Walking: paths and reward curves.
- Explain why Q-learning can learn a riskier policy (along the cliff edge) than SARSA.

Concept and real-world RL: Q-learning is off-policy: it updates \(Q(s,a)\) using the greedy next action (\(\max_{a'} Q(s',a')\)), so it learns the value of the optimal policy while you behave with \(\epsilon\)-greedy (or any exploratory policy). The update is \(Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]\). On Cliff Walking, Q-learning often converges to the shortest path along the cliff (high reward when there is no exploration, but dangerous if an occasional random step is taken). SARSA learns the value of the policy actually followed, exploration included, and tends to stay away from the cliff. In practice, Q-learning is simple and widely used (e.g. DQN); when safety matters, on-policy or conservative methods may be preferred. ...
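The off-policy update can be sketched on a toy problem. The corridor environment and hyperparameters below are illustrative assumptions for compactness; the post's experiments use Cliff Walking:

```python
import random

def q_learning(episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    """Q-learning on a 5-state corridor: states 0..4, terminal at 4.

    Actions: 0 = left, 1 = right; every step costs -1, so the
    optimal value of the start state is -4 (four steps right).
    """
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(5) for a in (0, 1)}
    for _ in range(episodes):
        s = 0
        while s != 4:
            # behave epsilon-greedily...
            if rng.random() < epsilon:
                a = rng.choice((0, 1))
            else:
                a = max((0, 1), key=lambda act: Q[(s, act)])
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = -1.0
            # ...but bootstrap on the greedy next action (off-policy)
            best_next = max(Q[(s_next, 0)], Q[(s_next, 1)])  # stays 0 at terminal
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```

Swapping `best_next` for the Q-value of the action actually taken next would turn this into SARSA.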

March 10, 2026 · 3 min · 589 words · codefrydev

Chapter 15: Expected SARSA

Learning objectives:
- Implement Expected SARSA: use \(\sum_{a'} \pi(a'|s') Q(s',a')\) as the target instead of \(\max_{a'} Q(s',a')\) or \(Q(s',a')\).
- Relate Expected SARSA to SARSA (on-policy) and Q-learning (max); it can be used on- or off-policy depending on \(\pi\).
- Compare update variance and learning curves with Q-learning.

Concept and real-world RL: Expected SARSA uses the expected next-action value under a policy \(\pi\): target = \(r + \gamma \sum_{a'} \pi(a'|s') Q(s',a')\). For \(\epsilon\)-greedy \(\pi\) this is \(r + \gamma \bigl[(1-\epsilon) \max_{a'} Q(s',a') + \frac{\epsilon}{|\mathcal{A}|} \sum_{a'} Q(s',a')\bigr]\). It reduces the variance of the update compared to SARSA, which uses a single sampled \(Q(s',a')\), and can be more stable. When \(\pi\) is greedy, Expected SARSA becomes Q-learning. In practice it is a middle ground between SARSA and Q-learning and appears in some deep RL variants. ...
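The \(\epsilon\)-greedy expectation is a one-liner; a minimal sketch of just the target computation (the function name is mine, not from the post):

```python
def expected_sarsa_target(r, q_next, epsilon, gamma=1.0):
    """Expected SARSA target under an epsilon-greedy policy.

    q_next: Q(s', a') for every action a' in the next state.
    Expectation: (1 - eps) * max(q_next) + (eps / |A|) * sum(q_next).
    """
    n = len(q_next)
    expected = (1 - epsilon) * max(q_next) + (epsilon / n) * sum(q_next)
    return r + gamma * expected
```

With `epsilon=0` the target reduces to the Q-learning target; with `epsilon=1` it averages uniformly over the actions.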

March 10, 2026 · 3 min · 618 words · codefrydev

Chapter 19: Hyperparameter Tuning in Tabular RL

Learning objectives:
- Run a grid search over learning rate \(\alpha\) and exploration rate \(\epsilon\) for Q-learning.
- Aggregate results over multiple trials (e.g. mean reward per episode) and visualize them with a heatmap.
- Interpret which hyperparameter combinations work best and why.

Concept and real-world RL: Hyperparameters (e.g. \(\alpha\), \(\epsilon\), \(\gamma\)) strongly affect learning speed and final performance. Grid search tries every combination in a predefined set; it is simple but costly when there are many parameters. In practice, RL tuning often uses grid search for 2–3 key parameters, or Bayesian optimization / bandit-based tuning for larger spaces. Reporting mean and standard deviation over multiple seeds is essential because RL is noisy. Heatmaps (e.g. \(\alpha\) vs \(\epsilon\) with color = mean reward) make good and bad regions visible at a glance. ...
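The loop structure can be sketched as follows. The corridor environment, parameter grids, and seed set are illustrative assumptions; the resulting dict of \((\alpha, \epsilon)\) cells is exactly what a heatmap would plot:

```python
import random
from itertools import product
from statistics import mean

def run_trial(alpha, epsilon, seed, episodes=200, gamma=1.0):
    """One Q-learning run on a 5-state corridor (terminal at state 4,
    reward -1 per step); returns the mean reward per episode."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(5) for a in (0, 1)}
    returns = []
    for _ in range(episodes):
        s, total = 0, 0.0
        while s != 4:
            if rng.random() < epsilon:
                a = rng.choice((0, 1))
            else:
                a = max((0, 1), key=lambda act: Q[(s, act)])
            s_next = max(0, s - 1) if a == 0 else s + 1
            total -= 1.0
            # standard Q-learning update with reward -1
            best_next = max(Q[(s_next, 0)], Q[(s_next, 1)])
            Q[(s, a)] += alpha * (-1.0 + gamma * best_next - Q[(s, a)])
            s = s_next
        returns.append(total)
    return mean(returns)

def grid_search(alphas, epsilons, seeds=(0, 1, 2)):
    """Mean reward for every (alpha, epsilon) cell, averaged over seeds."""
    return {(a, e): mean(run_trial(a, e, s) for s in seeds)
            for a, e in product(alphas, epsilons)}
```

Averaging over seeds inside each cell is the "multiple trials" step; plotting the dict with \(\alpha\) on one axis and \(\epsilon\) on the other gives the heatmap.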

March 10, 2026 · 3 min · 608 words · codefrydev

Phase 3 Foundations Quiz

Use this quiz after completing Volume 1 and Volume 2 (or the Phase 3 mini-project). If you can answer at least 12 of 15 questions correctly, you are ready for Phase 4 and Volume 3.

1. RL framework. Q: Name the four main components of an RL system (agent, environment, and two more). What is a state? Answer: agent, environment, action, reward. A state is a representation of the current situation that the agent uses to choose actions.

2. Return. Q: For rewards [0, 0, 1] and \(\gamma = 0.9\), compute the discounted return \(G_0\) from step 0. ...
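As a self-check on question 2, the discounted return can be computed with a throwaway helper (the function name is mine, not from the quiz):

```python
def discounted_return(rewards, gamma):
    """G_0 = sum over t of gamma^t * r_t, rewards indexed from step 0."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Rewards [0, 0, 1] with gamma = 0.9:
# G_0 = 0 + 0.9 * 0 + 0.81 * 1 = 0.81
```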

March 10, 2026 · 5 min · 876 words · codefrydev

Stock Trading Project with Reinforcement Learning

Beginners, halt! If you skipped ahead: this project assumes you have completed the core curriculum through temporal-difference learning and approximation methods (e.g. Volume 2 and Volume 3, or equivalent). You should understand Q-learning, state and action spaces, and at least linear function approximation. If you have not done that yet, start with the Learning path and Course outline.

Introduction: This project walks you through building a simplified RL-based stock trading agent: you define an environment (state = market/position info, actions = buy/sell/hold), a reward (e.g. profit or risk-adjusted return), and train an agent using Q-learning with function approximation. The goal is to understand how to go from theory (Q-learning, function approximation) to a concrete design and code. ...

March 10, 2026 · 4 min · 717 words · codefrydev

Tabular Methods

This page covers the tabular methods you need for the preliminary assessment: policy iteration and value iteration, the difference between Monte Carlo and TD, on-policy vs off-policy learning, and the Q-learning update rule. Back to Preliminary.

Why this matters for RL: When the state and action spaces are small enough, we can store one value per state (or per state–action pair) and update it from experience or from the model. Dynamic programming does this when we know the model; Monte Carlo and TD do it from samples. Q-learning is the canonical off-policy TD method and is the basis of many deep RL algorithms (e.g. DQN). You need to know how these methods differ and how to write the Q-learning update. ...

March 10, 2026 · 6 min · 1277 words · codefrydev