Chapter 15: Expected SARSA

Learning objectives
- Implement Expected SARSA: use \(\sum_{a'} \pi(a'|s') Q(s',a')\) as the target instead of \(\max_{a'} Q(s',a')\) or \(Q(s',a')\).
- Relate Expected SARSA to SARSA (on-policy) and Q-learning (max); it can be used on- or off-policy depending on \(\pi\).
- Compare update variance and learning curves with Q-learning.

Concept and real-world RL
Expected SARSA uses the expected next-action value under a policy \(\pi\): target = \(r + \gamma \sum_{a'} \pi(a'|s') Q(s',a')\). For an \(\epsilon\)-greedy \(\pi\), this is \(r + \gamma [(1-\epsilon) \max_{a'} Q(s',a') + \epsilon \cdot \text{(uniform average over actions)}]\). It reduces the variance of the update (compared to SARSA, which uses a single sample \(Q(s',a')\)) and can be more stable. When \(\pi\) is greedy, Expected SARSA becomes Q-learning. In practice, it is a middle ground between SARSA and Q-learning and is used in some deep RL variants. ...
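The \(\epsilon\)-greedy target above can be sketched in a few lines. This is an illustrative sketch, not code from the chapter; the function name `expected_sarsa_target` and its parameters are hypothetical.

```python
import numpy as np

def expected_sarsa_target(q_next, reward, gamma=0.99, epsilon=0.1):
    """Expected SARSA target: r + gamma * E_pi[Q(s', a')] under an
    epsilon-greedy policy pi derived from q_next. (Illustrative sketch.)"""
    n = len(q_next)
    # epsilon-greedy probabilities: epsilon/n on every action,
    # plus the remaining (1 - epsilon) mass on the greedy action
    probs = np.full(n, epsilon / n)
    probs[np.argmax(q_next)] += 1.0 - epsilon
    return reward + gamma * np.dot(probs, q_next)
```

With `epsilon=0.0` the policy is greedy and the target reduces to the Q-learning target \(r + \gamma \max_{a'} Q(s',a')\); with `epsilon=1.0` it averages uniformly over actions.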

March 10, 2026 · 3 min · 618 words · codefrydev

Chapter 33: The REINFORCE Algorithm

Learning objectives
- Implement REINFORCE (Monte Carlo policy gradient): estimate \(\nabla_\theta J\) using the return \(G_t\) from full episodes.
- Use a neural-network policy with a softmax output for discrete actions (e.g. CartPole).
- Observe and explain the high variance of gradient estimates when using raw returns \(G_t\) (no baseline).

Concept and real-world RL
REINFORCE is the simplest policy gradient algorithm: run an episode under \(\pi_\theta\), compute the return \(G_t\) from each step, and update \(\theta\) with \(\theta \leftarrow \theta + \alpha \sum_t G_t \nabla_\theta \log \pi_\theta(a_t|s_t)\). It is on-policy and Monte Carlo (it needs full episodes). The variance of \(G_t\) can be large, especially in long episodes, which makes learning slow or unstable. In game AI, REINFORCE is a baseline for more advanced methods (actor-critic, PPO); in robot control, it is rarely used alone because of sample efficiency and variance. Adding a baseline (e.g. a state-value function) reduces variance without introducing bias. ...
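The update rule above can be sketched with a linear-softmax policy instead of a neural network (the gradient of \(\log \pi\) then has a closed form). This is an illustrative sketch only; `reinforce_update` and its episode format are assumptions, not the chapter's implementation.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One REINFORCE update from a full episode.
    theta: (n_features, n_actions) weights of a linear-softmax policy.
    episode: list of (state_features, action, reward) tuples.
    (Illustrative sketch; a real agent would use a neural network.)"""
    # returns G_t computed backwards through the episode
    G, returns = 0.0, []
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (s, a, _), G_t in zip(episode, returns):
        probs = softmax(s @ theta)
        # grad of log pi(a|s) for linear softmax: outer(s, onehot(a) - probs)
        grad_log = np.outer(s, -probs)
        grad_log[:, a] += s
        theta += alpha * G_t * grad_log
    return theta
```

After an episode with positive returns, the probability of the actions that were taken goes up; with a neural-network policy the same \(G_t \nabla_\theta \log \pi\) term is computed by backpropagation.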

March 10, 2026 · 3 min · 602 words · codefrydev

Probability & Statistics

This page covers the probability and statistics you need for RL: expectations, variance, sample means, and the idea that sample averages converge to expectations. Back to Math for RL.

Core concepts

Random variables and expectation
A random variable \(X\) takes values according to some distribution. The expected value (or expectation) \(\mathbb{E}[X]\) is the long-run average if you repeat the experiment infinitely many times. For a discrete \(X\) with outcomes \(x_i\) and probabilities \(p_i\): \(\mathbb{E}[X] = \sum_i x_i p_i\). For a continuous distribution with density \(p(x)\): \(\mathbb{E}[X] = \int x \, p(x) \, dx\) (you will mostly see discrete or simple continuous cases in RL).

In reinforcement learning:
- The return (sum of discounted rewards) is a random variable because rewards and transitions can be random.
- The value function \(V(s)\) is the expected return from state \(s\).
- Multi-armed bandits: each arm has an expected reward; we estimate it from samples. ...
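The discrete-expectation formula and the convergence of sample averages can be checked numerically. A minimal sketch (the fair-die example is an assumption, not from the page):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Fair six-sided die: E[X] = sum_i x_i * p_i = 3.5
expectation = sum(x * (1 / 6) for x in range(1, 7))

# The sample average over many rolls converges to the expectation
samples = rng.integers(1, 7, size=100_000)  # uniform integers in [1, 6]
sample_mean = samples.mean()
```

With 100,000 rolls the sample mean lands very close to 3.5; this is the same mechanism by which bandit algorithms and Monte Carlo methods estimate expected rewards from samples.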

March 10, 2026 · 8 min · 1699 words · codefrydev

Probability & Statistics

This page covers the probability and statistics you need for the preliminary assessment: sample mean, unbiased sample variance, expectation vs sample average, and the law of large numbers. Back to Preliminary.

Why this matters for RL
In reinforcement learning, rewards are often random and value functions are expected returns. Bandits, Monte Carlo methods, and policy evaluation all rely on expectations and sample averages. You need to compute and interpret sample means and variances by hand and in code. ...
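The two by-hand computations the page mentions can be sketched directly from their definitions. An illustrative sketch (function names are hypothetical, not from the page):

```python
def sample_mean(xs):
    """Sample mean: sum of observations divided by n."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Unbiased sample variance: divide the sum of squared
    deviations by n - 1 (Bessel's correction), not n."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
```

Dividing by \(n-1\) instead of \(n\) makes the variance estimator unbiased, which matters when \(n\) is small, e.g. early reward estimates for a bandit arm.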

March 10, 2026 · 5 min · 1062 words · codefrydev