This track covers the math you need to read and do reinforcement learning: probability & statistics, linear algebra, and calculus. Each topic is tied to how it appears in RL (bandits, value functions, gradients). Each topic page includes practice questions with full step-by-step solutions (collapsible “Answer and explanation”) so you can check your work and see the derivation. Work through the pages in order, or use them to fill gaps after the Preliminary assessment.
Recommended order: Probability & statistics → Linear algebra → Calculus.
Why this math matters in RL#
- Probability: Rewards are often random; value functions are expected returns. Bandits, Monte Carlo methods, and policy evaluation all use expectations and sample averages.
- Linear algebra: States and observations are vectors; value functions are sometimes linear in a weight vector; neural networks are built from matrix-vector products and gradients.
- Calculus: Policy gradients and loss-based updates use derivatives and the chain rule. You do not need to derive everything by hand, but you need to understand what a gradient is and how it is used.
Quick links#
| Topic | Content | RL use |
|---|---|---|
| Probability & statistics | Expectations, variance, sample mean, distributions, law of large numbers | Bandit rewards, MC returns, policy evaluation |
| Linear algebra | Vectors, dot product, matrices, gradients | State vectors, value parameterization, gradient updates |
| Calculus | Derivatives, chain rule, partial derivatives | Policy gradient, loss gradients, backprop |
After finishing this track, take the Phase 1 self-check (10 questions). If you pass, you are ready for Phase 2 and Volume 1.
This page covers the calculus you need for RL: derivatives, the chain rule, and partial derivatives. Back to Math for RL.
Core concepts#
Derivatives#
The derivative of \(f(x)\) with respect to \(x\) is written \(f'(x)\) or \(\frac{df}{dx}\). It gives the rate of change (slope) of \(f\) at \(x\). Rules you will use:
- \(\frac{d}{dx} x^n = n x^{n-1}\)
- \(\frac{d}{dx} e^x = e^x\)
- \(\frac{d}{dx} \ln x = \frac{1}{x}\)
- \(\frac{d}{dx} \ln(1 + e^x) = \frac{e^x}{1+e^x} = \sigma(x)\) (the sigmoid)

The chart below shows the sigmoid \(\sigma(x) = \frac{e^x}{1+e^x}\): the S-shaped function that appears in policy parameterizations and is the derivative of the softplus \(\ln(1 + e^x)\).
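You can verify the softplus rule numerically. This is a minimal sketch (the function names are illustrative): a central finite difference of \(\ln(1+e^x)\) should agree with the sigmoid at the same point.

```python
import math

def sigmoid(x):
    # sigma(x) = e^x / (1 + e^x), written in the numerically stable form
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    # ln(1 + e^x)
    return math.log1p(math.exp(x))

def numeric_derivative(f, x, h=1e-6):
    # central finite-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7
print(numeric_derivative(softplus, x))  # approximately sigmoid(0.7)
print(sigmoid(x))
```

The two printed values match to several decimal places, which is a quick sanity check you can reuse for any derivative rule.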
...
This page covers the linear algebra you need for RL: vectors, dot products, matrices, matrix-vector multiplication, and the idea of gradients. Back to Math for RL.
Core concepts#
Vectors#
A vector is an ordered list of numbers, e.g. \(x = [x_1, x_2, x_3]^T\) (a column vector). We treat vectors as columns by default. The dot product of two vectors \(x\) and \(y\) of the same length is \(x^T y = \sum_i x_i y_i\). Geometrically, it is related to the angle between the vectors and their lengths: \(x^T y = |x| |y| \cos\theta\).
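The two forms of the dot product can be checked in a few lines. A minimal sketch (helper names are illustrative): compute \(x^T y\) directly, then recover it from the lengths and the angle.

```python
import math

def dot(x, y):
    # x^T y = sum_i x_i * y_i
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    # Euclidean length |x| = sqrt(x^T x)
    return math.sqrt(dot(x, x))

x = [1.0, 2.0, 3.0]
y = [4.0, 0.0, -1.0]

lhs = dot(x, y)  # 1*4 + 2*0 + 3*(-1) = 1.0
cos_theta = lhs / (norm(x) * norm(y))
rhs = norm(x) * norm(y) * cos_theta  # |x| |y| cos(theta)
print(lhs, rhs)  # both 1.0 (up to floating point)
```

In RL code you will usually see this as `w @ x` with NumPy arrays, e.g. a linear value estimate \(\hat{V}(s) = w^T x(s)\).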
...
This page covers the probability and statistics you need for RL: expectations, variance, sample means, and the idea that sample averages converge to expectations. Back to Math for RL.
Core concepts#
Random variables and expectation#
A random variable \(X\) takes values according to some distribution. The expected value (or expectation) \(\mathbb{E}[X]\) is the long-run average if you repeat the experiment infinitely many times.
- For a discrete \(X\) with outcomes \(x_i\) and probabilities \(p_i\): \(\mathbb{E}[X] = \sum_i x_i p_i\).
- For a continuous distribution with density \(p(x)\): \(\mathbb{E}[X] = \int x \, p(x) \, dx\) (you will mostly see discrete or simple continuous cases in RL).

In reinforcement learning:
- The return (sum of discounted rewards) is a random variable because rewards and transitions can be random.
- The value function \(V(s)\) is the expected return from state \(s\).
- Multi-armed bandits: each arm has an expected reward; we estimate it from samples.
...