This page covers the probability and statistics you need for RL: expectations, variance, sample means, and the idea that sample averages converge to expectations.
Core concepts
Random variables and expectation
A random variable \(X\) takes values according to some distribution. The expected value (or expectation) \(\mathbb{E}[X]\) is the long-run average if you repeat the experiment infinitely many times.
- For a discrete \(X\) with outcomes \(x_i\) and probabilities \(p_i\): \(\mathbb{E}[X] = \sum_i x_i p_i\).
- For a continuous distribution with density \(p(x)\): \(\mathbb{E}[X] = \int x \, p(x) \, dx\) (you will mostly see discrete or simple continuous cases in RL).
In reinforcement learning: The return (sum of discounted rewards) is a random variable because rewards and transitions can be random. The value function \(V(s)\) is the expected return from state \(s\). Multi-armed bandits: each arm has an expected reward; we estimate it from samples.
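The discrete expectation formula above can be sketched directly in code; the payoffs and probabilities here are a hypothetical bandit arm, not from any specific problem:

```python
# Expected value of a discrete random variable: E[X] = sum_i x_i * p_i.
# Hypothetical bandit arm paying 0, 1, or 5 with the given probabilities.
outcomes = [0.0, 1.0, 5.0]
probs = [0.5, 0.4, 0.1]

expected_value = sum(x * p for x, p in zip(outcomes, probs))
print(expected_value)  # 0*0.5 + 1*0.4 + 5*0.1 = 0.9
```

One pull of this arm gives 0, 1, or 5, never 0.9; the expectation is a property of the distribution, not of any single draw.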
Variance and sample variance
The variance of \(X\) is \(\mathrm{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]\); it measures spread. The sample variance of \(n\) observations \(x_1,\ldots,x_n\) is often computed with \(n-1\) in the denominator (unbiased): \(\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2\), where \(\bar{x} = \frac{1}{n}\sum_i x_i\) is the sample mean.
In reinforcement learning: We use sample means to estimate expected rewards (e.g. per arm in a bandit) and sample variances to measure uncertainty. Monte Carlo methods use the sample return (average of rewards along a trajectory) to estimate the expected return.
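The definition \(\mathrm{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]\) can be evaluated exactly for a small discrete distribution; the Bernoulli parameters below are just an illustrative choice:

```python
# Variance from the definition Var(X) = E[(X - E[X])^2],
# for a discrete X. Here: Bernoulli with p = 0.3 (arbitrary example).
outcomes = [0.0, 1.0]
probs = [0.7, 0.3]

mean = sum(x * p for x, p in zip(outcomes, probs))
var = sum((x - mean) ** 2 * p for x, p in zip(outcomes, probs))
print(mean, var)  # 0.3 and 0.21, matching p(1 - p)
```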
Law of large numbers
As the number of samples \(n\) grows, the sample average \(\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\) converges to the expected value \(\mathbb{E}[X]\) (under mild conditions). So we can estimate unknown expectations by averaging many observations.
In reinforcement learning: We cannot compute true \(V(s)\) in one shot; we run many episodes and average returns to get Monte Carlo estimates. Bandit algorithms use sample means of rewards per arm to estimate which arm is best.
Illustration (sample mean → expectation): As the number of samples grows, the sample average tends to get closer to the expected value. The chart below shows a typical trend: sample mean (e.g. from a bandit arm with true mean 0.4) over different sample sizes. With more pulls, the estimate stabilizes near the true mean.
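A minimal simulation of this convergence, assuming a hypothetical Bernoulli bandit arm with true mean 0.4:

```python
import random

random.seed(0)

# Hypothetical bandit arm: Bernoulli reward with true mean 0.4.
true_mean = 0.4

def arm_pull():
    return 1.0 if random.random() < true_mean else 0.0

# Sample mean over more and more pulls; it drifts toward 0.4.
estimates = {}
for n in [10, 100, 10_000]:
    estimates[n] = sum(arm_pull() for _ in range(n)) / n
    print(n, estimates[n])
```

With 10 pulls the estimate can be far off; with 10,000 it sits close to 0.4, which is the law of large numbers in action.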
Distributions you will see
- Normal (Gaussian) \(\mathcal{N}(\mu, \sigma^2)\): mean \(\mu\), variance \(\sigma^2\). Used for noise, rewards, and many continuous quantities.
- Bernoulli: takes value 1 with probability \(p\) and 0 with probability \(1-p\). Used for binary outcomes (e.g. success/fail, left/right).
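A quick empirical check (with arbitrarily chosen parameters) that sample statistics of draws from these two distributions land near their theoretical means:

```python
import random

random.seed(1)

n = 100_000
# Draws from N(0, 1) and Bernoulli(p = 0.3); parameters are illustrative.
gauss_draws = [random.gauss(0.0, 1.0) for _ in range(n)]
bern_draws = [1 if random.random() < 0.3 else 0 for _ in range(n)]

gauss_mean = sum(gauss_draws) / n  # should be close to mu = 0
bern_mean = sum(bern_draws) / n    # should be close to p = 0.3
print(gauss_mean, bern_mean)
```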
Practice questions
- Bandit: An arm has true expected reward 1. You get 4 samples: 1.2, 0.8, 1.0, 1.4. What is the sample mean? What is the (unbiased) sample variance?
Step 1 — Sample mean: \(\bar{x} = \frac{1}{n}\sum_i x_i = \frac{1.2 + 0.8 + 1.0 + 1.4}{4} = \frac{4.4}{4} = 1.1\). Step 2 — Deviations from mean: \(x_i - \bar{x}\) are \(0.1, -0.3, -0.1, 0.3\). Squared: \(0.01, 0.09, 0.01, 0.09\). Sum = \(0.20\). Step 3 — Unbiased sample variance: \(\frac{1}{n-1}\sum (x_i - \bar{x})^2 = \frac{0.20}{3} = 0.0\overline{6}\) (or about 0.067). Answer: Sample mean = 1.1; unbiased sample variance ≈ 0.067. Explanation: The sample mean 1.1 is our estimate of the arm’s expected reward (true value 1). We use \(n-1\) in the denominator so that on average the sample variance equals the true variance of the distribution. In bandits we use these to compare arms and to quantify uncertainty. Python:
x = [1.2, 0.8, 1.0, 1.4]; mean = sum(x)/len(x); var = sum((xi-mean)**2 for xi in x)/(len(x)-1); print(mean, var) gives 1.1 and about 0.067.
The chart below shows these four sample values; the mean 1.1 is the average of the bar heights.
- Concept: In one sentence, what is the difference between \(\mathbb{E}[X]\) and the sample average of 100 draws from \(X\)? When do they coincide (in the limit)?
\(\mathbb{E}[X]\) is the theoretical long-run average (a fixed number determined by the distribution). The sample average of 100 draws is the average of one finite set of observations and is random. They coincide in the limit: as the number of draws \(n \to \infty\), the sample average converges to \(\mathbb{E}[X]\) (law of large numbers). Explanation: In RL we never know true expectations (e.g. true \(V(s)\) or true arm means). We estimate them from samples; the law of large numbers justifies why averaging many returns or rewards gives a good estimate. Python:
import numpy as np; np.random.seed(42); draws = np.random.randn(100); print(draws.mean()) — sample mean of 100 draws; as \(n\) grows, it approaches \(\mathbb{E}[X]\).
The chart below illustrates convergence: sample mean approaches the true expectation as \(n\) increases (same idea as in Core concepts).
- By hand: For observations [0, 1, 2, 3, 4], compute the sample mean and the unbiased sample variance.
Step 1 — Sample mean: \(\bar{x} = \frac{0+1+2+3+4}{5} = \frac{10}{5} = 2\). Step 2 — Squared deviations: \((0-2)^2 = 4\), \((1-2)^2 = 1\), \((2-2)^2 = 0\), \((3-2)^2 = 1\), \((4-2)^2 = 4\). Sum = \(4+1+0+1+4 = 10\). Step 3 — Unbiased variance: \(\frac{1}{n-1}\sum (x_i - \bar{x})^2 = \frac{10}{4} = 2.5\). Answer: Sample mean = 2; unbiased sample variance = 2.5. Explanation: With \(n=5\) we divide by 4 so the variance estimate is unbiased. This is the same formula we use for reward variance in bandits or return variance in Monte Carlo. Python:
x = [0,1,2,3,4]; mean = sum(x)/len(x); var = sum((xi-mean)**2 for xi in x)/(len(x)-1); print(mean, var) → 2 and 2.5.
The chart below shows the five observations and their mean (2); the spread is captured by the sample variance 2.5.
- Python: Write a function sample_mean(x) that takes a list of numbers and returns their average. Then write sample_variance(x) that returns the unbiased variance. Test with [1, 2, 3, 4, 5].
Step 1 — sample_mean: Sum the list and divide by length. Step 2 — sample_variance: Compute mean, then \(\frac{1}{n-1}\sum (x_i - \bar{x})^2\). Explanation: We use \(n-1\) so the expected value of this statistic equals the population variance. In RL you’ll do similar operations on reward batches or returns.
def sample_mean(x):
    return sum(x) / len(x)

def sample_variance(x):
    n = len(x)
    mean = sample_mean(x)
    squared_deviations = [(xi - mean) ** 2 for xi in x]
    return sum(squared_deviations) / (n - 1)

# Test with [1, 2, 3, 4, 5]
data = [1, 2, 3, 4, 5]
print(sample_mean(data))      # 3.0
print(sample_variance(data))  # 2.5
Running the code on [1,2,3,4,5] gives mean 3 and variance 2.5. The chart below shows the five values and their mean.
- RL: Why do we need many episodes in Monte Carlo prediction to get a good value estimate? Relate your answer to the law of large numbers.
\(V(s)\) is the expected return from state \(s\); we don’t have the distribution, only samples (returns from episodes that visit \(s\)). We estimate \(V(s)\) by the sample average of those returns. The law of large numbers says this sample average converges to the expectation as the number of episodes (samples) increases. So we need many episodes so that our average is close to the true \(V(s)\). Explanation: With few episodes, the estimate is noisy; with many, it stabilizes. This is the same idea as estimating a bandit arm’s mean by averaging many pulls. Python: simulate MC by collecting returns from state \(s\) over many episodes, then compute
V_s = np.mean(returns_from_s). As the number of episodes grows, V_s stabilizes.
The chart below shows a typical trend: value estimate (e.g. \(V(s)\)) stabilizes as the number of episodes increases.
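The idea can be sketched end to end with a stand-in return distribution; the true value 2.0 and the Gaussian noise level are assumptions for illustration, not from any specific MDP:

```python
import random

random.seed(2)

# Monte Carlo prediction sketch for a single state s (names hypothetical):
# each episode yields a noisy return whose true expectation is 2.0.
true_value = 2.0

def sample_return():
    # Stand-in for running one episode from s and summing discounted rewards.
    return true_value + random.gauss(0.0, 1.0)

returns_from_s = [sample_return() for _ in range(5_000)]
V_s = sum(returns_from_s) / len(returns_from_s)  # sample mean of returns
print(V_s)  # close to 2.0 after many episodes
```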
- By hand: For a Bernoulli with \(p = 0.3\), what is \(\mathbb{E}[X]\) and \(\mathrm{Var}(X)\)? (E[X]=0.3, Var(X)=p(1-p)=0.21.)
Step 1 — Expectation: For Bernoulli, \(\mathbb{E}[X] = 1 \cdot p + 0 \cdot (1-p) = p = 0.3\). Step 2 — Variance: \(\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2\). For Bernoulli, \(X^2 = X\), so \(\mathbb{E}[X^2] = p\). Thus \(\mathrm{Var}(X) = p - p^2 = p(1-p) = 0.3 \times 0.7 = 0.21\). Answer: \(\mathbb{E}[X] = 0.3\); \(\mathrm{Var}(X) = 0.21\). Explanation: Bernoulli is 0/1 with probability \(p\) and \(1-p\). This formula appears when we model binary outcomes (e.g. success/fail, left/right) in RL. Python:
p = 0.3; E_X = p; Var_X = p*(1-p); print(E_X, Var_X) → 0.3 and 0.21.
For Bernoulli(\(p=0.3\)), \(\mathbb{E}[X]=0.3\) and \(\mathrm{Var}(X)=0.21\). The chart below shows the two outcomes and their probabilities.
- RL: In a bandit, we estimate \(Q(a)\) by the sample mean of rewards from arm \(a\). Why might we prefer this over using only the last reward from arm \(a\)?
The sample mean uses all rewards observed from arm \(a\), so it has lower variance than a single reward. One reward is noisy and can be far from the true expected reward; the sample mean averages out the noise and, by the law of large numbers, converges to \(Q(a)\). Using only the last reward would ignore previous information and give a much noisier estimate. Explanation: In bandit algorithms we maintain a running mean (or equivalent) per arm. That’s exactly the sample mean of all rewards from that arm so far. The last reward alone is an unbiased but high-variance estimate. Python:
rewards_from_a = []; rewards_from_a.append(r); Q_a = sum(rewards_from_a)/len(rewards_from_a) — running sample mean. Variance of this estimate decreases as more rewards are added.
The chart below shows that the standard deviation of the estimate (of \(Q(a)\)) decreases as the number of pulls increases.
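The running mean mentioned above is usually maintained incrementally rather than by storing every reward; a sketch of that update with hypothetical rewards:

```python
# Incremental form of the sample mean for a bandit arm:
# Q_n = Q_{n-1} + (r_n - Q_{n-1}) / n. Rewards below are hypothetical.
def update(q, n, r):
    # Fold in the n-th reward without storing the full history.
    return q + (r - q) / n

q, n = 0.0, 0
for r in [1.2, 0.8, 1.0, 1.4]:
    n += 1
    q = update(q, n, r)
print(q)  # ~1.1, the same as averaging all four rewards at once
```

The incremental form gives exactly the sample mean but uses constant memory, which is why bandit implementations prefer it.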
Professor’s hints
- Always use \(n-1\) (not \(n\)) in the denominator for the unbiased sample variance when you are estimating the variance of a distribution from data. Using \(n\) gives the MLE of the variance but is biased.
- In RL, “sample” usually means one trajectory or one reward draw. “Sample mean” then means averaging over many such trajectories or draws.
- When you see \(\mathbb{E}_\pi[…]\) in RL, it means “expectation under policy \(\pi\)” — i.e. over trajectories generated by following \(\pi\).
Common pitfalls
- Confusing expectation with one sample: The expected reward of an arm is not the same as the reward you got on one pull. Expectation is a property of the distribution; one sample is random.
- Using \(n\) instead of \(n-1\) for sample variance: For small \(n\) the difference matters. Stick to \(n-1\) when you want an unbiased estimate of the population variance.
- Assuming independence when it is not: In RL, consecutive rewards in one episode are often not independent. Monte Carlo still works because we average over many independent episodes.
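The \(n\) vs \(n-1\) pitfall can be checked empirically; this sketch averages both estimators over many small Gaussian samples (the sample size and trial count are arbitrary choices):

```python
import random

random.seed(3)

# Draw many samples of size n from N(0, 1) (true variance 1) and compare
# the average of the biased (divide by n) and unbiased (n - 1) estimators.
def variances(x):
    m = sum(x) / len(x)
    ss = sum((xi - m) ** 2 for xi in x)
    return ss / len(x), ss / (len(x) - 1)  # biased, unbiased

n, trials = 4, 20_000
biased_avg = unbiased_avg = 0.0
for _ in range(trials):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    b, u = variances(sample)
    biased_avg += b / trials
    unbiased_avg += u / trials

print(biased_avg)    # close to (n - 1) / n = 0.75: systematically too small
print(unbiased_avg)  # close to 1.0: unbiased
```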