Learning objectives
- Understand why a plain sample mean adapts poorly when reward distributions change over time.
- Use exponential recency-weighted average (constant step size) for nonstationary bandits.
- Implement and compare fixed step size vs. sample mean on a drifting testbed.
Theory
In nonstationary bandits, the expected reward of each arm can change over time. The sample mean update \(\bar{Q}_{n+1} = \bar{Q}_n + \frac{1}{n+1}(r - \bar{Q}_n)\) gives equal weight to all past rewards, so old data can dominate and the agent is slow to adapt.
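A minimal sketch of the incremental sample-mean update above (variable and list names are illustrative); it confirms that the running update reproduces the ordinary arithmetic mean, which is exactly why every reward, however old, keeps equal weight:

```python
# Incremental sample mean: each reward ends up with weight 1/n, so old
# data dominates once n is large and the estimate reacts slowly to change.
rewards = [1.0, 0.0, 1.0, 1.0]

q, n = 0.0, 0
for r in rewards:
    n += 1
    q += (r - q) / n  # Q_{n+1} = Q_n + (1/(n+1))(r - Q_n)

# The running estimate equals the plain arithmetic mean of all rewards.
assert abs(q - sum(rewards) / len(rewards)) < 1e-12
```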
Exponential recency-weighted average: Use a constant step size \(\alpha \in (0, 1]\):
\[ Q_{n+1}(a) = Q_n(a) + \alpha \bigl( r - Q_n(a) \bigr) \]
This weights recent rewards more heavily: \(Q_n(a)\) is a weighted average of past rewards whose weights decay exponentially backward in time, so the agent keeps adapting as the environment changes.
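Unrolling the update makes the exponential weighting explicit. Writing \(r_i\) for the reward received on the \(i\)-th pull of arm \(a\) and \(Q_1(a)\) for the initial estimate,

\[ Q_{n+1}(a) = (1-\alpha)^n Q_1(a) + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} r_i \]

so a reward observed \(k\) steps in the past carries weight \(\alpha (1-\alpha)^k\), which shrinks geometrically with \(k\).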
Choosing \(\alpha\): Larger \(\alpha\) adapts faster but is noisier; smaller \(\alpha\) is smoother but slower to react. For truly nonstationary problems, a constant \(\alpha\) (e.g. 0.1) is often better than \(1/n\) step sizes.
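A quick numerical check of this tradeoff (the step sizes are chosen for illustration): with constant \(\alpha\), the weight on a reward observed \(k\) steps ago is \(\alpha (1-\alpha)^k\), so larger \(\alpha\) concentrates the estimate on recent rewards.

```python
def weight(alpha, k):
    """Weight placed on a reward observed k steps in the past."""
    return alpha * (1 - alpha) ** k

# Fraction of total weight that falls on the most recent 10 rewards.
# The geometric series sums to 1 - (1 - alpha)**10.
for alpha in (0.01, 0.1, 0.5):
    recent = sum(weight(alpha, k) for k in range(10))
    print(f"alpha={alpha}: weight on last 10 rewards = {recent:.3f}")
```

With \(\alpha = 0.5\) almost all of the weight sits on the last 10 rewards (fast but noisy); with \(\alpha = 0.01\) less than 10% does (smooth but slow to react).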
Bandit summary, real data, and online learning
- Stationary: Sample mean or small constant \(\alpha\) both work.
- Nonstationary: Use constant \(\alpha\) (or other recency weighting).
- Real data: Often nonstationary (user preferences drift, ad click rates change). Online learning with step sizes is standard.
- Summary: Epsilon-greedy, UCB1, and Thompson Sampling can all be combined with constant step size when the environment is nonstationary.
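As a concrete instance of that last point, here is a minimal epsilon-greedy agent with a constant-step-size update; this is a sketch, and the class and parameter names are illustrative rather than from any particular library:

```python
import random

class EpsilonGreedyConstantStep:
    """Epsilon-greedy action selection with constant-step-size value updates."""

    def __init__(self, n_arms, epsilon=0.1, alpha=0.1):
        self.q = [0.0] * n_arms  # value estimates, one per arm
        self.epsilon = epsilon
        self.alpha = alpha

    def select(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.q))              # explore
        return max(range(len(self.q)), key=lambda a: self.q[a])  # exploit

    def update(self, arm, reward):
        # Q(a) <- Q(a) + alpha * (r - Q(a)): exponential recency weighting.
        self.q[arm] += self.alpha * (reward - self.q[arm])
```

Swapping the `update` body for a `1/n` step size would recover the sample-mean agent; UCB1 or Thompson Sampling would replace only `select`.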
Beginner’s exercise prompt
Implement a nonstationary 10-armed testbed: at each step, add a small random drift to each arm’s mean (e.g. \(\mu_a \leftarrow \mu_a + \mathcal{N}(0, 0.01)\)). Run an epsilon-greedy agent with (1) sample mean update \(1/n\), and (2) constant step size \(\alpha = 0.1\). Plot average reward over time. You should see that constant \(\alpha\) tracks the changing best arm better.
Code sketch
- For constant step size: \(Q(a) \leftarrow Q(a) + \alpha (r - Q(a))\) after each pull of arm \(a\).
- For the nonstationary environment: after each step, add Gaussian noise to each arm’s true mean (and optionally clip to a range).
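Putting the two bullets together, a compact sketch of the drifting testbed and both update rules (the drift scale \(\mathcal{N}(0, 0.01)\) is from the exercise prompt; episode length, reward noise, and function names are illustrative assumptions):

```python
import random

def run(alpha=None, n_arms=10, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy on a drifting 10-armed testbed.

    alpha=None -> sample-mean (1/n) updates; otherwise constant step size.
    Returns the average reward over the run.
    """
    rng = random.Random(seed)
    mu = [0.0] * n_arms   # true arm means, all start equal
    q = [0.0] * n_arms    # value estimates
    n = [0] * n_arms      # pull counts (needed for the 1/n rule)
    total = 0.0
    for _ in range(steps):
        # Epsilon-greedy action selection.
        if rng.random() < epsilon:
            a = rng.randrange(n_arms)
        else:
            a = max(range(n_arms), key=lambda i: q[i])
        r = rng.gauss(mu[a], 1.0)  # noisy reward around the current true mean
        n[a] += 1
        step = alpha if alpha is not None else 1.0 / n[a]
        q[a] += step * (r - q[a])
        total += r
        # Nonstationary drift: each arm's mean takes a small random walk.
        for i in range(n_arms):
            mu[i] += rng.gauss(0.0, 0.01)
    return total / steps
```

Comparing `run(alpha=0.1)` against `run(alpha=None)` over many seeds, and plotting per-step reward rather than just the final average, reproduces the exercise's expected result: the constant-\(\alpha\) agent tracks the drifting best arm better.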
See Chapter 2: Multi-Armed Bandits for the stationary testbed and Bandits: Why not use a library? for when to code from scratch vs. use a library.