Bandits: Nonstationary

Learning objectives

- Understand why a plain sample mean is bad when reward distributions change over time.
- Use the exponential recency-weighted average (constant step size) for nonstationary bandits.
- Implement and compare a fixed step size vs. the sample mean on a drifting testbed.

Theory

In nonstationary bandits, the expected reward of each arm can change over time. The sample-mean update \(\bar{Q}_{n+1} = \bar{Q}_n + \frac{1}{n+1}(r - \bar{Q}_n)\) gives equal weight to all past rewards, so old data can dominate and the agent is slow to adapt. ...
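As a minimal sketch of the comparison described above, the snippet below runs an epsilon-greedy agent on a small drifting testbed where each arm's true value follows an independent random walk. The helper name `run_bandit` and all parameter values (number of arms, drift scale, epsilon) are illustrative choices, not from the original post; passing `alpha=None` uses sample-mean updates, while a float uses a constant step size.

```python
import random

def run_bandit(n_arms=10, steps=2000, epsilon=0.1, alpha=None, seed=0):
    """Epsilon-greedy bandit on a drifting testbed (illustrative sketch).

    alpha=None -> sample-mean updates (step size 1/n)
    alpha=0.1  -> constant step size (exponential recency weighting)
    """
    rng = random.Random(seed)
    true_q = [0.0] * n_arms   # true arm values, all start equal
    q = [0.0] * n_arms        # action-value estimates
    counts = [0] * n_arms
    total = 0.0
    for _ in range(steps):
        # nonstationarity: every arm's true value takes a small random walk
        for a in range(n_arms):
            true_q[a] += rng.gauss(0.0, 0.01)
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            a = rng.randrange(n_arms)
        else:
            a = max(range(n_arms), key=lambda i: q[i])
        r = rng.gauss(true_q[a], 1.0)
        counts[a] += 1
        # incremental update: Q <- Q + step * (r - Q)
        step = alpha if alpha is not None else 1.0 / counts[a]
        q[a] += step * (r - q[a])
        total += r
    return total / steps

avg_const = run_bandit(alpha=0.1)   # exponential recency-weighted average
avg_mean = run_bandit(alpha=None)   # plain sample mean
print(f"constant alpha=0.1: {avg_const:.3f}  sample mean: {avg_mean:.3f}")
```

On longer horizons the constant step size typically tracks the drifting values better, since its weight on a reward received k steps ago decays like \((1-\alpha)^k\) instead of staying uniform.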

March 10, 2026 · 2 min · 363 words · codefrydev