Chapters 1–10 covered the reinforcement learning framework and multi-armed bandits through MDPs, value functions, Bellman equations, and dynamic programming (policy evaluation, policy iteration, and value iteration).
Bandits: Nonstationary
Learning objectives

- Understand why a plain sample mean is a poor estimator when reward distributions change over time.
- Use the exponential recency-weighted average (constant step size) for nonstationary bandits.
- Implement and compare a fixed step size vs. the sample mean on a drifting testbed (see the sketch below).

Theory

In nonstationary bandits, the expected reward of each arm can change over time. The sample-mean update \(\bar{Q}_{n+1} = \bar{Q}_n + \frac{1}{n+1}(r - \bar{Q}_n)\) gives equal weight to all past rewards, so old data can dominate and the agent is slow to adapt. ...
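As a concrete illustration of the comparison objective, here is a minimal sketch, assuming a 10-armed testbed whose true values drift by independent Gaussian random walks: it pits sample-average updates against a constant step size \(\alpha = 0.1\) under \(\epsilon\)-greedy action selection. The arm count, drift scale, and the helper name `run_bandit` are illustrative assumptions, not the chapter's reference code.

```python
# A minimal sketch: compare sample-average vs. constant step-size estimates
# on a drifting 10-armed testbed. N_ARMS, DRIFT, and run_bandit are
# illustrative choices, not taken from the chapter.
import numpy as np

N_ARMS, STEPS, RUNS = 10, 10_000, 200
EPSILON, ALPHA, DRIFT = 0.1, 0.1, 0.01

def run_bandit(rng, constant_step):
    q_true = np.zeros(N_ARMS)    # true action values, all start equal
    q_est = np.zeros(N_ARMS)     # estimated action values
    counts = np.zeros(N_ARMS)    # pull counts, used for sample averages
    optimal = np.zeros(STEPS)    # 1 when the currently optimal arm was chosen
    for t in range(STEPS):
        # epsilon-greedy: explore with probability EPSILON, else exploit
        if rng.random() < EPSILON:
            a = int(rng.integers(N_ARMS))
        else:
            a = int(np.argmax(q_est))
        optimal[t] = a == int(np.argmax(q_true))
        r = q_true[a] + rng.normal()           # noisy reward around the true value
        counts[a] += 1
        step = ALPHA if constant_step else 1.0 / counts[a]
        q_est[a] += step * (r - q_est[a])      # incremental update Q <- Q + step*(r - Q)
        q_true += DRIFT * rng.normal(size=N_ARMS)  # each arm's value random-walks
    return optimal

rng = np.random.default_rng(0)
for label, constant in [("sample mean", False), ("alpha = 0.1", True)]:
    pct = np.mean([run_bandit(rng, constant) for _ in range(RUNS)], axis=0)
    # how often each method picks the (moving) optimal arm late in the run
    print(f"{label:12s} optimal-action rate, last 1000 steps: {pct[-1000:].mean():.2%}")
```

On a drifting testbed like this, the constant step size typically tracks the moving optimum noticeably better than the sample mean in the long run, since its effective memory is only the most recent rewards (on the order of \(1/\alpha\) of them), while the sample mean's step \(1/(n+1)\) shrinks toward zero and freezes the estimate on stale data.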