Bandits: Optimistic Initial Values

Learning objectives

- Understand why initializing action values optimistically can encourage exploration.
- Implement optimistic initial values and compare them with epsilon-greedy on the 10-armed testbed.
- Recognize when optimistic initialization helps (stationary, deterministic-ish) and when it does not (nonstationary).

Theory

Optimistic initial values mean we set \(Q(a)\) to a value higher than the typical reward at the start (e.g. \(Q(a) = 5\) when rewards are usually in \([-2, 2]\)). The agent then greedily chooses the arm with the highest \(Q(a)\). After a pull, the running-mean update \(\bar{Q}_{n+1} = \bar{Q}_n + \frac{1}{n+1}(r - \bar{Q}_n)\) brings \(Q(a)\) down toward the true mean. So every arm looks "good" at first; as an arm is pulled, its \(Q\) drops toward reality. The agent is naturally encouraged to try all arms before settling, which is a form of exploration without epsilon. ...
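As a minimal sketch of the idea, the agent below runs a single 10-armed testbed episode with sample-average (running-mean) updates. Setting `q_init` high gives the optimistic-greedy agent (`epsilon=0`); setting `q_init=0` with `epsilon=0.1` gives the epsilon-greedy baseline. The function name, the fixed seed, and the Gaussian testbed parameters are illustrative assumptions, not code from the original post.

```python
import random

def run_bandit(q_init, epsilon, n_arms=10, steps=1000, seed=0):
    """One run on a Gaussian 10-armed testbed (illustrative sketch).

    Arm means are drawn from N(0, 1); rewards from N(mean, 1).
    Returns (fraction of optimal-arm picks, per-arm pull counts).
    """
    rng = random.Random(seed)
    true_means = [rng.gauss(0, 1) for _ in range(n_arms)]
    optimal = true_means.index(max(true_means))

    Q = [float(q_init)] * n_arms   # optimistic if q_init >> typical reward
    N = [0] * n_arms
    optimal_picks = 0

    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(n_arms)            # explore (epsilon-greedy)
        else:
            a = max(range(n_arms), key=lambda i: Q[i])  # greedy on Q
        r = rng.gauss(true_means[a], 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]  # running-mean update: Q_{n+1} = Q_n + (r - Q_n)/(n+1)
        optimal_picks += (a == optimal)

    return optimal_picks / steps, N

# Compare the two strategies on the same testbed seed:
optimistic_frac, optimistic_counts = run_bandit(q_init=5.0, epsilon=0.0)
eps_greedy_frac, _ = run_bandit(q_init=0.0, epsilon=0.1)
```

Because every \(Q(a)\) starts at 5, each pull disappoints the greedy agent and it moves on, so all ten arms get sampled early without any epsilon. A real comparison would average many independent runs rather than a single seed.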

March 10, 2026 · 2 min · 305 words · codefrydev