Bandits: Optimistic Initial Values

Learning objectives

- Understand why initializing action values optimistically can encourage exploration.
- Implement optimistic initial values and compare with epsilon-greedy on the 10-armed testbed.
- Recognize when optimistic initialization helps (stationary, deterministic-ish rewards) and when it does not (nonstationary).

Theory

Optimistic initial values mean we set \(Q(a)\) to a value higher than the typical reward at the start (e.g. \(Q(a) = 5\) when rewards are usually in \([-2, 2]\)). The agent then chooses the arm with the highest \(Q(a)\). After a pull, the running-mean update \(\bar{Q}_{n+1} = \bar{Q}_n + \frac{1}{n+1}(r - \bar{Q}_n)\) brings \(Q(a)\) down toward the true mean. So every arm looks “good” at first; as an arm is pulled, its \(Q\) drops toward reality. The agent is naturally encouraged to try all arms before settling, which is a form of exploration without epsilon. ...
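A minimal sketch of the idea, not taken from the post itself: the function name and the 3-armed Gaussian testbed below are illustrative assumptions. A purely greedy agent, started with optimistic \(Q\) values, ends up trying every arm anyway.

```python
import random

def optimistic_bandit(true_means, q_init=5.0, steps=1000, seed=0):
    """Greedy agent with optimistic initial Q-values on a Gaussian bandit (illustrative sketch)."""
    rng = random.Random(seed)
    k = len(true_means)
    q = [q_init] * k          # optimistic start: every arm looks great
    n = [0] * k               # pull counts
    for _ in range(steps):
        a = max(range(k), key=lambda i: q[i])   # pure greedy, no epsilon
        r = rng.gauss(true_means[a], 1.0)       # noisy reward from the chosen arm
        n[a] += 1
        q[a] += (r - q[a]) / n[a]               # running-mean update pulls Q down toward reality
    return q, n

q, n = optimistic_bandit([0.2, 0.8, 0.5])
```

Because every untried arm still carries the inflated \(Q = 5\), the greedy rule is forced to sample each arm at least once before it can settle.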

March 10, 2026 · 2 min · 305 words · codefrydev

Bandits: UCB1

Learning objectives

- Understand the UCB1 action-selection rule and why it explores uncertain arms.
- Implement UCB1 on the 10-armed testbed and compare with epsilon-greedy.
- Interpret the exploration bonus \(c \sqrt{\ln t / N(a)}\).

Theory

UCB1 (Upper Confidence Bound) chooses the action that maximizes an upper bound on the expected reward:

\[ a_t = \arg\max_a \left[ Q(a) + c \sqrt{\frac{\ln t}{N(a)}} \right] \]

- \(Q(a)\) is the sample-mean reward for arm \(a\).
- \(N(a)\) is how many times arm \(a\) has been pulled.
- \(t\) is the total number of pulls so far.
- \(c\) is a constant (e.g. 2) that controls exploration.

The term \(c \sqrt{\ln t / N(a)}\) is an exploration bonus: arms that have been pulled less often (small \(N(a)\)) get a higher bonus, so they are tried more. As \(N(a)\) grows, the bonus shrinks. So UCB1 explores systematically rather than randomly (unlike epsilon-greedy). ...
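The selection rule above can be sketched as follows; this is not the post's code, and the function name, the pull-each-arm-once warm-up, and the Gaussian testbed are assumptions of this sketch.

```python
import math
import random

def ucb1(true_means, c=2.0, steps=1000, seed=0):
    """UCB1 on a Gaussian bandit (illustrative sketch)."""
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k   # sample-mean reward per arm
    n = [0] * k     # pull counts N(a)
    for t in range(1, steps + 1):
        if t <= k:
            a = t - 1   # warm-up: pull each arm once so N(a) > 0
        else:
            # maximize Q(a) + c * sqrt(ln t / N(a))
            a = max(range(k),
                    key=lambda i: q[i] + c * math.sqrt(math.log(t) / n[i]))
        r = rng.gauss(true_means[a], 1.0)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]   # incremental sample mean
    return q, n
```

Note the deterministic character: given the same reward sequence, UCB1 always picks the same arms, unlike epsilon-greedy's random exploration.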

March 10, 2026 · 2 min · 319 words · codefrydev

Bandits: Thompson Sampling

Learning objectives

- Understand the Bayesian view: maintain a posterior over each arm’s reward distribution.
- Implement Thompson Sampling for Bernoulli and Gaussian rewards.
- Compare Thompson Sampling with epsilon-greedy and UCB1.

Theory (pt 1): Bernoulli bandits

Suppose each arm gives a reward of 0 or 1 (e.g. click or no click). We model arm \(a\) as Bernoulli with unknown mean \(\theta_a\). A convenient prior is the Beta distribution: \(\theta_a \sim \text{Beta}(\alpha_a, \beta_a)\). After observing \(s\) successes and \(f\) failures from arm \(a\), the posterior is \(\text{Beta}(\alpha_a + s, \beta_a + f)\). ...
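The Beta-Bernoulli update above can be sketched as a short loop; this is not the post's implementation, and the function name, the uniform Beta(1, 1) prior, and the two-armed example are assumptions of this sketch.

```python
import random

def thompson_bernoulli(true_probs, steps=2000, seed=0):
    """Thompson Sampling for Bernoulli arms with Beta posteriors (illustrative sketch)."""
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1.0] * k   # Beta(1, 1) uniform prior on each theta_a
    beta = [1.0] * k
    n = [0] * k
    for _ in range(steps):
        # sample one theta per arm from its posterior, play the arm with the best draw
        draws = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        a = max(range(k), key=lambda i: draws[i])
        r = 1 if rng.random() < true_probs[a] else 0
        alpha[a] += r       # success: alpha_a + s
        beta[a] += 1 - r    # failure: beta_a + f
        n[a] += 1
    return alpha, beta, n
```

Arms with wide posteriors occasionally produce large sampled \(\theta\) values, so uncertain arms keep getting tried; arms whose posteriors concentrate on low means are sampled less and less.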

March 10, 2026 · 2 min · 401 words · codefrydev

Bandits: Nonstationary

Learning objectives

- Understand why a plain sample mean is bad when reward distributions change over time.
- Use an exponential recency-weighted average (constant step size) for nonstationary bandits.
- Implement and compare a fixed step size vs. the sample mean on a drifting testbed.

Theory

In nonstationary bandits, the expected reward of each arm can change over time. The sample-mean update \(\bar{Q}_{n+1} = \bar{Q}_n + \frac{1}{n+1}(r - \bar{Q}_n)\) gives equal weight to all past rewards, so old data can dominate and the agent is slow to adapt. ...
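The contrast can be sketched in a few lines; the abrupt 0-to-1 reward shift below is an illustrative assumption, not the post's drifting testbed. With a constant step size \(\alpha\), the weight on a reward \(i\) steps in the past decays like \((1-\alpha)^i\), so recent rewards dominate.

```python
def constant_step_update(q, r, alpha=0.1):
    """Exponential recency-weighted average: Q <- Q + alpha * (r - Q)."""
    return q + alpha * (r - q)

# A nonstationary arm: it pays 0 for 100 pulls, then 1 for 100 pulls.
q_const, q_mean, n = 0.0, 0.0, 0
for step in range(200):
    r = 0.0 if step < 100 else 1.0
    q_const = constant_step_update(q_const, r, alpha=0.1)
    n += 1
    q_mean += (r - q_mean) / n   # plain sample mean for comparison
```

After the shift, the constant-step estimate has moved almost all the way to 1, while the sample mean is stuck near 0.5 because it still weights the stale early rewards equally.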

March 10, 2026 · 2 min · 363 words · codefrydev

Bandits: Why don't we just use a library?

Learning objectives

- Understand why this curriculum has you implement bandits (and other algorithms) from scratch.
- Know when it is appropriate to switch to a library in practice.

Why implement from scratch?

Understanding: writing the update equations and selection rules yourself forces you to understand how they work. If you only call library.solve(), you may not know what step size, prior, or exploration rule is being used, or how to debug when things go wrong. ...

March 10, 2026 · 2 min · 289 words · codefrydev

Probability & Statistics

This page covers the probability and statistics you need for the preliminary assessment: sample mean, unbiased sample variance, expectation vs. sample average, and the law of large numbers. Back to Preliminary.

Why this matters for RL

In reinforcement learning, rewards are often random and value functions are expected returns. Bandits, Monte Carlo methods, and policy evaluation all rely on expectations and sample averages. You need to compute and interpret sample means and variances by hand and in code. ...
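The two statistics named above can be computed by hand in a few lines; the data values below are an illustrative assumption, not from the post. Note the \(n - 1\) divisor (Bessel's correction), which makes the sample variance an unbiased estimator.

```python
def sample_mean(xs):
    """Plain sample mean: sum divided by n."""
    return sum(xs) / len(xs)

def sample_var(xs):
    """Unbiased sample variance: divide by n - 1 (Bessel's correction)."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
# sample_mean(data) -> 5.0, sample_var(data) -> 32/7
```

By the law of large numbers, as you draw more samples from a fixed distribution, `sample_mean` converges to the true expectation, which is exactly why the bandit posts can treat running averages as value estimates.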

March 10, 2026 · 5 min · 1062 words · codefrydev