Bandits: Thompson Sampling

Learning objectives

- Understand the Bayesian view: maintain a posterior over each arm's reward distribution.
- Implement Thompson Sampling for Bernoulli and Gaussian rewards.
- Compare Thompson Sampling with epsilon-greedy and UCB1.

Theory (pt 1): Bernoulli bandits

Suppose each arm gives a reward of 0 or 1 (e.g. click or no click). We model arm \(a\) as Bernoulli with unknown mean \(\theta_a\). A convenient prior is the Beta distribution: \(\theta_a \sim \text{Beta}(\alpha_a, \beta_a)\). After observing \(s\) successes and \(f\) failures from arm \(a\), the posterior is \(\text{Beta}(\alpha_a + s, \beta_a + f)\). ...
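The Beta-Bernoulli update above can be sketched in a few lines: sample \(\theta_a\) from each arm's posterior, play the argmax, then bump the winning arm's Beta parameters by the observed success or failure. This is a minimal illustration, not the article's implementation; the function name, the uniform Beta(1, 1) priors, and the simulated arm means are assumptions for the example.

```python
import random

def thompson_bernoulli(true_means, horizon, seed=0):
    """Sketch of Thompson Sampling for Bernoulli arms with Beta(1,1) priors.

    `true_means` holds the (unknown to the agent) click probabilities;
    they are only used here to simulate rewards.
    """
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1.0] * k  # prior alpha_a = 1 (acts like 1 + #successes)
    beta = [1.0] * k   # prior beta_a = 1 (acts like 1 + #failures)
    total_reward = 0
    for _ in range(horizon):
        # Sample theta_a ~ Beta(alpha_a, beta_a) for every arm, play the argmax.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate update: posterior is Beta(alpha + s, beta + f).
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total_reward += reward
    return total_reward, alpha, beta
```

Because sampling replaces an explicit exploration schedule, arms with uncertain (wide) posteriors still occasionally produce the largest sample and get tried, while clearly inferior arms are played less and less often.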

March 10, 2026 · 2 min · 401 words · codefrydev