Bandits: Optimistic Initial Values

Learning objectives

- Understand why initializing action values optimistically can encourage exploration.
- Implement optimistic initial values and compare with epsilon-greedy on the 10-armed testbed.
- Recognize when optimistic initialization helps (stationary, deterministic-ish rewards) and when it does not (nonstationary).

Theory

Optimistic initial values mean we set \(Q(a)\) to a value higher than the typical reward at the start (e.g. \(Q(a) = 5\) when rewards are usually in \([-2, 2]\)). The agent then chooses the arm with the highest \(Q(a)\). After a pull, the running-mean update \(\bar{Q}_{n+1} = \bar{Q}_n + \frac{1}{n+1}(r - \bar{Q}_n)\) brings \(Q(a)\) down toward the true mean. So every arm looks “good” at first; as an arm is pulled, its \(Q\) drops toward reality. The agent is naturally encouraged to try all arms before settling, which is a form of exploration without epsilon. ...
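A minimal sketch of the idea, not taken from the post itself: the function name and the 3-armed Gaussian testbed below are illustrative assumptions. A purely greedy agent, started with optimistic \(Q\) values, ends up trying every arm anyway.

```python
import random

def optimistic_bandit(true_means, q_init=5.0, steps=1000, seed=0):
    """Greedy agent with optimistic initial Q-values on a Gaussian bandit (illustrative sketch)."""
    rng = random.Random(seed)
    k = len(true_means)
    q = [q_init] * k          # optimistic start: every arm looks great
    n = [0] * k               # pull counts
    for _ in range(steps):
        a = max(range(k), key=lambda i: q[i])   # pure greedy, no epsilon
        r = rng.gauss(true_means[a], 1.0)       # noisy reward from the chosen arm
        n[a] += 1
        q[a] += (r - q[a]) / n[a]               # running-mean update pulls Q down toward reality
    return q, n

q, n = optimistic_bandit([0.2, 0.8, 0.5])
```

Because every untried arm still carries the inflated \(Q = 5\), the greedy rule is forced to sample each arm at least once before it can settle.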

March 10, 2026 · 2 min · 305 words · codefrydev

Bandits: UCB1

Learning objectives

- Understand the UCB1 action-selection rule and why it explores uncertain arms.
- Implement UCB1 on the 10-armed testbed and compare with epsilon-greedy.
- Interpret the exploration bonus \(c \sqrt{\ln t / N(a)}\).

Theory

UCB1 (Upper Confidence Bound) chooses the action that maximizes an upper bound on the expected reward:

\[ a_t = \arg\max_a \left[ Q(a) + c \sqrt{\frac{\ln t}{N(a)}} \right] \]

- \(Q(a)\) is the sample-mean reward for arm \(a\).
- \(N(a)\) is how many times arm \(a\) has been pulled.
- \(t\) is the total number of pulls so far.
- \(c\) is a constant (e.g. 2) that controls exploration.

The term \(c \sqrt{\ln t / N(a)}\) is an exploration bonus: arms that have been pulled less often (small \(N(a)\)) get a higher bonus, so they are tried more. As \(N(a)\) grows, the bonus shrinks. So UCB1 explores systematically rather than randomly (unlike epsilon-greedy). ...
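The selection rule above can be sketched as follows; this is not the post's code, and the function name, the pull-each-arm-once warm-up, and the Gaussian testbed are assumptions of this sketch.

```python
import math
import random

def ucb1(true_means, c=2.0, steps=1000, seed=0):
    """UCB1 on a Gaussian bandit (illustrative sketch)."""
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k   # sample-mean reward per arm
    n = [0] * k     # pull counts N(a)
    for t in range(1, steps + 1):
        if t <= k:
            a = t - 1   # warm-up: pull each arm once so N(a) > 0
        else:
            # maximize Q(a) + c * sqrt(ln t / N(a))
            a = max(range(k),
                    key=lambda i: q[i] + c * math.sqrt(math.log(t) / n[i]))
        r = rng.gauss(true_means[a], 1.0)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]   # incremental sample mean
    return q, n
```

Note the deterministic character: given the same reward sequence, UCB1 always picks the same arms, unlike epsilon-greedy's random exploration.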

March 10, 2026 · 2 min · 319 words · codefrydev

Bandits: Thompson Sampling

Learning objectives

- Understand the Bayesian view: maintain a posterior over each arm’s reward distribution.
- Implement Thompson Sampling for Bernoulli and Gaussian rewards.
- Compare Thompson Sampling with epsilon-greedy and UCB1.

Theory (pt 1): Bernoulli bandits

Suppose each arm gives a reward of 0 or 1 (e.g. click or no click). We model arm \(a\) as Bernoulli with unknown mean \(\theta_a\). A convenient prior is the Beta distribution: \(\theta_a \sim \text{Beta}(\alpha_a, \beta_a)\). After observing \(s\) successes and \(f\) failures from arm \(a\), the posterior is \(\text{Beta}(\alpha_a + s, \beta_a + f)\). ...
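The Beta-Bernoulli update above can be sketched as a short loop; this is not the post's implementation, and the function name, the uniform Beta(1, 1) prior, and the two-armed example are assumptions of this sketch.

```python
import random

def thompson_bernoulli(true_probs, steps=2000, seed=0):
    """Thompson Sampling for Bernoulli arms with Beta posteriors (illustrative sketch)."""
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1.0] * k   # Beta(1, 1) uniform prior on each theta_a
    beta = [1.0] * k
    n = [0] * k
    for _ in range(steps):
        # sample one theta per arm from its posterior, play the arm with the best draw
        draws = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        a = max(range(k), key=lambda i: draws[i])
        r = 1 if rng.random() < true_probs[a] else 0
        alpha[a] += r       # success: alpha_a + s
        beta[a] += 1 - r    # failure: beta_a + f
        n[a] += 1
    return alpha, beta, n
```

Arms with wide posteriors occasionally produce large sampled \(\theta\) values, so uncertain arms keep getting tried; arms whose posteriors concentrate on low means are sampled less and less.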

March 10, 2026 · 2 min · 401 words · codefrydev

Bandits: Nonstationary

Learning objectives

- Understand why a plain sample mean is bad when reward distributions change over time.
- Use an exponential recency-weighted average (constant step size) for nonstationary bandits.
- Implement and compare a fixed step size vs. the sample mean on a drifting testbed.

Theory

In nonstationary bandits, the expected reward of each arm can change over time. The sample-mean update \(\bar{Q}_{n+1} = \bar{Q}_n + \frac{1}{n+1}(r - \bar{Q}_n)\) gives equal weight to all past rewards, so old data can dominate and the agent is slow to adapt. ...
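The contrast can be sketched in a few lines; the abrupt 0-to-1 reward shift below is an illustrative assumption, not the post's drifting testbed. With a constant step size \(\alpha\), the weight on a reward \(i\) steps in the past decays like \((1-\alpha)^i\), so recent rewards dominate.

```python
def constant_step_update(q, r, alpha=0.1):
    """Exponential recency-weighted average: Q <- Q + alpha * (r - Q)."""
    return q + alpha * (r - q)

# A nonstationary arm: it pays 0 for 100 pulls, then 1 for 100 pulls.
q_const, q_mean, n = 0.0, 0.0, 0
for step in range(200):
    r = 0.0 if step < 100 else 1.0
    q_const = constant_step_update(q_const, r, alpha=0.1)
    n += 1
    q_mean += (r - q_mean) / n   # plain sample mean for comparison
```

After the shift, the constant-step estimate has moved almost all the way to 1, while the sample mean is stuck near 0.5 because it still weights the stale early rewards equally.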

March 10, 2026 · 2 min · 363 words · codefrydev

Bandits: Why don't we just use a library?

Learning objectives

- Understand why this curriculum has you implement bandits (and other algorithms) from scratch.
- Know when it is appropriate to switch to a library in practice.

Why implement from scratch?

Understanding: writing the update equations and selection rules yourself forces you to understand how they work. If you only call library.solve(), you may not know what step size, prior, or exploration rule is being used, or how to debug when things go wrong. ...

March 10, 2026 · 2 min · 289 words · codefrydev

Probability & Statistics

This page covers the probability and statistics you need for the preliminary assessment: sample mean, unbiased sample variance, expectation vs. sample average, and the law of large numbers. Back to Preliminary.

Why this matters for RL

In reinforcement learning, rewards are often random and value functions are expected returns. Bandits, Monte Carlo methods, and policy evaluation all rely on expectations and sample averages. You need to compute and interpret sample means and variances by hand and in code. ...
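The two statistics named above can be computed by hand in a few lines; the data values below are an illustrative assumption, not from the post. Note the \(n - 1\) divisor (Bessel's correction), which makes the sample variance an unbiased estimator.

```python
def sample_mean(xs):
    """Plain sample mean: sum divided by n."""
    return sum(xs) / len(xs)

def sample_var(xs):
    """Unbiased sample variance: divide by n - 1 (Bessel's correction)."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
# sample_mean(data) -> 5.0, sample_var(data) -> 32/7
```

By the law of large numbers, as you draw more samples from a fixed distribution, `sample_mean` converges to the true expectation, which is exactly why the bandit posts can treat running averages as value estimates.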

March 10, 2026 · 5 min · 1062 words · codefrydev