Bandits: UCB1
## Learning objectives

- Understand the UCB1 action-selection rule and why it explores uncertain arms.
- Implement UCB1 on the 10-armed testbed and compare with epsilon-greedy.
- Interpret the exploration bonus \(c \sqrt{\ln t / N(a)}\).

## Theory

UCB1 (Upper Confidence Bound) chooses the action that maximizes an upper bound on the expected reward:

\[ a_t = \arg\max_a \left[ Q(a) + c \sqrt{\frac{\ln t}{N(a)}} \right] \]

where:

- \(Q(a)\) is the sample-mean reward for arm \(a\).
- \(N(a)\) is the number of times arm \(a\) has been pulled.
- \(t\) is the total number of pulls so far.
- \(c\) is a constant (e.g. 2) that controls the amount of exploration.

The term \(c \sqrt{\ln t / N(a)}\) is an exploration bonus: arms that have been pulled less often (small \(N(a)\)) receive a larger bonus, so they are tried more. As \(N(a)\) grows, the bonus shrinks. UCB1 therefore explores systematically rather than randomly (unlike epsilon-greedy).

...
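The selection rule above can be sketched in a few lines of NumPy. This is a minimal sketch, not a reference implementation: it assumes the standard Gaussian 10-armed testbed (true arm means drawn from \(N(0,1)\), unit-variance rewards), and it pulls each arm once at the start so that \(N(a) > 0\) before the bonus is computed. The function name `ucb1_bandit` and the default parameters are illustrative choices.

```python
import numpy as np

def ucb1_bandit(n_arms=10, n_steps=1000, c=2.0, seed=0):
    """Run UCB1 on one Gaussian testbed instance; return per-step rewards."""
    rng = np.random.default_rng(seed)
    true_means = rng.normal(0.0, 1.0, n_arms)  # testbed assumption: q*(a) ~ N(0, 1)
    Q = np.zeros(n_arms)                       # sample-mean reward estimate per arm
    N = np.zeros(n_arms)                       # pull count per arm
    rewards = np.zeros(n_steps)
    for t in range(1, n_steps + 1):
        if t <= n_arms:
            a = t - 1                          # pull each arm once so N(a) > 0
        else:
            bonus = c * np.sqrt(np.log(t) / N)
            a = int(np.argmax(Q + bonus))      # UCB1 rule: argmax of Q(a) + bonus
        r = rng.normal(true_means[a], 1.0)     # reward ~ N(q*(a), 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]              # incremental sample-mean update
        rewards[t - 1] = r
    return rewards, true_means

rewards, true_means = ucb1_bandit()
```

Swapping the `argmax` line for an epsilon-greedy choice (random arm with probability epsilon, greedy otherwise) gives a drop-in baseline for the comparison in the learning objectives.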