Chapter 2: Multi-Armed Bandits
Learning objectives

- Implement a multi-armed bandit environment with Gaussian rewards.
- Compare epsilon-greedy and greedy policies in terms of average reward and regret.
- Recognize the exploration–exploitation trade-off in a simple setting.

Concept and real-world RL

A multi-armed bandit is an RL problem with a single state: the agent repeatedly chooses an “arm” (action) and receives a reward drawn from a distribution associated with that arm. The goal is to maximize cumulative reward. Exploration (trying different arms) is needed to discover which arm has the highest mean; exploitation (choosing the best arm so far) maximizes immediate reward. In practice, bandits model A/B testing, clinical trials, and recommender systems (which ad or item to show). The 10-armed testbed is a standard benchmark: 10 arms with different unknown means; the agent learns which arm is best from experience.

...
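A minimal environment of this kind can be sketched as follows. The class name `GaussianBandit` and its methods are illustrative choices, not from the text; the setup follows the 10-armed testbed convention of drawing each arm's true mean from a standard normal and each reward from a unit-variance Gaussian around that mean.

```python
import random


class GaussianBandit:
    """A k-armed bandit: pulling arm a yields a reward ~ N(mean[a], 1)."""

    def __init__(self, k=10, seed=0):
        self.rng = random.Random(seed)
        self.k = k
        # True arm means, hidden from the agent; drawn from N(0, 1)
        # as in the standard 10-armed testbed.
        self.means = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, arm):
        """Sample a reward for the chosen arm."""
        return self.rng.gauss(self.means[arm], 1.0)

    def best_arm(self):
        """Index of the arm with the highest true mean (for computing regret)."""
        return max(range(self.k), key=lambda a: self.means[a])
```

Because the agent never sees `self.means`, it must estimate arm values from sampled rewards, which is exactly where the exploration–exploitation trade-off arises.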
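The epsilon-greedy versus greedy comparison from the learning objectives can be sketched in one loop. The function `run_bandit` is a hypothetical helper (not from the text): it uses sample-average value estimates, and setting `epsilon=0.0` recovers the pure greedy policy. Regret here is expected regret, the gap between the best arm's mean and the chosen arm's mean, accumulated over steps.

```python
import random


def run_bandit(epsilon, steps=1000, k=10, seed=0):
    """Run one epsilon-greedy agent on a fresh k-armed Gaussian bandit.

    Returns (total_reward, total_expected_regret). epsilon=0.0 is greedy.
    """
    rng = random.Random(seed)
    means = [rng.gauss(0.0, 1.0) for _ in range(k)]  # hidden true arm means
    best_mean = max(means)

    q = [0.0] * k  # sample-average value estimates
    n = [0] * k    # pull counts per arm
    total_reward = 0.0
    regret = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                   # explore: random arm
        else:
            arm = max(range(k), key=lambda a: q[a])  # exploit: best estimate
        r = rng.gauss(means[arm], 1.0)
        n[arm] += 1
        q[arm] += (r - q[arm]) / n[arm]              # incremental mean update
        total_reward += r
        regret += best_mean - means[arm]             # expected per-step regret
    return total_reward, regret
```

Running both policies on the same seed (e.g. `run_bandit(0.1)` versus `run_bandit(0.0)`) illustrates the trade-off: the greedy agent can lock onto a suboptimal arm after a few lucky draws, while a small epsilon keeps sampling the other arms and typically finds the best one.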