Chapter 2: Multi-Armed Bandits

Learning objectives: Implement a multi-armed bandit environment with Gaussian rewards; compare epsilon-greedy and greedy policies in terms of average reward and regret; recognize the exploration–exploitation trade-off in a simple setting.

Concept and real-world RL: A multi-armed bandit is an RL problem with a single state: the agent repeatedly chooses an “arm” (action) and receives a reward drawn from a distribution associated with that arm. The goal is to maximize cumulative reward. Exploration (trying different arms) is needed to discover which arm has the highest mean; exploitation (choosing the best arm so far) maximizes immediate reward. In practice, bandits model A/B testing, clinical trials, and recommender systems (which ad or item to show). The 10-armed testbed is a standard benchmark: 10 arms with different unknown means; the agent learns from experience. ...
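The setup above can be sketched as a minimal NumPy testbed; the function name, defaults, and regret bookkeeping here are illustrative, not taken from the post:

```python
import numpy as np

def run_bandit(policy="eps_greedy", eps=0.1, k=10, steps=1000, seed=0):
    """One agent on a k-armed Gaussian testbed; returns (avg reward, regret)."""
    rng = np.random.default_rng(seed)
    true_means = rng.normal(0.0, 1.0, k)   # unknown arm means
    Q = np.zeros(k)                        # sample-average value estimates
    N = np.zeros(k, dtype=int)             # pull counts
    rewards = []
    for _ in range(steps):
        if policy == "eps_greedy" and rng.random() < eps:
            a = int(rng.integers(k))       # explore: random arm
        else:
            a = int(np.argmax(Q))          # exploit: best arm so far
        r = rng.normal(true_means[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]          # incremental sample mean
        rewards.append(r)
    # regret: what an oracle playing the best arm every step would have earned
    regret = steps * true_means.max() - sum(rewards)
    return float(np.mean(rewards)), float(regret)
```

Comparing `run_bandit("eps_greedy")` against `run_bandit("greedy")` over many seeds shows the usual picture: greedy locks onto an early lucky arm, while epsilon-greedy keeps sampling and finds the true best arm more often.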

March 10, 2026 · 4 min · 679 words · codefrydev

Bandits: Optimistic Initial Values

Learning objectives: Understand why initializing action values optimistically can encourage exploration; implement optimistic initial values and compare with epsilon-greedy on the 10-armed testbed; recognize when optimistic initialization helps (stationary, roughly deterministic problems) and when it does not (nonstationary ones).

Theory: Optimistic initial values mean we set \(Q(a)\) to a value higher than the typical reward at the start (e.g. \(Q(a) = 5\) when rewards are usually in \([-2, 2]\)). The agent then chooses the arm with the highest \(Q(a)\). After a pull, the running-mean update \(\bar{Q}_{n+1} = \bar{Q}_n + \frac{1}{n+1}(r - \bar{Q}_n)\) brings \(Q(a)\) down toward the true mean. So every arm looks “good” at first; as an arm is pulled, its \(Q\) drops toward reality. The agent is naturally encouraged to try all arms before settling, which is a form of exploration without epsilon. ...
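A minimal sketch of the idea, assuming the same Gaussian testbed as the earlier chapter (names and defaults are illustrative):

```python
import numpy as np

def optimistic_greedy(q0=5.0, k=10, steps=500, seed=0):
    """Pure-greedy agent with optimistic initial Q-values on a Gaussian testbed."""
    rng = np.random.default_rng(seed)
    true_means = rng.normal(0.0, 1.0, k)
    Q = np.full(k, q0)                 # optimistic start: every arm looks great
    N = np.zeros(k, dtype=int)
    for _ in range(steps):
        a = int(np.argmax(Q))          # greedy choice, no epsilon at all
        r = rng.normal(true_means[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]      # running mean pulls Q[a] toward the true mean
    return Q, N
```

Because each pull drags that arm's estimate down toward its real mean while untried arms stay at \(q_0\), the greedy rule cycles through all arms early, exactly the "exploration without epsilon" described above.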

March 10, 2026 · 2 min · 305 words · codefrydev

Bandits: UCB1

Learning objectives: Understand the UCB1 action-selection rule and why it explores uncertain arms; implement UCB1 on the 10-armed testbed and compare with epsilon-greedy; interpret the exploration bonus \(c \sqrt{\ln t / N(a)}\).

Theory: UCB1 (Upper Confidence Bound) chooses the action that maximizes an upper bound on the expected reward:

\[ a_t = \arg\max_a \left[ Q(a) + c \sqrt{\frac{\ln t}{N(a)}} \right] \]

Here \(Q(a)\) is the sample-mean reward for arm \(a\), \(N(a)\) is how many times arm \(a\) has been pulled, \(t\) is the total number of pulls so far, and \(c\) is a constant (e.g. 2) that controls exploration. The term \(c \sqrt{\ln t / N(a)}\) is an exploration bonus: arms that have been pulled less often (small \(N(a)\)) get a higher bonus, so they are tried more. As \(N(a)\) grows, the bonus shrinks, so UCB1 explores systematically rather than randomly (unlike epsilon-greedy). ...
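The rule above can be sketched as follows; pulling each arm once before applying the bonus formula handles the \(N(a) = 0\) case (where the bonus is effectively infinite), and the names and defaults here are illustrative:

```python
import numpy as np

def ucb1_bandit(c=2.0, k=10, steps=1000, seed=0):
    """UCB1 on a k-armed Gaussian testbed; returns (value estimates, pull counts)."""
    rng = np.random.default_rng(seed)
    true_means = rng.normal(0.0, 1.0, k)
    Q = np.zeros(k)                         # sample-mean reward per arm
    N = np.zeros(k, dtype=int)              # pull counts
    for t in range(1, steps + 1):
        if (N == 0).any():
            a = int(np.argmin(N))           # pull each arm once first
        else:
            bonus = c * np.sqrt(np.log(t) / N)
            a = int(np.argmax(Q + bonus))   # value estimate + exploration bonus
        r = rng.normal(true_means[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]           # incremental sample mean
    return Q, N
```

Inspecting `N` after a run shows the systematic behavior: the bonus forces every arm to keep being sampled occasionally, but pulls concentrate on arms whose upper bound stays high.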

March 10, 2026 · 2 min · 319 words · codefrydev

Chapter 29: Noisy Networks for Exploration

Learning objectives: Implement noisy linear layers \(y = (W + \sigma_W \odot \epsilon_W) x + (b + \sigma_b \odot \epsilon_b)\), where \(\epsilon\) is random noise (e.g. Gaussian) and the \(\sigma\) are learnable parameters; use factorized Gaussian noise to reduce the number of random samples, e.g. \(\epsilon_{i,j} = f(\epsilon_i) \cdot f(\epsilon_j)\) with \(f\) chosen so that the product has zero mean and unit variance; compare exploration (e.g. unique states visited, or the variance of actions over time) with \(\epsilon\)-greedy DQN.

Concept and real-world RL ...
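A forward-pass-only sketch of the factorized scheme in NumPy (no learning here; the function names are illustrative, and \(f(x) = \operatorname{sign}(x)\sqrt{|x|}\) is the scaling used in the NoisyNet paper):

```python
import numpy as np

def f(x):
    """Noise scaling: f(x) = sign(x) * sqrt(|x|)."""
    return np.sign(x) * np.sqrt(np.abs(x))

def noisy_linear(x, W, b, sigma_W, sigma_b, rng):
    """y = (W + sigma_W * eps_W) x + (b + sigma_b * eps_b), factorized noise:
    one noise vector per input and one per output instead of a full matrix."""
    out_dim, in_dim = W.shape
    eps_in = f(rng.normal(size=in_dim))     # in_dim samples
    eps_out = f(rng.normal(size=out_dim))   # out_dim samples
    eps_W = np.outer(eps_out, eps_in)       # eps_{i,j} = f(eps_i) * f(eps_j)
    eps_b = eps_out
    return (W + sigma_W * eps_W) @ x + (b + sigma_b * eps_b)
```

Factorization needs only `in_dim + out_dim` random draws per forward pass instead of `in_dim * out_dim`, which is the point of the trick; setting the sigmas to zero recovers an ordinary linear layer.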

March 10, 2026 · 4 min · 642 words · codefrydev

Chapter 46: Maximum Entropy RL

Learning objectives: Derive or state the maximum entropy objective: maximize \(\mathbb{E}[ \sum_t r_t + \alpha \mathcal{H}(\pi(\cdot|s_t)) ]\) (or equivalent), where \(\mathcal{H}\) is entropy; explain how the entropy term encourages exploration (higher entropy means a more uniform action distribution, so the policy tries more actions); contrast with standard expected-return maximization (no entropy bonus).

Concept and real-world RL: Maximum entropy RL adds an entropy bonus to the objective so the agent maximizes return and policy entropy. The optimal policy under this objective is more stochastic (explores more) and is often easier to learn (multiple modes, robustness). In robot control, SAC (Soft Actor-Critic) uses this idea with automatic temperature tuning; in game AI and recommendation, entropy regularization (e.g. in PPO) prevents the policy from becoming too deterministic too fast. The temperature \(\alpha\) (or equivalent) controls the trade-off between return and entropy. ...
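The objective can be computed directly for a trajectory once you have the per-state action distributions; a minimal sketch (function names are illustrative):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy H(pi) = -sum_a pi(a) log pi(a), in nats."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                     # 0 * log 0 = 0 by convention
    return float(-(p * np.log(p)).sum())

def soft_return(rewards, policies, alpha=0.2):
    """Entropy-augmented return: sum_t [ r_t + alpha * H(pi(.|s_t)) ]."""
    return sum(r + alpha * entropy(pi) for r, pi in zip(rewards, policies))
```

With \(\alpha = 0\) this reduces to the ordinary return; raising \(\alpha\) rewards trajectories whose policies stay closer to uniform, which is exactly the return-vs-entropy trade-off the temperature controls.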

March 10, 2026 · 3 min · 500 words · codefrydev

Chapter 61: The Hard Exploration Problem

Learning objectives: Run DQN with ε-greedy on a sparse-reward environment (e.g. Montezuma’s Revenge if available, or a simple maze); observe that the agent rarely discovers the first key (or goal) when rewards are sparse; explain why sparse rewards cause failure: there is no learning signal until the goal is reached, and random exploration is unlikely to reach it.

Concept and real-world RL: Hard exploration occurs when the reward is sparse (e.g. only at the goal): the agent gets no signal until it accidentally reaches the goal, which may require a long, specific sequence of actions. In game AI (Montezuma’s Revenge, Pitfall), ε-greedy DQN fails because random exploration almost never finds the key. In robot navigation and recommendation, sparse rewards (e.g. “user clicked” or “reached goal”) similarly make learning slow. This motivates intrinsic motivation, curiosity, and hierarchical methods. ...
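The "random exploration almost never gets there" claim is easy to demonstrate on a toy chain where the goal requires a specific action at every step; this simulation is an illustrative sketch, not from the post:

```python
import numpy as np

def random_success_rate(chain_len, episodes=2000, seed=0):
    """Fraction of purely random episodes that reach the goal at the end of a
    1-D chain, where one wrong action (prob 0.5 per step) ends the attempt."""
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(episodes):
        pos = 0
        for _ in range(chain_len):
            if rng.random() < 0.5:   # the single correct action
                pos += 1
            else:                    # any wrong action wastes the episode
                break
        wins += (pos == chain_len)
    return wins / episodes
```

The success probability is \(0.5^L\) for chain length \(L\), so a 10-step goal is found in roughly 1 episode in 1000: with no reward until then, a value-based learner gets essentially no gradient signal.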

March 10, 2026 · 3 min · 489 words · codefrydev

Chapter 62: Intrinsic Motivation

Learning objectives: Design an intrinsic reward based on state-visitation counts: bonus = \(1/\sqrt{\text{count}}\) (or similar), so rarely visited states are more attractive; implement an agent that uses total reward = extrinsic + intrinsic and compare its exploration behavior (e.g. coverage of the state space) with an agent that uses only extrinsic reward; relate this to curiosity and exploration in game AI and robot navigation.

Concept and real-world RL: Intrinsic motivation gives the agent a bonus for visiting novel or surprising states, so it explores even when extrinsic reward is sparse. The count-based bonus \(1/\sqrt{N(s)}\) (inverse square root of visit count) encourages visiting states that have been seen fewer times. In game AI and robot navigation, this can help discover the goal; in recommendation, novelty bonuses encourage diversity. The combination extrinsic + intrinsic balances exploitation (reward) and exploration (novelty). ...
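The count-based bonus is a few lines for tabular states; a minimal sketch (class name and the `beta` scale are illustrative):

```python
import math
from collections import Counter

class CountBonus:
    """Count-based exploration bonus: r_int(s) = beta / sqrt(N(s))."""

    def __init__(self, beta=1.0):
        self.counts = Counter()   # N(s), keyed by any hashable state
        self.beta = beta

    def bonus(self, state):
        """Record a visit to `state` and return its intrinsic reward."""
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])
```

During training the agent would optimize `r_extrinsic + cb.bonus(s)`; the bonus decays as \(1/\sqrt{N(s)}\), so novelty dominates early and fades as a state becomes familiar.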

March 10, 2026 · 3 min · 487 words · codefrydev

Chapter 66: Go-Explore Algorithm

Learning objectives: Implement a simplified Go-Explore with an archive of promising states and a strategy to return to them and explore further; explain the two-phase idea: (1) archive states that lead to high rewards or novelty, (2) select from the archive, return to that state, then take exploratory actions; compare Go-Explore with random exploration (e.g. episodes to reach the goal, or maximum reward reached) on a deterministic maze; identify why “return” (resetting to an archived state) helps in hard exploration compared to always starting from the initial state; relate Go-Explore to game AI (e.g. Montezuma’s Revenge) and robot navigation with sparse goals.

Concept and real-world RL ...
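The return-then-explore loop can be sketched on a deterministic empty grid; the archive here just stores reached cells (a real implementation also stores how to reach them), and all names and defaults are illustrative:

```python
import numpy as np

def go_explore(size=8, iters=300, explore_steps=10, seed=0):
    """Simplified Go-Explore on a deterministic grid: archive reached cells,
    'return' to a random archived cell, then 'explore' with random actions."""
    rng = np.random.default_rng(seed)
    goal = (size - 1, size - 1)
    archive = {(0, 0)}                          # cells we know how to reach
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    for _ in range(iters):
        cells = sorted(archive)
        pos = cells[rng.integers(len(cells))]   # return: reset to archived state
        for _ in range(explore_steps):          # explore from the frontier
            dx, dy = moves[rng.integers(4)]
            pos = (min(max(pos[0] + dx, 0), size - 1),
                   min(max(pos[1] + dy, 0), size - 1))
            archive.add(pos)                    # archive every cell reached
            if pos == goal:
                return True, len(archive)
    return False, len(archive)
```

The key difference from plain random exploration is the reset: each burst of random actions starts from the archive's frontier rather than from the initial state, so progress compounds instead of being re-rolled every episode.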

March 10, 2026 · 4 min · 754 words · codefrydev

RL Framework

This page covers the core RL framework you need for the preliminary assessment: the four main components, the Markov property, exploration vs exploitation, and the discount factor.

Why this matters for RL: Every RL problem is defined by who acts (agent), what they interact with (environment), what they observe (state), what they can do (actions), and what feedback they get (reward). The Markov property and the discount factor shape how we define value functions and algorithms. Exploration vs exploitation is the central tension in learning from experience. ...
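The discount factor's role is easiest to see by computing a discounted return directly; a minimal sketch (function name is illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..., folded right-to-left
    using G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With \(\gamma < 1\) distant rewards are geometrically down-weighted, which is what keeps infinite-horizon values finite and what every value function in later chapters is built on.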

March 10, 2026 · 6 min · 1198 words · codefrydev