Exploration
10-armed bandit testbed: ε-greedy vs. purely greedy action selection.
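A minimal sketch of this comparison, assuming Gaussian arms (true means drawn from N(0,1), unit-variance rewards) and incremental sample-average value estimates; `run_bandit` is a hypothetical helper, not a library function:

```python
import numpy as np

def run_bandit(eps, steps=1000, k=10, seed=0):
    """One ε-greedy agent on a k-armed Gaussian testbed (sketch)."""
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0, 1, k)       # true arm values (hidden from agent)
    Q = np.zeros(k)                    # estimated values
    N = np.zeros(k)                    # pull counts
    total = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(k))   # explore: random arm
        else:
            a = int(np.argmax(Q))      # exploit: current best estimate
        r = rng.normal(q_true[a], 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]      # incremental sample average
        total += r
    return total / steps
```

Averaged over many testbed instances, ε = 0.1 typically earns more than the pure greedy agent (ε = 0), which tends to lock onto a suboptimal arm early.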
Using optimistic initial Q-values to encourage early exploration in multi-armed bandits.
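The idea can be shown with a purely greedy agent whose Q-values start high, on the same hypothetical Gaussian testbed as above; every un-tried arm looks better than any tried one, which forces systematic early exploration:

```python
import numpy as np

def optimistic_greedy(q_init, steps=500, k=10, seed=0):
    """Greedy agent with optimistic initial Q-values (sketch)."""
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0, 1, k)       # true arm values
    Q = np.full(k, float(q_init))      # optimistic estimates, e.g. 5.0
    N = np.zeros(k)
    rewards = []
    for _ in range(steps):
        a = int(np.argmax(Q))          # greedy, yet explores while Q is inflated
        r = rng.normal(q_true[a], 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]      # pulling an arm deflates its estimate
        rewards.append(r)
    return np.array(rewards), N
```

With q_init well above any plausible reward (say 5.0 here), each pull disappoints the estimate, so the argmax cycles through the arms before settling.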
Upper Confidence Bound (UCB1) algorithm for multi-armed bandits: balancing exploration and exploitation using uncertainty estimates.
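A sketch of the selection rule, score(a) = Q(a) + c·sqrt(ln t / N(a)), where the bonus shrinks as an arm's pull count grows; `pull` is a caller-supplied reward function and `c` a tunable exploration constant (classic UCB1 uses c = sqrt(2)):

```python
import math

def ucb1(pull, k, steps, c=2.0):
    """UCB1 sketch: pull each arm once, then pick argmax of Q + bonus."""
    Q = [0.0] * k
    N = [0] * k
    for a in range(k):                 # initialization: try every arm once
        Q[a] = pull(a)
        N[a] = 1
    for t in range(k + 1, steps + 1):
        scores = [Q[a] + c * math.sqrt(math.log(t) / N[a]) for a in range(k)]
        a = max(range(k), key=lambda i: scores[i])
        r = pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]      # incremental sample average
    return Q, N
```

Unlike ε-greedy, exploration here is directed: arms with few pulls get a large confidence bonus, so no arm is neglected forever, yet the best arm dominates over time.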
Core concepts: agent, environment, state, action, reward, the Markov property, the exploration-exploitation trade-off, and the discount factor, each with an explanation.
Noisy linear layers with factorized Gaussian noise (NoisyNets); comparison with ε-greedy.
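A forward-pass-only NumPy sketch of a factorized-Gaussian noisy layer (no training loop; the sigma initialization follows the common sigma0/sqrt(n_in) convention, an assumption here). Exploration comes from resampling the weight noise each forward pass, rather than from ε-greedy dithering on actions:

```python
import numpy as np

def f(x):
    """Factorized-noise scaling: sign(x) * sqrt(|x|)."""
    return np.sign(x) * np.sqrt(np.abs(x))

class NoisyLinear:
    """Factorized-Gaussian noisy linear layer (sketch, forward only).

    Effective weights: W = mu_w + sigma_w * outer(f(eps_out), f(eps_in)),
    so only n_in + n_out noise samples are drawn per pass, not n_in * n_out.
    """
    def __init__(self, n_in, n_out, sigma0=0.5, seed=0):
        self.rng = np.random.default_rng(seed)
        bound = 1 / np.sqrt(n_in)
        self.mu_w = self.rng.uniform(-bound, bound, (n_out, n_in))
        self.mu_b = self.rng.uniform(-bound, bound, n_out)
        self.sigma_w = np.full((n_out, n_in), sigma0 / np.sqrt(n_in))
        self.sigma_b = np.full(n_out, sigma0 / np.sqrt(n_in))
        self.n_in, self.n_out = n_in, n_out

    def forward(self, x, noisy=True):
        if not noisy:                  # deterministic (evaluation) path
            return self.mu_w @ x + self.mu_b
        eps_in = f(self.rng.normal(size=self.n_in))
        eps_out = f(self.rng.normal(size=self.n_out))
        W = self.mu_w + self.sigma_w * np.outer(eps_out, eps_in)
        b = self.mu_b + self.sigma_b * eps_out
        return W @ x + b
```

In a full NoisyNet DQN the sigma parameters are learned, so the network can anneal its own exploration per weight; this sketch keeps them fixed.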
Max-entropy objective; why entropy encourages exploration.
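The entropy-exploration link is visible already in a softmax (Boltzmann) policy π(a) ∝ exp(Q(a)/α): as the temperature α grows, the policy's entropy grows and action choice spreads out. A small sketch (the helper names are illustrative):

```python
import numpy as np

def soft_policy(q, alpha):
    """Softmax policy pi(a) proportional to exp(Q(a)/alpha) (sketch)."""
    z = q / alpha
    z -= z.max()                       # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    """Shannon entropy H(p) = -sum p log p."""
    return -np.sum(p * np.log(p + 1e-12))
```

At low α the policy collapses onto the argmax action (low entropy, pure exploitation); at high α it approaches uniform (maximum entropy), which is why adding an entropy term to the objective keeps the policy stochastic and exploring.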
DQN with ε-greedy on Montezuma's Revenge: why undirected exploration fails under sparse rewards.
State-visitation count bonus; exploration in a gridworld.
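The bonus itself is a few lines: add r+ = β / sqrt(N(s)) to the environment reward, where N(s) counts visits to state s. A sketch with β as a hypothetical scale parameter:

```python
import math
from collections import defaultdict

class CountBonus:
    """Count-based exploration bonus r+ = beta / sqrt(N(s)) (sketch)."""

    def __init__(self, beta=0.1):
        self.counts = defaultdict(int)  # visit counts, keyed by hashable state
        self.beta = beta

    def bonus(self, state):
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])
```

In a tabular gridworld the state (e.g. an (x, y) tuple) is directly countable; the bonus decays toward zero as a cell is revisited, steering the agent toward novel cells.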
Simplified Go-Explore on a deterministic maze: archive visited cells and return to them before exploring further.
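The archive-and-return loop can be sketched as follows, assuming a deterministic transition function `step(state, action) -> state` (a hypothetical interface): the archive maps each discovered cell to an action sequence reaching it, "return" is exact replay, and "explore" is a short random rollout that archives any new cells it finds:

```python
import random

def go_explore(step, start, n_iters=200, horizon=10, seed=0):
    """Simplified Go-Explore for a deterministic environment (sketch)."""
    rng = random.Random(seed)
    archive = {start: []}                  # cell -> actions reaching it
    moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    for _ in range(n_iters):
        cell = rng.choice(list(archive))   # select a cell from the archive
        state = start
        for a in archive[cell]:            # return: replay deterministically
            state = step(state, a)
        actions = list(archive[cell])
        for _ in range(horizon):           # explore: short random rollout
            a = rng.choice(moves)
            state = step(state, a)
            actions.append(a)
            if state not in archive:       # keep first trajectory to each cell
                archive[state] = list(actions)
    return archive
```

Determinism is what makes replay-based "return" exact; the full algorithm replaces it with goal-conditioned policies or state restoration, and selects cells by novelty heuristics rather than uniformly.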
Review Volume 6 (Model-Based RL, MCTS, Dyna-Q, world models) and preview Volume 7 (Exploration — intrinsic motivation, curiosity, and sparse rewards).
Review Volume 7 (Exploration, ICM, RND, Go-Explore, Meta-RL) and preview Volume 8 (Offline RL, Imitation Learning, RLHF).