Chapter 2: Multi-Armed Bandits

Learning objectives: Implement a multi-armed bandit environment with Gaussian rewards; compare epsilon-greedy and greedy policies in terms of average reward and regret; recognize the exploration–exploitation trade-off in a simple setting.

Concept and real-world RL: A multi-armed bandit is an RL problem with a single state: the agent repeatedly chooses an “arm” (action) and receives a reward drawn from a distribution associated with that arm. The goal is to maximize cumulative reward. Exploration (trying different arms) is needed to discover which arm has the highest mean; exploitation (choosing the best arm so far) maximizes immediate reward. In practice, bandits model A/B testing, clinical trials, and recommender systems (which ad or item to show). The 10-armed testbed is a standard benchmark: 10 arms with different unknown means; the agent learns from experience. ...
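The setup above can be sketched as a minimal NumPy testbed; the function name, defaults, and regret bookkeeping here are illustrative, not taken from the post:

```python
import numpy as np

def run_bandit(policy="eps_greedy", eps=0.1, k=10, steps=1000, seed=0):
    """One agent on a k-armed Gaussian testbed; returns (avg reward, regret)."""
    rng = np.random.default_rng(seed)
    true_means = rng.normal(0.0, 1.0, k)   # unknown arm means
    Q = np.zeros(k)                        # sample-average value estimates
    N = np.zeros(k, dtype=int)             # pull counts
    rewards = []
    for _ in range(steps):
        if policy == "eps_greedy" and rng.random() < eps:
            a = int(rng.integers(k))       # explore: random arm
        else:
            a = int(np.argmax(Q))          # exploit: best arm so far
        r = rng.normal(true_means[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]          # incremental sample mean
        rewards.append(r)
    # regret: what an oracle playing the best arm every step would have earned
    regret = steps * true_means.max() - sum(rewards)
    return float(np.mean(rewards)), float(regret)
```

Comparing `run_bandit("eps_greedy")` against `run_bandit("greedy")` over many seeds shows the usual picture: greedy locks onto an early lucky arm, while epsilon-greedy keeps sampling and finds the true best arm more often.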

March 10, 2026 · 4 min · 679 words · codefrydev

Bandits: Optimistic Initial Values

Learning objectives: Understand why initializing action values optimistically can encourage exploration; implement optimistic initial values and compare with epsilon-greedy on the 10-armed testbed; recognize when optimistic initialization helps (stationary, roughly deterministic problems) and when it does not (nonstationary ones).

Theory: Optimistic initial values mean we set \(Q(a)\) to a value higher than the typical reward at the start (e.g. \(Q(a) = 5\) when rewards are usually in \([-2, 2]\)). The agent then chooses the arm with the highest \(Q(a)\). After a pull, the running-mean update \(\bar{Q}_{n+1} = \bar{Q}_n + \frac{1}{n+1}(r - \bar{Q}_n)\) brings \(Q(a)\) down toward the true mean. So every arm looks “good” at first; as an arm is pulled, its \(Q\) drops toward reality. The agent is naturally encouraged to try all arms before settling, which is a form of exploration without epsilon. ...
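A minimal sketch of the idea, assuming the same Gaussian testbed as the earlier chapter (names and defaults are illustrative):

```python
import numpy as np

def optimistic_greedy(q0=5.0, k=10, steps=500, seed=0):
    """Pure-greedy agent with optimistic initial Q-values on a Gaussian testbed."""
    rng = np.random.default_rng(seed)
    true_means = rng.normal(0.0, 1.0, k)
    Q = np.full(k, q0)                 # optimistic start: every arm looks great
    N = np.zeros(k, dtype=int)
    for _ in range(steps):
        a = int(np.argmax(Q))          # greedy choice, no epsilon at all
        r = rng.normal(true_means[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]      # running mean pulls Q[a] toward the true mean
    return Q, N
```

Because each pull drags that arm's estimate down toward its real mean while untried arms stay at \(q_0\), the greedy rule cycles through all arms early, exactly the "exploration without epsilon" described above.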

March 10, 2026 · 2 min · 305 words · codefrydev

Bandits: UCB1

Learning objectives: Understand the UCB1 action-selection rule and why it explores uncertain arms; implement UCB1 on the 10-armed testbed and compare with epsilon-greedy; interpret the exploration bonus \(c \sqrt{\ln t / N(a)}\).

Theory: UCB1 (Upper Confidence Bound) chooses the action that maximizes an upper bound on the expected reward:

\[ a_t = \arg\max_a \left[ Q(a) + c \sqrt{\frac{\ln t}{N(a)}} \right] \]

Here \(Q(a)\) is the sample-mean reward for arm \(a\), \(N(a)\) is how many times arm \(a\) has been pulled, \(t\) is the total number of pulls so far, and \(c\) is a constant (e.g. 2) that controls exploration. The term \(c \sqrt{\ln t / N(a)}\) is an exploration bonus: arms that have been pulled less often (small \(N(a)\)) get a higher bonus, so they are tried more. As \(N(a)\) grows, the bonus shrinks, so UCB1 explores systematically rather than randomly (unlike epsilon-greedy). ...
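The rule above can be sketched as follows; pulling each arm once before applying the bonus formula handles the \(N(a) = 0\) case (where the bonus is effectively infinite), and the names and defaults here are illustrative:

```python
import numpy as np

def ucb1_bandit(c=2.0, k=10, steps=1000, seed=0):
    """UCB1 on a k-armed Gaussian testbed; returns (value estimates, pull counts)."""
    rng = np.random.default_rng(seed)
    true_means = rng.normal(0.0, 1.0, k)
    Q = np.zeros(k)                         # sample-mean reward per arm
    N = np.zeros(k, dtype=int)              # pull counts
    for t in range(1, steps + 1):
        if (N == 0).any():
            a = int(np.argmin(N))           # pull each arm once first
        else:
            bonus = c * np.sqrt(np.log(t) / N)
            a = int(np.argmax(Q + bonus))   # value estimate + exploration bonus
        r = rng.normal(true_means[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]           # incremental sample mean
    return Q, N
```

Inspecting `N` after a run shows the systematic behavior: the bonus forces every arm to keep being sampled occasionally, but pulls concentrate on arms whose upper bound stays high.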

March 10, 2026 · 2 min · 319 words · codefrydev

Chapter 29: Noisy Networks for Exploration

Learning objectives: Implement noisy linear layers \(y = (W + \sigma_W \odot \epsilon_W) x + (b + \sigma_b \odot \epsilon_b)\), where \(\epsilon\) is random noise (e.g. Gaussian) and the \(\sigma\) are learnable parameters; use factorized Gaussian noise to reduce the number of random samples, e.g. \(\epsilon_{i,j} = f(\epsilon_i) \cdot f(\epsilon_j)\) with \(f\) chosen so that the product has zero mean and unit variance; compare exploration (e.g. unique states visited, or the variance of actions over time) with \(\epsilon\)-greedy DQN.

Concept and real-world RL ...
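A forward-pass-only sketch of the factorized scheme in NumPy (no learning here; the function names are illustrative, and \(f(x) = \operatorname{sign}(x)\sqrt{|x|}\) is the scaling used in the NoisyNet paper):

```python
import numpy as np

def f(x):
    """Noise scaling: f(x) = sign(x) * sqrt(|x|)."""
    return np.sign(x) * np.sqrt(np.abs(x))

def noisy_linear(x, W, b, sigma_W, sigma_b, rng):
    """y = (W + sigma_W * eps_W) x + (b + sigma_b * eps_b), factorized noise:
    one noise vector per input and one per output instead of a full matrix."""
    out_dim, in_dim = W.shape
    eps_in = f(rng.normal(size=in_dim))     # in_dim samples
    eps_out = f(rng.normal(size=out_dim))   # out_dim samples
    eps_W = np.outer(eps_out, eps_in)       # eps_{i,j} = f(eps_i) * f(eps_j)
    eps_b = eps_out
    return (W + sigma_W * eps_W) @ x + (b + sigma_b * eps_b)
```

Factorization needs only `in_dim + out_dim` random draws per forward pass instead of `in_dim * out_dim`, which is the point of the trick; setting the sigmas to zero recovers an ordinary linear layer.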

March 10, 2026 · 4 min · 642 words · codefrydev

Chapter 46: Maximum Entropy RL

Learning objectives: Derive or state the maximum entropy objective: maximize \(\mathbb{E}[ \sum_t r_t + \alpha \mathcal{H}(\pi(\cdot|s_t)) ]\) (or equivalent), where \(\mathcal{H}\) is entropy; explain how the entropy term encourages exploration (higher entropy means a more uniform action distribution, so the policy tries more actions); contrast with standard expected-return maximization (no entropy bonus).

Concept and real-world RL: Maximum entropy RL adds an entropy bonus to the objective so the agent maximizes return and policy entropy. The optimal policy under this objective is more stochastic (explores more) and is often easier to learn (multiple modes, robustness). In robot control, SAC (Soft Actor-Critic) uses this idea with automatic temperature tuning; in game AI and recommendation, entropy regularization (e.g. in PPO) prevents the policy from becoming too deterministic too fast. The temperature \(\alpha\) (or equivalent) controls the trade-off between return and entropy. ...
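The objective can be computed directly for a trajectory once you have the per-state action distributions; a minimal sketch (function names are illustrative):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy H(pi) = -sum_a pi(a) log pi(a), in nats."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                     # 0 * log 0 = 0 by convention
    return float(-(p * np.log(p)).sum())

def soft_return(rewards, policies, alpha=0.2):
    """Entropy-augmented return: sum_t [ r_t + alpha * H(pi(.|s_t)) ]."""
    return sum(r + alpha * entropy(pi) for r, pi in zip(rewards, policies))
```

With \(\alpha = 0\) this reduces to the ordinary return; raising \(\alpha\) rewards trajectories whose policies stay closer to uniform, which is exactly the return-vs-entropy trade-off the temperature controls.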

March 10, 2026 · 3 min · 500 words · codefrydev

Chapter 61: The Hard Exploration Problem

Learning objectives: Run DQN with ε-greedy on a sparse-reward environment (e.g. Montezuma’s Revenge if available, or a simple maze); observe that the agent rarely discovers the first key (or goal) when rewards are sparse; explain why sparse rewards cause failure: there is no learning signal until the goal is reached, and random exploration is unlikely to reach it.

Concept and real-world RL: Hard exploration occurs when the reward is sparse (e.g. only at the goal): the agent gets no signal until it accidentally reaches the goal, which may require a long, specific sequence of actions. In game AI (Montezuma’s Revenge, Pitfall), ε-greedy DQN fails because random exploration almost never finds the key. In robot navigation and recommendation, sparse rewards (e.g. “user clicked” or “reached goal”) similarly make learning slow. This motivates intrinsic motivation, curiosity, and hierarchical methods. ...
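The "random exploration almost never gets there" claim is easy to demonstrate on a toy chain where the goal requires a specific action at every step; this simulation is an illustrative sketch, not from the post:

```python
import numpy as np

def random_success_rate(chain_len, episodes=2000, seed=0):
    """Fraction of purely random episodes that reach the goal at the end of a
    1-D chain, where one wrong action (prob 0.5 per step) ends the attempt."""
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(episodes):
        pos = 0
        for _ in range(chain_len):
            if rng.random() < 0.5:   # the single correct action
                pos += 1
            else:                    # any wrong action wastes the episode
                break
        wins += (pos == chain_len)
    return wins / episodes
```

The success probability is \(0.5^L\) for chain length \(L\), so a 10-step goal is found in roughly 1 episode in 1000: with no reward until then, a value-based learner gets essentially no gradient signal.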

March 10, 2026 · 3 min · 489 words · codefrydev

Chapter 62: Intrinsic Motivation

Learning objectives: Design an intrinsic reward based on state-visitation counts: bonus = \(1/\sqrt{\text{count}}\) (or similar), so rarely visited states are more attractive; implement an agent that uses total reward = extrinsic + intrinsic and compare its exploration behavior (e.g. coverage of the state space) with an agent that uses only extrinsic reward; relate this to curiosity and exploration in game AI and robot navigation.

Concept and real-world RL: Intrinsic motivation gives the agent a bonus for visiting novel or surprising states, so it explores even when extrinsic reward is sparse. The count-based bonus \(1/\sqrt{N(s)}\) (inverse square root of visit count) encourages visiting states that have been seen fewer times. In game AI and robot navigation, this can help discover the goal; in recommendation, novelty bonuses encourage diversity. The combination extrinsic + intrinsic balances exploitation (reward) and exploration (novelty). ...
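The count-based bonus is a few lines for tabular states; a minimal sketch (class name and the `beta` scale are illustrative):

```python
import math
from collections import Counter

class CountBonus:
    """Count-based exploration bonus: r_int(s) = beta / sqrt(N(s))."""

    def __init__(self, beta=1.0):
        self.counts = Counter()   # N(s), keyed by any hashable state
        self.beta = beta

    def bonus(self, state):
        """Record a visit to `state` and return its intrinsic reward."""
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])
```

During training the agent would optimize `r_extrinsic + cb.bonus(s)`; the bonus decays as \(1/\sqrt{N(s)}\), so novelty dominates early and fades as a state becomes familiar.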

March 10, 2026 · 3 min · 487 words · codefrydev

Chapter 66: Go-Explore Algorithm

Learning objectives: Implement a simplified Go-Explore with an archive of promising states and a strategy to return to them and explore further; explain the two-phase idea: (1) archive states that lead to high rewards or novelty, (2) select from the archive, return to that state, then take exploratory actions; compare Go-Explore with random exploration (e.g. episodes to reach the goal, or maximum reward reached) on a deterministic maze; identify why “return” (resetting to an archived state) helps in hard exploration compared to always starting from the initial state; relate Go-Explore to game AI (e.g. Montezuma’s Revenge) and robot navigation with sparse goals.

Concept and real-world RL ...
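The return-then-explore loop can be sketched on a deterministic empty grid; the archive here just stores reached cells (a real implementation also stores how to reach them), and all names and defaults are illustrative:

```python
import numpy as np

def go_explore(size=8, iters=300, explore_steps=10, seed=0):
    """Simplified Go-Explore on a deterministic grid: archive reached cells,
    'return' to a random archived cell, then 'explore' with random actions."""
    rng = np.random.default_rng(seed)
    goal = (size - 1, size - 1)
    archive = {(0, 0)}                          # cells we know how to reach
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    for _ in range(iters):
        cells = sorted(archive)
        pos = cells[rng.integers(len(cells))]   # return: reset to archived state
        for _ in range(explore_steps):          # explore from the frontier
            dx, dy = moves[rng.integers(4)]
            pos = (min(max(pos[0] + dx, 0), size - 1),
                   min(max(pos[1] + dy, 0), size - 1))
            archive.add(pos)                    # archive every cell reached
            if pos == goal:
                return True, len(archive)
    return False, len(archive)
```

The key difference from plain random exploration is the reset: each burst of random actions starts from the archive's frontier rather than from the initial state, so progress compounds instead of being re-rolled every episode.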

March 10, 2026 · 4 min · 754 words · codefrydev

RL Framework

This page covers the core RL framework you need for the preliminary assessment: the four main components, the Markov property, exploration vs exploitation, and the discount factor.

Why this matters for RL: Every RL problem is defined by who acts (agent), what they interact with (environment), what they observe (state), what they can do (actions), and what feedback they get (reward). The Markov property and the discount factor shape how we define value functions and algorithms. Exploration vs exploitation is the central tension in learning from experience. ...
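The discount factor's role is easiest to see by computing a discounted return directly; a minimal sketch (function name is illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..., folded right-to-left
    using G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With \(\gamma < 1\) distant rewards are geometrically down-weighted, which is what keeps infinite-horizon values finite and what every value function in later chapters is built on.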

March 10, 2026 · 6 min · 1198 words · codefrydev