Exploration


10-armed bandit testbed: ε-greedy vs. purely greedy action selection.
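The comparison above can be sketched in a few lines. This is a minimal version of the classic testbed: true arm values drawn from N(0, 1), noisy rewards, sample-average value estimates; the number of arms, steps, and seeds are illustrative choices, not prescribed by the lesson.

```python
import numpy as np

def run_bandit(epsilon, steps=1000, k=10, seed=0):
    """One k-armed bandit run with epsilon-greedy selection.

    True values q* ~ N(0, 1); reward for arm a is N(q*[a], 1).
    Returns the average reward over all steps.
    """
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0, 1, k)        # hidden true action values
    q_est = np.zeros(k)                 # sample-average estimates
    counts = np.zeros(k)
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))    # explore: random arm
        else:
            a = int(np.argmax(q_est))   # exploit: current best arm
        r = rng.normal(q_true[a], 1.0)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]  # incremental mean update
        total += r
    return total / steps

# Average over 50 independent bandit problems
greedy = float(np.mean([run_bandit(0.0, seed=s) for s in range(50)]))
eps01 = float(np.mean([run_bandit(0.1, seed=s) for s in range(50)]))
```

Averaged over many problems, ε = 0.1 earns more reward than pure greedy, which tends to lock onto the first arm that looks good.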

Using optimistic initial Q-values to encourage early exploration in multi-armed bandits.
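A small sketch of the optimistic-initialization idea: starting every estimate well above any plausible reward makes a purely greedy agent try each arm, because each pull drags that arm's estimate down below the untried ones. The initial value 5.0 and step size 0.1 are illustrative.

```python
import numpy as np

def bandit(q0, epsilon, steps=1000, k=10, alpha=0.1, seed=0):
    """Bandit run with initial estimates q0 and constant step size alpha.

    Returns (average reward, number of distinct arms tried in the
    first k steps) -- the latter shows the early-exploration effect.
    """
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0, 1, k)
    q_est = np.full(k, float(q0))       # optimistic when q0 >> true values
    pulled = []
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))
        else:
            a = int(np.argmax(q_est))
        pulled.append(a)
        r = rng.normal(q_true[a], 1.0)
        q_est[a] += alpha * (r - q_est[a])  # each pull lowers an inflated estimate
        total += r
    return total / steps, len(set(pulled[:k]))

avg_opt, tried_opt = bandit(q0=5.0, epsilon=0.0)   # optimistic, greedy
avg_gre, tried_gre = bandit(q0=0.0, epsilon=0.0)   # realistic, greedy
```

The optimistic greedy agent samples all ten arms within its first ten pulls; the realistic greedy agent usually does not.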

Upper Confidence Bound (UCB1) algorithm for multi-armed bandits: balancing exploration and exploitation via uncertainty estimates.
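UCB1 replaces random exploration with a deterministic bonus: pull each arm once, then always pick the arm maximizing mean + sqrt(c·ln t / n). A minimal sketch on Bernoulli arms; the success probabilities and c = 2 are illustrative assumptions.

```python
import math
import random

def ucb1(reward_fn, k, steps, c=2.0, seed=0):
    """UCB1 over k arms; reward_fn(arm, rng) returns a reward in [0, 1]."""
    rng = random.Random(seed)
    counts = [0] * k
    means = [0.0] * k
    for t in range(1, steps + 1):
        if t <= k:
            a = t - 1  # initialization: pull each arm once
        else:
            # uncertainty bonus shrinks as an arm is pulled more often
            a = max(range(k),
                    key=lambda i: means[i] + math.sqrt(c * math.log(t) / counts[i]))
        r = reward_fn(a, rng)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
    return counts, means

# Three Bernoulli arms with (assumed) success probabilities 0.2, 0.5, 0.8
probs = [0.2, 0.5, 0.8]
counts, means = ucb1(lambda a, rng: 1.0 if rng.random() < probs[a] else 0.0,
                     k=3, steps=5000)
```

After 5000 steps the best arm dominates the pull counts, while the suboptimal arms are still sampled logarithmically often.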

Core RL vocabulary, with explanations: agent, environment, state, action, reward, the Markov property, the exploration-exploitation trade-off, and the discount factor.
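Of these terms, the discount factor is the one best clarified by a worked calculation: it turns a reward sequence into a single return, G = r_0 + γ·r_1 + γ²·r_2 + …. A tiny sketch:

```python
def discounted_return(rewards, gamma):
    """Compute G = sum_t gamma^t * r_t, folding right-to-left."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three rewards of 1 with gamma = 0.9: 1 + 0.9 + 0.81 = 2.71
g = discounted_return([1, 1, 1], 0.9)
```

Smaller γ makes the agent more myopic; γ close to 1 values distant rewards almost as much as immediate ones.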

Noisy linear layers with factorized Gaussian noise (NoisyNets); compared against ε-greedy.
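A NumPy sketch of a noisy linear layer with factorized Gaussian noise, following the NoisyNets construction: per-weight learnable μ and σ, with the weight noise built as an outer product of two transformed vectors (f(x) = sign(x)·√|x|) instead of a full noise matrix. The initialization constants are the paper's defaults, assumed here for illustration; a real implementation would use a deep-learning framework so μ and σ are trainable.

```python
import numpy as np

def f(x):
    """Noise transform from factorized NoisyNets: f(x) = sign(x) * sqrt(|x|)."""
    return np.sign(x) * np.sqrt(np.abs(x))

class NoisyLinear:
    """y = (mu_w + sigma_w * eps_w) @ x + (mu_b + sigma_b * eps_b),
    where eps_w = outer(f(eps_out), f(eps_in)) -- factorized Gaussian noise."""

    def __init__(self, in_dim, out_dim, sigma0=0.5, seed=0):
        rng = np.random.default_rng(seed)
        bound = 1.0 / np.sqrt(in_dim)
        self.mu_w = rng.uniform(-bound, bound, (out_dim, in_dim))
        self.mu_b = rng.uniform(-bound, bound, out_dim)
        self.sigma_w = np.full((out_dim, in_dim), sigma0 / np.sqrt(in_dim))
        self.sigma_b = np.full(out_dim, sigma0 / np.sqrt(in_dim))
        self.in_dim, self.out_dim = in_dim, out_dim
        self.rng = rng
        self.reset_noise()

    def reset_noise(self):
        # Factorized: sample in_dim + out_dim scalars, not out_dim * in_dim
        eps_in = f(self.rng.standard_normal(self.in_dim))
        eps_out = f(self.rng.standard_normal(self.out_dim))
        self.eps_w = np.outer(eps_out, eps_in)
        self.eps_b = eps_out

    def __call__(self, x, noisy=True):
        w = self.mu_w + (self.sigma_w * self.eps_w if noisy else 0.0)
        b = self.mu_b + (self.sigma_b * self.eps_b if noisy else 0.0)
        return w @ x + b

layer = NoisyLinear(4, 2)
x = np.ones(4)
y1 = layer(x)
layer.reset_noise()
y2 = layer(x)
```

Unlike ε-greedy, the exploration here lives in the weights: resampling the noise changes the whole action-value landscape, and learning can shrink σ where noise hurts.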

The maximum-entropy objective; why an entropy bonus encourages exploration.
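For a one-step problem the max-entropy objective max_π E_π[r] + α·H(π) has a closed-form solution, π(a) ∝ exp(r(a)/α): a Boltzmann policy with temperature α. A small sketch (the reward values are illustrative) showing that larger α keeps probability mass on near-optimal actions instead of collapsing to one:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return float(-np.sum(p * np.log(p + 1e-12)))

def soft_optimal_policy(r, alpha):
    """Maximizer of E_pi[r] + alpha * H(pi): pi(a) proportional to exp(r(a)/alpha)."""
    return softmax(np.asarray(r, dtype=float) / alpha)

r = np.array([1.0, 0.9, 0.0])
near_greedy = soft_optimal_policy(r, alpha=0.01)  # low temperature: deterministic
spread = soft_optimal_policy(r, alpha=1.0)        # high temperature: keeps exploring
```

With α = 0.01 nearly all mass sits on the best action; with α = 1.0 the almost-as-good second action keeps substantial probability, which is exactly the exploration pressure the entropy term provides.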

DQN with ε-greedy on Montezuma's Revenge; why sparse rewards defeat undirected exploration.
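In the DQN setup, the only exploration knob is the ε schedule; a full agent is out of scope here, but the annealing itself is tiny. A sketch using the Nature-DQN defaults (1.0 → 0.1 over 1M frames, assumed for illustration) -- on sparse-reward games like Montezuma's Revenge, no schedule fixes the underlying problem that random actions almost never stumble onto a reward.

```python
def epsilon_schedule(step, eps_start=1.0, eps_end=0.1, decay_steps=1_000_000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps,
    then hold it at eps_end."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

Each environment step, the agent acts randomly with probability `epsilon_schedule(step)` and greedily otherwise.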

State-visitation count bonus; exploration in a gridworld.
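A minimal illustration of the count-bonus idea: give each state an intrinsic bonus that shrinks with its visit count, here 1/√(1 + N(s)), and move greedily toward the highest-bonus neighbor. Grid size, step budget, and the bonus form are illustrative assumptions; a real agent would add the bonus to the extrinsic reward inside Q-learning rather than act on it directly.

```python
import numpy as np

def explore(size=10, steps=300, use_bonus=True, seed=0):
    """Walk a size x size grid; return the number of distinct cells visited.

    use_bonus=True: step to the neighbor with the largest count bonus
    1 / sqrt(1 + N(s')), ties broken at random. Otherwise: random walk.
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros((size, size), dtype=int)
    pos = (0, 0)
    counts[pos] = 1
    for _ in range(steps):
        x, y = pos
        nbrs = [(nx, ny)
                for nx, ny in [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
                if 0 <= nx < size and 0 <= ny < size]
        if use_bonus:
            bonus = [1.0 / np.sqrt(1 + counts[n]) for n in nbrs]
            best = max(bonus)
            choices = [n for n, b in zip(nbrs, bonus) if b == best]
            pos = choices[int(rng.integers(len(choices)))]
        else:
            pos = nbrs[int(rng.integers(len(nbrs)))]
        counts[pos] += 1
    return int((counts > 0).sum())

covered_bonus = explore(use_bonus=True)
covered_random = explore(use_bonus=False)
```

The count-driven walk systematically pushes into unvisited cells and covers far more of the grid than an undirected random walk with the same step budget.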

Simplified Go-Explore on a deterministic maze: archive promising states and return to them.
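A minimal sketch of the Go-Explore loop under the lesson's deterministic-environment assumption: keep an archive mapping each discovered cell to the shortest action sequence found to reach it, "return" by replaying that sequence exactly, then explore randomly from the frontier. The maze layout, iteration counts, and uniform archive selection are illustrative choices; the full algorithm adds cell representations, selection heuristics, and a robustification phase.

```python
import random

MOVES = {'U': (0, -1), 'D': (0, 1), 'L': (-1, 0), 'R': (1, 0)}

def step(maze, pos, action):
    """Deterministic transition: move unless the target cell is a wall '#'."""
    dx, dy = MOVES[action]
    nx, ny = pos[0] + dx, pos[1] + dy
    if 0 <= ny < len(maze) and 0 <= nx < len(maze[0]) and maze[ny][nx] != '#':
        return (nx, ny)
    return pos

def go_explore(maze, start, iters=300, explore_len=10, seed=0):
    """Archive: cell -> shortest action sequence known to reach it."""
    rng = random.Random(seed)
    archive = {start: []}
    for _ in range(iters):
        cell = rng.choice(list(archive))   # select an archived cell
        pos = start
        for a in archive[cell]:            # return: replay its actions exactly
            pos = step(maze, pos, a)
        traj = list(archive[cell])
        for _ in range(explore_len):       # explore randomly from the frontier
            a = rng.choice('UDLR')
            pos = step(maze, pos, a)
            traj.append(a)
            # archive new cells, or shorter routes to known cells
            if pos not in archive or len(traj) < len(archive[pos]):
                archive[pos] = list(traj)
    return archive

MAZE = ["....#....",
        "..#.#.#..",
        "..#...#..",
        "####.####",
        "........."]
arch = go_explore(MAZE, start=(0, 0))
```

Because the environment is deterministic, every archived trajectory reproduces its cell exactly on replay, so exploration always restarts from the frontier instead of re-solving the maze from scratch; given enough iterations the archive spreads through the single corridor into the bottom row.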

Review Volume 6 (Model-Based RL, MCTS, Dyna-Q, world models) and preview Volume 7 (Exploration — intrinsic motivation, curiosity, and sparse rewards).

Review Volume 7 (Exploration, ICM, RND, Go-Explore, Meta-RL) and preview Volume 8 (Offline RL, Imitation Learning, RLHF).