Learning objectives
- Implement count-based exploration for discrete state spaces using a hash table and a bonus such as \(1/\sqrt{N(s)}\).
- Implement pseudo-counts from a density model (e.g. PixelCNN or simpler density estimator) for image-based states.
- Explain why pseudo-counts are needed when the state space is huge or continuous (e.g. Atari frames).
- Test count-based and pseudo-count exploration on a simple Atari-style or image-based task and compare exploration coverage.
- Relate count-based and pseudo-count methods to game AI and recommendation (e.g. diversity).
Concept and real-world RL
Count-based exploration gives a bonus for visiting states that have been seen fewer times; typically bonus \(\propto 1/\sqrt{N(s)}\) so that rarely visited states are more attractive. For discrete state spaces, \(N(s)\) can be stored in a hash table. For high-dimensional or continuous states (e.g. game AI with raw pixels), exact counts are impossible, so pseudo-counts are used: a density model is trained on visited states, and a pseudo-count is derived from the model’s prediction (e.g. via the change in density after adding the state). In recommendation, similar diversity bonuses encourage exploring under-exposed items. This chapter ties tabular count-based ideas to deep RL with images.
Where you see this in practice: Count-based exploration in MDPs; pseudo-counts in Atari (e.g. Bellemare et al.); diversity bonuses in bandits and recommenders.
Illustration (pseudo-count bonus): Count-based exploration gives higher bonus to less-visited states. The chart below shows exploration bonus (e.g. \(1/\sqrt{N}\)) for different visit counts.
Exercise: For a discrete state space, implement count-based exploration using a hash table. Use pseudo-counts derived from a density model (e.g., a PixelCNN) for image-based states. Test on a simple Atari game.
Professor’s hints
- Discrete: Maintain a dict or array for \(N(s)\); after each step, bonus = \(1/\sqrt{1+N(s)}\), then \(N(s) \leftarrow N(s)+1\). Use a hash (e.g. state tuple or flattened array) if the state is a grid or finite set.
- Pseudo-counts for images: A simple option is a kernel density estimator on a lower-dimensional embedding (e.g. from an autoencoder); or use a PixelCNN (or smaller density model) and derive pseudo-count from the model’s predicted probability (see “Unifying count-based exploration” style papers). Start with a small input size (e.g. 42×42 grayscale) if using PixelCNN.
- For “simple Atari game,” use a minimal env (e.g. one Atari game with frame stacking or a small custom image-based maze) so that training is feasible.
- Combine the exploration bonus with DQN or a policy-gradient method; add bonus to the reward before learning.
Common pitfalls
- Hash collisions: In discrete spaces with a hash, ensure the hash is consistent and that different states do not collide too often; use a proper hash of the state.
- Pseudo-count implementation: Pseudo-count formulas depend on the density model; use a standard derivation (e.g. based on learning progress or density ratio) to avoid ad hoc scaling.
- Computational cost: PixelCNN or large density models can be slow; start with a small model or lower-resolution frames.
Worked solution (warm-up: density-based bonus)
Key idea: Fit a density model \(p(s)\) (e.g. Gaussian mixture, flow, or PixelCNN on images) on visited states. Intrinsic reward can be \(-\log p(s)\) (surprise) or a pseudo-count derived from the density. Novel states have low \(p(s)\) and get high bonus. This generalizes to continuous and high-dimensional state spaces; the cost is training the density model.
Extra practice
- Warm-up: For a state visited \(N\) times, the bonus is often \(1/\sqrt{N}\). Why use the square root instead of \(1/N\)? (Think about the relative bonus for going from 1 to 2 visits vs 100 to 101.)
- Coding: Implement hash-table count-based exploration on a 5×5 gridworld with 4 actions. Plot “unique states visited” and “episodes to reach goal” over 5000 episodes. Compare with ε-greedy (ε=0.1).
- Challenge: Implement a simple pseudo-count using a small autoencoder: train it on visited states and use reconstruction error (or density in latent space) as a proxy for novelty. Use this as the exploration bonus in a simple image-based env and compare with RND.