Learning objectives
- Implement count-based exploration for discrete state spaces using a hash table and a bonus such as \(1/\sqrt{N(s)}\).
- Implement pseudo-counts from a density model (e.g. PixelCNN or simpler density estimator) for image-based states.
- Explain why pseudo-counts are needed when the state space is huge or continuous (e.g. Atari frames).
- Test count-based and pseudo-count exploration on a simple Atari-style or image-based task and compare exploration coverage.
- Relate count-based and pseudo-count methods to game AI and recommendation (e.g. diversity).
Concept and real-world RL
Count-based exploration gives a bonus for visiting states that have been seen fewer times; typically bonus \(\propto 1/\sqrt{N(s)}\) so that rarely visited states are more attractive. For discrete state spaces, \(N(s)\) can be stored in a hash table. For high-dimensional or continuous states (e.g. game AI with raw pixels), exact counts are impossible, so pseudo-counts are used: a density model is trained on visited states, and a pseudo-count is derived from the model’s prediction (e.g. via the change in density after adding the state). In recommendation, similar diversity bonuses encourage exploring under-exposed items. This chapter ties tabular count-based ideas to deep RL with images.
Where you see this in practice: Count-based exploration in MDPs; pseudo-counts in Atari (e.g. Bellemare et al.); diversity bonuses in bandits and recommenders.
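The pseudo-count construction described above (density of a state before vs. after training on it, in the style of Bellemare et al.'s pseudo-count paper) can be sketched in a few lines. Treat this as a minimal illustration, not a full implementation; `pseudo_count`, `exploration_bonus`, and `beta` are illustrative names:

```python
import math

def pseudo_count(rho_before, rho_after):
    """Pseudo-count for a state s from a density model:
    rho_before = model's density of s before training on s,
    rho_after  = density after one update on s.
    Solving rho = N/n and rho' = (N+1)/(n+1) for N gives the
    Bellemare et al. (2016) formula below."""
    gain = rho_after - rho_before
    if gain <= 0:
        # No prediction gain: the model already knows this state well.
        return float("inf")
    return rho_before * (1.0 - rho_after) / gain

def exploration_bonus(rho_before, rho_after, beta=0.05):
    """Bonus proportional to 1/sqrt(pseudo-count), as in the text."""
    return beta / math.sqrt(pseudo_count(rho_before, rho_after) + 1e-8)
```

A rarely seen state produces a large jump in density (small pseudo-count, large bonus); a familiar state barely moves the model (huge pseudo-count, near-zero bonus).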
Illustration (pseudo-count bonus): Count-based exploration gives higher bonus to less-visited states. The chart below shows exploration bonus (e.g. \(1/\sqrt{N}\)) for different visit counts.
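The values such a chart plots can be generated directly, assuming the \(1/\sqrt{N}\) bonus from the text:

```python
import math

# Bonus decays slowly: halving takes a 4x increase in visits.
for n in [1, 2, 5, 10, 100, 1000]:
    print(f"N={n:5d}  bonus={1 / math.sqrt(n):.3f}")
```

Note how the drop from \(N=1\) to \(N=2\) is far larger than from \(N=100\) to \(N=101\): new states are strongly favored while moderately visited states still retain some attraction.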
Exercise: For a discrete state space, implement count-based exploration using a hash table. Use pseudo-counts derived from a density model (e.g., a PixelCNN) for image-based states. Test on a simple Atari game.
Professor’s hints
- Discrete: Maintain a dict or array for \(N(s)\); after each step, bonus = \(1/\sqrt{1+N(s)}\), then \(N(s) \leftarrow N(s)+1\). Use a hash (e.g. state tuple or flattened array) if the state is a grid or finite set.
- Pseudo-counts for images: A simple option is a kernel density estimator on a lower-dimensional embedding (e.g. from an autoencoder); or use a PixelCNN (or smaller density model) and derive pseudo-count from the model’s predicted probability (see “Unifying count-based exploration” style papers). Start with a small input size (e.g. 42×42 grayscale) if using PixelCNN.
- For “simple Atari game,” use a minimal env (e.g. one Atari game with frame stacking or a small custom image-based maze) so that training is feasible.
- Combine the exploration bonus with DQN or a policy-gradient method; add bonus to the reward before learning.
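The first and last hints combined might look like this minimal sketch. `CountBonus` and `beta` are illustrative names, and states are assumed to already be hashable tuples (e.g. grid coordinates):

```python
import math
from collections import defaultdict

class CountBonus:
    """Visit counts N(s) in a hash table; bonus = beta / sqrt(1 + N(s))."""
    def __init__(self, beta=1.0):
        self.beta = beta
        self.counts = defaultdict(int)  # state tuple -> visit count

    def __call__(self, state):
        b = self.beta / math.sqrt(1 + self.counts[state])  # bonus first...
        self.counts[state] += 1                            # ...then increment
        return b

# Combine with any learner by shaping the reward before the update:
#   shaped_reward = env_reward + count_bonus(state)
```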
Common pitfalls
- Hash collisions: In discrete spaces with a hash, ensure the hash is consistent and that different states do not collide too often; use a proper hash of the state.
- Pseudo-count implementation: Pseudo-count formulas depend on the density model; use a standard derivation (e.g. based on learning progress or density ratio) to avoid ad hoc scaling.
- Computational cost: PixelCNN or large density models can be slow; start with a small model or lower-resolution frames.
Worked solution (warm-up: density-based bonus)
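One minimal solution, assuming low-dimensional states (e.g. an embedding from an autoencoder, per the hints): fit a Gaussian kernel density estimate over visited states and let density times memory size act as a soft visit count. The class and parameter names are illustrative, and storing every state is only feasible on small tasks:

```python
import math

class DensityBonus:
    """Exploration bonus from a Gaussian KDE over visited states.
    density(s) * len(memory) behaves like a soft visit count N(s)."""
    def __init__(self, bandwidth=0.5, beta=1.0):
        self.h = bandwidth
        self.beta = beta
        self.memory = []  # all visited states (small tasks only)

    def density(self, state):
        if not self.memory:
            return 0.0
        # Average Gaussian kernel between `state` and each stored state.
        total = sum(
            math.exp(-sum((a - b) ** 2 for a, b in zip(state, m))
                     / (2 * self.h ** 2))
            for m in self.memory)
        return total / len(self.memory)

    def __call__(self, state):
        n_hat = self.density(state) * len(self.memory)  # soft count
        b = self.beta / math.sqrt(1.0 + n_hat)
        self.memory.append(tuple(state))
        return b
```

Revisiting the same state shrinks the bonus exactly as a hash-table count would, while nearby (but not identical) states also lose some novelty, which is the point of a density model.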
Extra practice
- Warm-up: For a state visited \(N\) times, the bonus is often \(1/\sqrt{N}\). Why use the square root instead of \(1/N\)? (Think about the relative bonus for going from 1 to 2 visits vs 100 to 101.)
- Coding: Implement hash-table count-based exploration on a 5×5 gridworld with 4 actions. Plot “unique states visited” and “episodes to reach goal” over 5000 episodes. Compare with ε-greedy (ε=0.1).
- Challenge: Implement a simple pseudo-count using a small autoencoder: train it on visited states and use reconstruction error (or density in latent space) as a proxy for novelty. Use this as the exploration bonus in a simple image-based env and compare with RND.
- Variant: Use a coarser hash (e.g. discretize continuous state to 5 bins per dimension vs 20 bins). How does hash granularity affect the exploration bonus quality? At what granularity does the bonus become useless?
- Debug: A student implements a hash-based count but uses a mutable list as the dictionary key instead of a tuple. In Python, this raises a TypeError at runtime. Fix the hashing and explain why states must be converted to an immutable, hashable type.
- Conceptual: Intrinsic motivation can cause an agent to keep exploring even after it has found the optimal policy. Describe two strategies for decaying the intrinsic reward over training so the agent eventually commits to the task reward.
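For the discretization variant, a coarse hash might look like the sketch below (`coarse_key` is an illustrative name); each continuous dimension is mapped to one of `bins` equal-width bins:

```python
def coarse_key(state, low, high, bins):
    """Discretize each dimension of a continuous state into `bins`
    equal-width bins over [low, high], returning a hashable tuple."""
    key = []
    for x, lo, hi in zip(state, low, high):
        idx = int((x - lo) / (hi - lo) * bins)
        key.append(min(max(idx, 0), bins - 1))  # clamp boundary values
    return tuple(key)
```

With too few bins, distinct states collide and the bonus stops distinguishing novel from familiar regions; with too many, every state looks permanently novel and the bonus never decays.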