Chapter 62: Intrinsic Motivation

Learning objectives

Design an intrinsic reward based on state visitation count: bonus = \(1/\sqrt{\text{count}}\) (or similar) so rarely visited states are more attractive.
Implement an agent that uses total reward = extrinsic + intrinsic and compare exploration behavior (e.g. coverage of the state space) with an agent that uses only extrinsic reward.
Relate to curiosity and exploration in game AI and robot navigation.

Concept and real-world RL

Intrinsic motivation gives the agent a bonus for visiting novel or surprising states, so it explores even when extrinsic reward is sparse. Count-based bonus \(1/\sqrt{N(s)}\) (inverse square root of visit count) encourages visiting states that have been seen fewer times. In game AI and robot navigation, this can help discover the goal; in recommendation, novelty bonuses encourage diversity. The combination extrinsic + intrinsic balances exploitation (reward) and exploration (novelty).

Where you see this in practice: Count-based exploration in MDPs; pseudo-counts in Atari (e.g. Bellemare et al.); curiosity modules.

Illustration (intrinsic bonus): The bonus \(1/\sqrt{\text{count}}\) is high for rarely visited states. The chart below shows intrinsic reward for states with different visit counts.

Exercise: Design an intrinsic reward based on state visitation count. For a simple gridworld, count how many times each state is visited and give a bonus = \(1/\sqrt{\text{count}}\). Implement an agent that uses total reward = extrinsic + intrinsic. Compare exploration behavior.

Professor’s hints

Maintain a dict or array N[s] = visit count. After each step, intrinsic_r = 1/sqrt(N[s]) (or 1/(1+N[s])), then N[s] += 1. Total reward = extrinsic_r + beta * intrinsic_r; tune beta.
Compare: run an agent with only extrinsic reward (e.g. +1 at goal) and one with extrinsic + intrinsic. Plot “number of unique states visited” vs steps; the intrinsic agent should cover more of the grid.
Gridworld: small (e.g. 5×5) so you can visualize which cells get visited; goal in one corner.

Common pitfalls

Intrinsic reward too large: If beta is huge, the agent ignores the goal and only seeks novelty. Scale intrinsic so it is on the same order as extrinsic (or tune).
Counts in continuous state: For continuous states, use hashing or a density model (pseudo-count); this exercise uses discrete grid.

Worked solution (warm-up: count-based exploration)

Key idea: Count-based bonus: \(r^+_t = 1/\sqrt{N(s_t)}\) (or \(1/(1+N(s_t))\)) so the agent is rewarded for visiting states it has rarely seen. In discrete tabular settings we store \(N(s)\); in continuous we use hashing or a density model to estimate “novelty.” This encourages the agent to cover the state space and can help in sparse-reward tasks.

Extra practice

Warm-up: Why does \(1/\sqrt{N(s)}\) encourage visiting rarely visited states more than \(1/N(s)\)?
Coding: Implement count-based intrinsic reward in a 5×5 gridworld. Plot coverage (% cells visited at least once) vs steps for beta=0, 0.1, 1.0. Which explores fastest?
Challenge: Use pseudo-counts from a density model (e.g. fit a kernel density estimator on visited states). Use bonus = 1/sqrt(pseudo_count). Does it work in a continuous state space (e.g. 2D position)?
Variant: Use a finer state discretization (e.g. 10×10 bins instead of 5×5) for pseudo-counts in a 2D continuous env. How does granularity affect exploration coverage and computation?
Debug: A count-based agent achieves high coverage but final task performance is poor. The intrinsic bonus is never annealed — after the agent has fully explored, it keeps getting equal bonus for all states. What fix ensures the agent transitions from exploration to exploitation?
Conceptual: Count-based exploration works well in tabular settings but breaks in high-dimensional or continuous spaces. Explain precisely why: what property of the state space makes exact counts uninformative or impractical?