Learning objectives
- Implement RND: a fixed random target network and a predictor network that fits the target on visited states.
- Use prediction error (target output vs predictor output) as intrinsic reward for exploration.
- Explain why RND rewards novelty without learning a forward model of the environment.
- Apply RND to a hard-exploration problem (e.g. a Pitfall-style game or a sparse-reward maze) and compare it with ε-greedy or count-based exploration.
- Relate RND to game AI and robot navigation where state spaces are large and rewards sparse.
Concept and real-world RL
Random Network Distillation (RND) provides an intrinsic reward by comparing the output of a fixed, randomly initialized network (the target) with a predictor network trained to match it on visited states. Frequently visited states have low prediction error; novel states have high error and thus high intrinsic reward. Unlike ICM, RND uses neither actions nor a forward model of the dynamics: it only measures "how well can we predict this state's fingerprint?" In game AI (e.g. Atari hard-exploration games like Pitfall), RND helps agents discover new screens; in robot navigation, it can encourage visiting new regions of the state space when the goal reward is sparse.
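The target/predictor pair described above can be sketched in plain NumPy. The network sizes, learning rate, and the toy "visited vs novel" state distributions below are illustrative assumptions, not a prescribed setup:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, HID, OUT = 4, 32, 8

# Fixed random target network: weights are never updated.
W1 = rng.normal(0, 1.0, (HID, STATE_DIM))
W2 = rng.normal(0, 1.0, (OUT, HID))

def target(s):                       # s: (batch, STATE_DIM)
    return np.tanh(s @ W1.T) @ W2.T

# Predictor network, same architecture, trained to match the target.
V1 = rng.normal(0, 0.1, (HID, STATE_DIM))
V2 = rng.normal(0, 0.1, (OUT, HID))

def predictor(s):
    return np.tanh(s @ V1.T) @ V2.T

def intrinsic_reward(s):
    err = target(s) - predictor(s)
    return (err ** 2).mean(axis=1)   # per-state MSE = novelty bonus

def train_predictor(s, lr=1e-2):
    """One SGD step on MSE(target(s), predictor(s)); only V1, V2 move."""
    global V1, V2
    h = np.tanh(s @ V1.T)            # (batch, HID)
    err = h @ V2.T - target(s)       # (batch, OUT)
    gV2 = err.T @ h / len(s)
    dh = (err @ V2) * (1 - h ** 2)   # backprop through tanh
    gV1 = dh.T @ s / len(s)
    V2 -= lr * gV2
    V1 -= lr * gV1

# "Visited" states cluster near the origin; one "novel" state lies far away.
visited = rng.normal(0, 0.5, (256, STATE_DIM))
novel = np.full((1, STATE_DIM), 3.0)

for _ in range(2000):
    batch = visited[rng.integers(0, len(visited), 64)]
    train_predictor(batch)

# Visited states should now carry a smaller bonus than the novel one.
print(intrinsic_reward(visited).mean(), intrinsic_reward(novel).mean())
```

Because the target is frozen, the predictor can only reduce its error on states it has actually seen, which is exactly what makes the residual error a novelty signal.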
Where you see this in practice: RND in Atari (e.g. OpenAI); bonus-based exploration in DQN and policy-gradient methods.
Illustration (RND exploration): RND intrinsic reward is high in novel states (high prediction error). The chart below shows extrinsic vs intrinsic return over episodes (intrinsic decreases as states become familiar).
Exercise: Implement RND: a fixed random target network and a predictor network trained on visited states. Use the prediction error as intrinsic reward. Apply it to a hard exploration problem like a Pitfall-style environment.
Professor’s hints
- Target: One or more layers of a randomly initialized network (e.g. MLP or small CNN) that maps state to a fixed-dimensional vector; weights are frozen.
- Predictor: Same architecture (or similar), trained to match the target’s output on the current batch of states. Loss = MSE(target(s), predictor(s)); intrinsic reward = MSE (or its square root) so that novel states get high reward.
- If a Pitfall-style environment is not available, use a deterministic maze with a single distant goal, or a procedurally generated maze where each episode has a different layout, so that "novelty" is meaningful.
- Add the intrinsic reward to the environment reward (with a coefficient); tune so the agent both explores and still cares about the goal.
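The last hint amounts to a one-line reward shaping rule; a minimal sketch (the coefficient value 0.1 is an assumption to tune per environment):

```python
def combined_reward(r_ext, r_int, beta=0.1):
    # beta scales the novelty bonus against the task reward: too large
    # and the agent ignores the goal, too small and it never explores
    # far enough to find the goal in the first place.
    return r_ext + beta * r_int

# A sparse-reward step (r_ext = 0) in a novel state still pays off:
print(combined_reward(0.0, 2.5))   # ~0.25 from the bonus alone
print(combined_reward(1.0, 0.01))  # goal reached, bonus nearly gone
```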
Common pitfalls
- Predictor overtrains on recent states: If the predictor fits the replay buffer too well, intrinsic reward drops everywhere; use a reasonable buffer size and learning rate so the predictor generalizes to “similar” states but fails on truly new ones.
- Scaling: Raw MSE can be large or small depending on architecture; normalize or scale the intrinsic reward so it is on a similar scale to extrinsic reward.
- Continuous state spaces: RND works well with raw pixels or state vectors; for high-dimensional states, the target and predictor need enough capacity to make “novel” states distinguishable.
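A common fix for the scaling pitfall is to divide the raw prediction error by a running estimate of its standard deviation, so the bonus stays on a consistent scale whatever the architecture. The sketch below uses Welford's online algorithm; the toy reward stream is illustrative:

```python
import numpy as np

class RunningStd:
    """Running standard deviation (Welford) for normalizing intrinsic rewards."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    @property
    def std(self):
        return (self.m2 / self.n) ** 0.5 if self.n > 1 else 1.0

norm = RunningStd()
# Pretend these are raw MSEs from an RND predictor with a large output scale.
raw = np.abs(np.random.default_rng(1).normal(0, 50.0, 1000))
for r in raw:
    norm.update(r)
normalized = raw / (norm.std + 1e-8)
print(normalized.mean())  # on the order of 1 regardless of the raw scale
```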
Worked solution (warm-up: RND)
Key idea: a fixed random network maps each state to a target embedding, and a predictor network is trained to match it. Intrinsic reward \(= \|f_{\text{target}}(s) - f_{\text{pred}}(s)\|^2\). States where the predictor fails (high error) are "novel." Because the target is random and frozen, it never changes; only the predictor is trained. This gives a stable novelty signal without count-based memory.
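Written out (with \(\theta\) denoting the predictor's trainable weights and \(\mathcal{D}_{\text{visited}}\) the visited-state distribution, both notational assumptions), the same quantity serves as both the exploration bonus and the training objective:

```latex
r_{\text{int}}(s) \;=\; \bigl\|f_{\text{target}}(s) - f_{\text{pred}}(s;\theta)\bigr\|^2,
\qquad
\min_{\theta} \; \mathbb{E}_{s \sim \mathcal{D}_{\text{visited}}}
\bigl\|f_{\text{target}}(s) - f_{\text{pred}}(s;\theta)\bigr\|^2 .
```

Minimizing the expectation drives the bonus toward zero exactly on the states the agent has already covered, leaving it high elsewhere.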
Extra practice
- Warm-up: Why does a fixed random network give different outputs for different inputs? Why would a predictor trained on visited states have higher error on unvisited states?
- Coding: Implement RND on a 10×10 maze with goal at (9,9). Plot “mean intrinsic reward” and “unique states visited” over training. Compare with the same agent without RND.
- Challenge: Combine RND with count-based exploration (e.g. use both RND bonus and \(1/\sqrt{N(s)}\)). Does the combination improve over either alone on a hard exploration task?