Learning objectives
- Combine Rainbow components: Double DQN, Dueling architecture, Prioritized replay, Noisy networks, and optionally multi-step returns (and distributional RL).
- Train on a challenging environment (e.g. Pong or another Atari-style env) and compare with a baseline DQN.
- Understand which components contribute most to sample efficiency and stability.
Concept and real-world RL
Rainbow (Hessel et al., 2018) combines several DQN improvements: Double DQN (reduces overestimation), Dueling (separate value and advantage streams), PER (replays important transitions more often), Noisy nets (learned exploration via parametric noise), multi-step returns (n-step learning), and optionally C51 (distributional RL). Together they improve sample efficiency and final performance on Atari. In practice, you do not need all components for every task; CartPole may be solved with vanilla DQN, while harder games benefit from the full stack. Implementing Rainbow is a capstone for the value-approximation volume.
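The dueling decomposition mentioned above can be illustrated with a minimal sketch (numpy-only for clarity; in a real agent the value and advantage terms are two heads of the same network, and the function name here is illustrative):

```python
import numpy as np

def dueling_q(value, advantages):
    """Dueling aggregation: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a).

    Subtracting the mean advantage makes the V/A split identifiable:
    adding a constant to A and subtracting it from V leaves Q unchanged,
    so without the mean term the decomposition is not unique.
    """
    advantages = np.asarray(advantages, dtype=float)
    return value + advantages - advantages.mean()

# V(s) = 0.5, advantages over three actions; mean advantage is 1.0
q = dueling_q(0.5, [2.0, 0.0, 1.0])  # -> [1.5, -0.5, 0.5]
```

Note that the greedy action is determined entirely by the advantages; the value stream lets the network learn how good a state is without estimating every action separately.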
Illustration (Rainbow vs DQN): with the same number of environment steps, Rainbow often reaches a higher reward sooner. A typical comparison plots mean reward per 100 episodes over 1M steps (e.g. on Pong or LunarLander).
Exercise: Combine all improvements (DDQN, Dueling, PER, NoisyNet, multi-step returns, distributional RL optional) into a single Rainbow agent. Train it on a challenging environment like Pong and compare with a baseline DQN.
Professor’s hints
- Start from your best DQN (e.g. with target network and replay). Add one component at a time: first DDQN, then Dueling, then PER (with IS weights), then Noisy layers (replace \(\epsilon\)-greedy). Multi-step: use n-step returns in the target (e.g. 3-step); you will need to store n-step transitions or compute targets over n steps.
- Pong (or Atari) needs frame stacking (e.g. 4 frames) and possibly frame skip. Use a CNN if the observation is image-based. For a simpler “challenging” env, use LunarLander or a harder CartPole variant.
- Comparison: same number of env steps (e.g. 1M). Plot reward per episode (or per 100 episodes). Rainbow should reach higher performance and/or learn faster. Ablation: remove one component and see how much performance drops.
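The DDQN and multi-step hints above can be combined in a single target computation. A minimal sketch for one transition (numpy-only; the function name and argument layout are illustrative, not from a particular library):

```python
import numpy as np

def double_dqn_nstep_target(rewards, next_q_online, next_q_target,
                            gamma=0.99, done=False):
    """n-step Double DQN target for one transition.

    rewards: the n rewards r_t .. r_{t+n-1}
    next_q_online / next_q_target: Q-values at s_{t+n} from the online
    and target networks, respectively.
    """
    # n-step discounted reward sum: sum_k gamma^k * r_{t+k}
    g = sum((gamma ** k) * r for k, r in enumerate(rewards))
    if done:
        return g
    # Double DQN: the online net selects the action, the target net evaluates it
    a_star = int(np.argmax(next_q_online))
    return g + (gamma ** len(rewards)) * next_q_target[a_star]

t = double_dqn_nstep_target([1.0, 0.0, 1.0],
                            next_q_online=[0.5, 2.0],
                            next_q_target=[1.0, 1.5],
                            gamma=0.9)
# 1.0 + 0.9^2 * 1.0 + 0.9^3 * Q_target[argmax Q_online] = 1.81 + 0.729 * 1.5
```

In a full agent you would store n-step transitions in the replay buffer (e.g. via a small deque per environment) and apply this target batch-wise.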
Common pitfalls
- Weak baselines: Use a strong baseline (e.g. DQN with replay + target network, possibly + DDQN). Comparing against a very weak baseline makes Rainbow look good even when the gain comes from a single component.
- Hyperparameters: Rainbow has more hyperparameters (PER \(\alpha\) and \(\beta\), Noisy-layer initialization, n-step). Tune them or use published defaults; do not compare a heavily tuned DQN against an untuned Rainbow.
- Distributional (C51): Optional and more complex (output a distribution over returns per action, then project and minimize cross-entropy). You can skip it and still have a “Rainbow-lite” that is very effective.
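To make the PER hyperparameters \(\alpha\) and \(\beta\) concrete, a minimal sketch of the sampling probabilities and importance-sampling weights (numpy-only; the helper name is illustrative, and real implementations use a sum-tree for efficiency):

```python
import numpy as np

def per_probs_and_weights(td_errors, alpha=0.6, beta=0.4, eps=1e-6):
    """Proportional PER: P(i) ~ (|delta_i| + eps)^alpha.

    alpha controls how strongly priorities skew sampling (0 = uniform);
    beta controls how much the IS weights correct the resulting bias
    (beta is usually annealed toward 1 over training).
    """
    p = (np.abs(td_errors) + eps) ** alpha   # priorities
    probs = p / p.sum()                      # sampling distribution
    n = len(td_errors)
    w = (n * probs) ** (-beta)               # importance-sampling correction
    return probs, w / w.max()                # normalize weights to (0, 1]

probs, weights = per_probs_and_weights(np.array([2.0, 0.5, 0.1]))
```

Large TD errors get sampled more often but receive smaller loss weights, which is why forgetting the IS weights (or leaving \(\beta\) fixed at a small value) silently biases the updates.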
Worked solution (warm-up: six components of Rainbow)
The six extensions Rainbow adds on top of DQN, each with the problem it addresses:
- Double DQN: decouples action selection from action evaluation to reduce Q-value overestimation.
- Prioritized experience replay: replays transitions with large TD error more often to speed up learning.
- Dueling architecture: separates state value from action advantages, so the value of a state can be learned without trying every action.
- Multi-step returns: uses n-step targets to propagate reward information faster (at the cost of a bias/variance trade-off).
- Distributional RL (C51): learns the full return distribution instead of only its mean, giving a richer learning signal.
- Noisy networks: replaces \(\epsilon\)-greedy with learned parametric noise for exploration.
Extra practice
- Warm-up: List the six (or seven) components of Rainbow. For each, state in one sentence what problem it addresses.
- Coding: Implement a minimal “Rainbow-lite”: DQN + replay + target + Double DQN + Dueling. Train on CartPole for 20k steps. Log mean Q and episode return.
- Challenge: Ablation study: train Rainbow, then train variants with each component removed (Rainbow - DDQN, Rainbow - Dueling, etc.). Rank the components by how much removing them hurts performance.
- Variant: Add n-step returns (n=3) to your Rainbow-lite. Does multi-step help or hurt on CartPole? Try n=1 vs n=3 and compare learning speed.
- Debug: A common buggy Double DQN target computes \(\max_a Q_{\text{target}}(s', a)\) directly (the vanilla DQN target) instead of selecting the greedy action with the online network and evaluating it with the target network, reintroducing the overestimation Double DQN is meant to fix. Find and fix this bug in your own target computation.
- Conceptual: Which single Rainbow component typically provides the largest improvement over vanilla DQN on Atari, and why? (Consider PER, Double DQN, and Dueling separately.)
- Recall: State in 2–3 sentences what Rainbow is: which paper introduced it, what components it combines, and what benchmark it was evaluated on.
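For the NoisyNet-related items above, a minimal sketch of a noisy linear forward pass (numpy-only, with independent Gaussian noise; the paper's factorized variant shares per-row/per-column noise to reduce sampling cost, and the initialization here is not the published one):

```python
import numpy as np

def noisy_linear(x, w_mu, w_sigma, b_mu, b_sigma, rng):
    """Noisy linear layer: y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b).

    Exploration comes from the learned sigma parameters rather than
    epsilon-greedy action selection; sigma is trained by backprop and
    typically shrinks as the agent becomes confident.
    """
    w = w_mu + w_sigma * rng.standard_normal(w_mu.shape)
    b = b_mu + b_sigma * rng.standard_normal(b_mu.shape)
    return x @ w.T + b

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0])
# With sigma = 0 the layer reduces to a plain linear layer (no exploration)
y = noisy_linear(x, np.eye(2), np.zeros((2, 2)), np.zeros(2), np.zeros(2), rng)
# -> [1.0, 2.0]
```

When you swap \(\epsilon\)-greedy for noisy layers, remember to resample the noise each forward pass during training and to act greedily with respect to the noisy Q-values.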