Learning objectives
- Combine Rainbow components: Double DQN, Dueling architecture, Prioritized replay, Noisy networks, and optionally multi-step returns (and distributional RL).
- Train on a challenging environment (e.g. Pong or another Atari-style env) and compare with a baseline DQN.
- Understand which components contribute most to sample efficiency and stability.
Concept and real-world RL
Rainbow (Hessel et al., 2018) combines several DQN improvements: Double DQN (reduces overestimation), Dueling (separate value and advantage streams), PER (replays important transitions more often), Noisy nets (learned exploration via parametric noise), multi-step returns (n-step learning), and optionally C51 (distributional RL). Together they improve sample efficiency and final performance on Atari. In practice, you do not need all components for every task; CartPole may be solved with vanilla DQN, while harder games benefit from the full stack. Implementing Rainbow is a capstone for the value-approximation volume.
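The dueling decomposition mentioned above can be illustrated with a minimal sketch (numpy-only for clarity; in a real agent the value and advantage terms are two heads of the same network, and the function name here is illustrative):

```python
import numpy as np

def dueling_q(value, advantages):
    """Dueling aggregation: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a).

    Subtracting the mean advantage makes the V/A split identifiable:
    adding a constant to A and subtracting it from V leaves Q unchanged,
    so without the mean term the decomposition is not unique.
    """
    advantages = np.asarray(advantages, dtype=float)
    return value + advantages - advantages.mean()

# V(s) = 0.5, advantages over three actions; mean advantage is 1.0
q = dueling_q(0.5, [2.0, 0.0, 1.0])  # -> [1.5, -0.5, 0.5]
```

Note that the greedy action is determined entirely by the advantages; the value stream lets the network learn how good a state is without estimating every action separately.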
Illustration (Rainbow vs DQN): with the same number of environment steps, Rainbow often reaches a higher reward sooner. A typical comparison plots mean reward per 100 episodes over 1M steps (e.g. on Pong or LunarLander).
Exercise: Combine all improvements (DDQN, Dueling, PER, NoisyNet, multi-step returns, distributional RL optional) into a single Rainbow agent. Train it on a challenging environment like Pong and compare with a baseline DQN.
Professor’s hints
- Start from your best DQN (e.g. with target network and replay). Add one component at a time: first DDQN, then Dueling, then PER (with IS weights), then Noisy layers (replace \(\epsilon\)-greedy). Multi-step: use n-step returns in the target (e.g. 3-step); you will need to store n-step transitions or compute targets over n steps.
- Pong (or Atari) needs frame stacking (e.g. 4 frames) and possibly frame skip. Use a CNN if the observation is image-based. For a simpler “challenging” env, use LunarLander or a harder CartPole variant.
- Comparison: same number of env steps (e.g. 1M). Plot reward per episode (or per 100 episodes). Rainbow should reach higher performance and/or learn faster. Ablation: remove one component and see how much performance drops.
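The DDQN and multi-step hints above can be combined in a single target computation. A minimal sketch for one transition (numpy-only; the function name and argument layout are illustrative, not from a particular library):

```python
import numpy as np

def double_dqn_nstep_target(rewards, next_q_online, next_q_target,
                            gamma=0.99, done=False):
    """n-step Double DQN target for one transition.

    rewards: the n rewards r_t .. r_{t+n-1}
    next_q_online / next_q_target: Q-values at s_{t+n} from the online
    and target networks, respectively.
    """
    # n-step discounted reward sum: sum_k gamma^k * r_{t+k}
    g = sum((gamma ** k) * r for k, r in enumerate(rewards))
    if done:
        return g
    # Double DQN: the online net selects the action, the target net evaluates it
    a_star = int(np.argmax(next_q_online))
    return g + (gamma ** len(rewards)) * next_q_target[a_star]

t = double_dqn_nstep_target([1.0, 0.0, 1.0],
                            next_q_online=[0.5, 2.0],
                            next_q_target=[1.0, 1.5],
                            gamma=0.9)
# 1.0 + 0.9^2 * 1.0 + 0.9^3 * Q_target[argmax Q_online] = 1.81 + 0.729 * 1.5
```

In a full agent you would store n-step transitions in the replay buffer (e.g. via a small deque per environment) and apply this target batch-wise.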
Common pitfalls
- Weak baselines: Use a strong baseline (e.g. DQN with replay + target network, possibly + DDQN). Comparing against a very weak baseline makes Rainbow look good even when the gain comes from a single component.
- Hyperparameters: Rainbow has more hyperparameters (PER \(\alpha\) and \(\beta\), Noisy-layer initialization, n-step). Tune them or use published defaults; do not compare a heavily tuned DQN against an untuned Rainbow.
- Distributional (C51): Optional and more complex (output a distribution over returns per action, then project and minimize cross-entropy). You can skip it and still have a “Rainbow-lite” that is very effective.
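To make the PER hyperparameters \(\alpha\) and \(\beta\) concrete, a minimal sketch of the sampling probabilities and importance-sampling weights (numpy-only; the helper name is illustrative, and real implementations use a sum-tree for efficiency):

```python
import numpy as np

def per_probs_and_weights(td_errors, alpha=0.6, beta=0.4, eps=1e-6):
    """Proportional PER: P(i) ~ (|delta_i| + eps)^alpha.

    alpha controls how strongly priorities skew sampling (0 = uniform);
    beta controls how much the IS weights correct the resulting bias
    (beta is usually annealed toward 1 over training).
    """
    p = (np.abs(td_errors) + eps) ** alpha   # priorities
    probs = p / p.sum()                      # sampling distribution
    n = len(td_errors)
    w = (n * probs) ** (-beta)               # importance-sampling correction
    return probs, w / w.max()                # normalize weights to (0, 1]

probs, weights = per_probs_and_weights(np.array([2.0, 0.5, 0.1]))
```

Large TD errors get sampled more often but receive smaller loss weights, which is why forgetting the IS weights (or leaving \(\beta\) fixed at a small value) silently biases the updates.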
Worked solution (warm-up: six components of Rainbow)
The six extensions Rainbow adds on top of DQN, each with the problem it addresses:
- Double DQN: decouples action selection from action evaluation to reduce Q-value overestimation.
- Prioritized experience replay: replays transitions with large TD error more often to speed up learning.
- Dueling architecture: separates state value from action advantages, so the value of a state can be learned without trying every action.
- Multi-step returns: uses n-step targets to propagate reward information faster (at the cost of a bias/variance trade-off).
- Distributional RL (C51): learns the full return distribution instead of only its mean, giving a richer learning signal.
- Noisy networks: replaces \(\epsilon\)-greedy with learned parametric noise for exploration.
Extra practice
- Warm-up: List the six (or seven) components of Rainbow. For each, state in one sentence what problem it addresses.
- Coding: Implement a minimal “Rainbow-lite”: DQN + replay + target + Double DQN + Dueling. Train on CartPole for 20k steps. Log mean Q and episode return.
- Challenge: Ablation study: train Rainbow, then train variants with each component removed (Rainbow - DDQN, Rainbow - Dueling, etc.). Rank the components by how much removing them hurts performance.
- Variant: Add n-step returns (n=3) to your Rainbow-lite. Does multi-step help or hurt on CartPole? Try n=1 vs n=3 and compare learning speed.
- Debug: A common buggy Double DQN target computes \(\max_a Q_{\text{target}}(s', a)\) directly (the vanilla DQN target) instead of selecting the greedy action with the online network and evaluating it with the target network, reintroducing the overestimation Double DQN is meant to fix. Find and fix this bug in your own target computation.
- Conceptual: Which single Rainbow component typically provides the largest improvement over vanilla DQN on Atari, and why? (Consider PER, Double DQN, and Dueling separately.)
- Recall: State in 2–3 sentences what Rainbow is: which paper introduced it, what components it combines, and what benchmark it was evaluated on.
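For the NoisyNet-related items above, a minimal sketch of a noisy linear forward pass (numpy-only, with independent Gaussian noise; the paper's factorized variant shares per-row/per-column noise to reduce sampling cost, and the initialization here is not the published one):

```python
import numpy as np

def noisy_linear(x, w_mu, w_sigma, b_mu, b_sigma, rng):
    """Noisy linear layer: y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b).

    Exploration comes from the learned sigma parameters rather than
    epsilon-greedy action selection; sigma is trained by backprop and
    typically shrinks as the agent becomes confident.
    """
    w = w_mu + w_sigma * rng.standard_normal(w_mu.shape)
    b = b_mu + b_sigma * rng.standard_normal(b_mu.shape)
    return x @ w.T + b

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0])
# With sigma = 0 the layer reduces to a plain linear layer (no exploration)
y = noisy_linear(x, np.eye(2), np.zeros((2, 2)), np.zeros(2), np.zeros(2), rng)
# -> [1.0, 2.0]
```

When you swap \(\epsilon\)-greedy for noisy layers, remember to resample the noise each forward pass during training and to act greedily with respect to the noisy Q-values.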