Chapter 26: Double DQN (DDQN)

Learning objectives

- Implement Double DQN: use the online network to choose \(a^* = \arg\max_a Q_{online}(s', a)\), then use \(Q_{target}(s', a^*)\) as the TD target (instead of \(\max_a Q_{target}(s', a)\)).
- Understand why this reduces overestimation of Q-values (the max of noisy estimates is biased high).
- Compare average Q-values and reward curves with standard DQN on CartPole.

Concept and real-world RL

Standard DQN uses \(y = r + \gamma \max_{a'} Q_{target}(s', a')\). The max over noisy estimates is biased upward (overestimation), which can hurt learning. Double DQN decouples action selection from evaluation: the online network selects \(a^*\), and the target network evaluates \(Q_{target}(s', a^*)\). This reduces overestimation and often improves stability and final performance. It is a small code change and is commonly used in modern DQN variants (e.g. Rainbow). ...
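The target computation above can be sketched in a few lines of NumPy, with toy arrays standing in for the online and target networks' Q-value outputs (all names and numbers here are illustrative assumptions, not code from the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch of 4 next-states with 2 actions each; these arrays
# stand in for forward passes through the online and target networks.
q_online = rng.normal(size=(4, 2))   # Q_online(s', a)
q_target = rng.normal(size=(4, 2))   # Q_target(s', a)
rewards = np.array([1.0, 1.0, 0.0, 1.0])
dones = np.array([0.0, 0.0, 1.0, 0.0])  # 1.0 marks terminal transitions
gamma = 0.99

# Standard DQN target: max over the target network's own (noisy) estimates.
dqn_targets = rewards + gamma * (1 - dones) * q_target.max(axis=1)

# Double DQN target: the online network SELECTS a*, the target network
# EVALUATES Q_target(s', a*).
a_star = q_online.argmax(axis=1)
ddqn_targets = rewards + gamma * (1 - dones) * q_target[np.arange(4), a_star]
```

Because \(Q_{target}(s', a^*) \le \max_a Q_{target}(s', a)\) for any \(a^*\), the Double DQN target can never exceed the standard DQN target on the same batch, which is exactly the overestimation reduction the text describes.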

March 10, 2026 · 3 min · 523 words · codefrydev

Chapter 30: Rainbow DQN

Learning objectives

- Combine Rainbow components: Double DQN, Dueling architecture, Prioritized replay, Noisy networks, and optionally multi-step returns (and distributional RL).
- Train on a challenging environment (e.g. Pong or another Atari-style env) and compare with a baseline DQN.
- Understand which components contribute most to sample efficiency and stability.

Concept and real-world RL

Rainbow (Hessel et al.) combines several DQN improvements: Double DQN (reduce overestimation), Dueling (value + advantage), PER (replay important transitions), Noisy nets (state-dependent exploration), multi-step returns (n-step learning), and optionally C51 (distributional RL). Together they improve sample efficiency and final performance on Atari. In practice, you do not need all components for every task; CartPole may be solved with vanilla DQN, while harder games benefit from the full stack. Implementing Rainbow is a capstone for the value-approximation volume. ...

March 10, 2026 · 3 min · 586 words · codefrydev