Chapter 26: Double DQN (DDQN)
Learning objectives

- Implement Double DQN: use the online network to choose \(a^* = \arg\max_a Q_{online}(s', a)\), then use \(Q_{target}(s', a^*)\) as the TD target (instead of \(\max_a Q_{target}(s', a)\)).
- Understand why this reduces overestimation of Q-values (the max over noisy estimates is biased high).
- Compare average Q-values and reward curves with standard DQN on CartPole.

Concept and real-world RL

Standard DQN uses the target \(y = r + \gamma \max_{a'} Q_{target}(s', a')\). Because the Q-values are noisy estimates, taking the max over them is biased upward: if one action's value happens to be overestimated, the max picks it, and this systematic overestimation can hurt learning. Double DQN decouples action selection from evaluation: the online network selects \(a^* = \arg\max_{a'} Q_{online}(s', a')\), and the target network evaluates that choice, giving the target \(y = r + \gamma\, Q_{target}(s', a^*)\). This reduces overestimation and often improves stability and final performance. It is a small code change and is commonly used in modern DQN variants (e.g. Rainbow). A sketch of the change follows below.
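To make the change concrete, here is a minimal sketch of the Double DQN target computation in PyTorch. The function name `ddqn_targets`, the tensor layout (a batch of transitions as `rewards`, `next_states`, `dones`), and the assumption that both networks map a batch of states to a `(batch, n_actions)` tensor of Q-values are illustrative, not code from this chapter:

```python
import torch
import torch.nn as nn

def ddqn_targets(online_net: nn.Module,
                 target_net: nn.Module,
                 rewards: torch.Tensor,      # shape (B,)
                 next_states: torch.Tensor,  # shape (B, obs_dim)
                 dones: torch.Tensor,        # shape (B,), 1.0 if terminal
                 gamma: float = 0.99) -> torch.Tensor:
    """Compute y = r + gamma * Q_target(s', argmax_a Q_online(s', a))."""
    with torch.no_grad():
        # Selection: the online network picks a* = argmax_a Q_online(s', a).
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # Evaluation: the target network scores the chosen action, Q_target(s', a*).
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # Standard DQN would instead use target_net(next_states).max(dim=1).values,
        # coupling selection and evaluation in the same noisy estimator.
        return rewards + gamma * (1.0 - dones) * next_q
```

Everything else in the training loop (replay buffer, epsilon-greedy exploration, target-network syncing, the TD loss against these targets) stays exactly as in standard DQN; only the target computation changes.

...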