Chapter 27: Dueling DQN

Learning objectives

Implement the dueling architecture: shared backbone, then a value stream \(V(s)\) and an advantage stream \(A(s,a)\), with \(Q(s,a) = V(s) + (A(s,a) - \frac{1}{|A|}\sum_{a’} A(s,a’))\).
Understand why separating \(V\) and \(A\) can help when the value of the state is similar across actions (e.g. in safe states).
Compare learning speed and final performance with standard DQN on CartPole.

Concept and real-world RL

In many states, the value of being in that state is similar regardless of the action (e.g. when no danger is nearby). The dueling architecture represents \(Q(s,a) = V(s) + A(s,a)\), but to get identifiability we use \(Q(s,a) = V(s) + (A(s,a) - \frac{1}{|A|}\sum_{a’} A(s,a’))\). The network learns \(V(s)\) and \(A(s,a)\) in separate heads after a shared feature layer. This can speed up learning when the advantage (difference between actions) is small in many states. Used in Rainbow and other modern DQN variants.

Illustration (dueling vs standard DQN): Dueling architecture often reaches high reward in fewer episodes. The chart below shows typical episode return over training (CartPole).

Exercise: Implement the dueling architecture: a shared base, then two streams for value \(V(s)\) and advantage \(A(s,a)\), aggregated via \(Q(s,a) = V(s) + (A(s,a) - \frac{1}{|\mathcal{A}|}\sum_a A(s,a))\). Train it on CartPole and compare learning curves with standard DQN.

Professor’s hints

Forward: features = backbone(state); V = value_head(features) (shape batch × 1); A = advantage_head(features) (shape batch × n_actions). Then Q = V + (A - A.mean(dim=1, keepdim=True)). So \(Q(s,a) = V(s) + A(s,a) - \bar{A}(s)\).
The subtraction ensures that \(Q(s,a)\) is identifiable: if you add a constant to all \(A(s,a)\), \(Q\) is unchanged. So the network can learn \(V(s)\) without duplicating the same constant in every \(A(s,a)\).
Same training loop as DQN (replay, target, TD loss). Only the network architecture changes. Compare: plot reward per episode for dueling vs standard; dueling often converges faster.

Common pitfalls

Wrong aggregation: Use \(A - \text{mean}_a A(s,a)\), not \(A - \max_a A(s,a)\) (that is another valid formulation but the mean is standard). The mean makes the layer output interpretable as advantages.
Shape of V: \(V(s)\) is (batch, 1); broadcast to (batch, n_actions) when adding to \(A\). So Q = V + (A - A.mean(dim=1, keepdim=True)) gives (batch, n_actions).
Backbone: The shared “base” can be the same as your current DQN (e.g. two Linear+ReLU layers); then split into value_head (Linear → 1) and advantage_head (Linear → n_actions).

Worked solution (warm-up: Dueling Q in terms of V and A)

Warm-up: For 2 actions, write \(Q(s,0)\) and \(Q(s,1)\) in terms of \(V(s)\), \(A(s,0)\), \(A(s,1)\). Show \(\max_a Q(s,a) = V(s) + \max_a A(s,a) - \bar{A}\). Answer: We have \(Q(s,a) = V(s) + (A(s,a) - \bar{A})\) where \(\bar{A} = \frac{1}{2}(A(s,0)+A(s,1))\). So \(Q(s,0) = V + A(s,0) - \bar{A}\), \(Q(s,1) = V + A(s,1) - \bar{A}\). Then \(\max_a Q(s,a) = V + \max_a A(s,a) - \bar{A}\) (the constant \(-\bar{A}\) doesn’t change the argmax). The dueling architecture learns V and A separately so the value stream can be learned without per-action detail when actions don’t matter much.

Extra practice

Warm-up: For 2 actions, write \(Q(s,0)\) and \(Q(s,1)\) in terms of \(V(s)\), \(A(s,0)\), \(A(s,1)\). Show that \(\max_a Q(s,a) = V(s) + \max_a A(s,a) - \bar{A}\).
Coding: Implement a Dueling DQN head: one stream for V(s), one for A(s,a); output Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a)). Train on CartPole and plot return vs DQN.
Challenge: Try the alternative: \(Q(s,a) = V(s) + (A(s,a) - \max_{a’} A(s,a’))\). Train and compare with the mean version. Does it change learning?