Chapter 27: Dueling DQN

Learning objectives

- Implement the dueling architecture: a shared backbone, followed by a value stream \(V(s)\) and an advantage stream \(A(s,a)\), combined as \(Q(s,a) = V(s) + \left(A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a')\right)\).
- Understand why separating \(V\) and \(A\) can help when the value of a state is similar across actions (e.g. in safe states).
- Compare learning speed and final performance against standard DQN on CartPole.

Concept and real-world RL

In many states, the value of being in that state is similar regardless of which action is taken (e.g. when no danger is nearby). The dueling architecture represents \(Q(s,a) = V(s) + A(s,a)\), but this decomposition is not identifiable on its own: adding a constant to \(A\) and subtracting it from \(V\) leaves \(Q\) unchanged. To restore identifiability, we instead use \(Q(s,a) = V(s) + \left(A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a')\right)\). The network learns \(V(s)\) and \(A(s,a)\) in separate heads after a shared feature layer. This can speed up learning when the advantage gap between actions is small in many states. The dueling architecture is used in Rainbow and other modern DQN variants. ...
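As a minimal sketch of the aggregation step, the following pure-NumPy function (the name `dueling_q` is mine, not from a library) combines the two streams using the mean-subtracted formula above, and demonstrates why the subtraction helps: shifting the advantage stream by a constant no longer changes the resulting Q-values.

```python
import numpy as np

def dueling_q(v, a):
    """Combine value and advantage streams into Q-values.

    Implements Q(s,a) = V(s) + (A(s,a) - mean over a' of A(s,a')).
    v: shape (batch, 1), a: shape (batch, num_actions).
    """
    return v + (a - a.mean(axis=-1, keepdims=True))

# One state, three actions.
v = np.array([[2.0]])
a = np.array([[1.0, 0.0, -1.0]])
q = dueling_q(v, a)
print(q)  # [[3. 2. 1.]]

# Shifting A by a constant leaves Q unchanged: the mean subtraction
# removes that degree of freedom, which is the identifiability fix.
q_shifted = dueling_q(v, a + 5.0)
assert np.allclose(q, q_shifted)
```

In an actual network, `v` and `a` would be the outputs of the two heads on top of the shared feature layer; the combination itself is just this arithmetic.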

March 10, 2026 · 3 min · 577 words · codefrydev