Chapter 27: Dueling DQN

Learning objectives

- Implement the dueling architecture: a shared backbone followed by a value stream \(V(s)\) and an advantage stream \(A(s,a)\), combined as \(Q(s,a) = V(s) + (A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a'))\).
- Understand why separating \(V\) and \(A\) can help when the value of a state is similar across actions (e.g. in safe states).
- Compare learning speed and final performance with standard DQN on CartPole.

Concept and real-world RL

In many states, the value of being in that state is similar regardless of the action (e.g. when no danger is nearby). The dueling architecture represents \(Q(s,a) = V(s) + A(s,a)\), but since \(V\) and \(A\) are not uniquely determined by \(Q\), we subtract the mean advantage for identifiability: \(Q(s,a) = V(s) + (A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a'))\). The network learns \(V(s)\) and \(A(s,a)\) in separate heads after a shared feature layer. This can speed up learning when the advantage (the difference between actions) is small in many states. Used in Rainbow and other modern DQN variants. ...
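The mean-subtracted combination can be sketched in a few lines of NumPy (an illustrative sketch, not the chapter's code; the toy values for \(V\) and \(A\) are made up):

```python
import numpy as np

def dueling_q(V, A):
    """Combine state value V(s) and advantages A(s, a) into Q-values.

    Subtracting the mean advantage makes the decomposition identifiable:
    Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a')).
    """
    return V + (A - A.mean(axis=-1, keepdims=True))

# Toy example: one state, three actions (hypothetical numbers).
V = np.array([2.0])             # V(s) from the value stream
A = np.array([0.5, -0.5, 0.0])  # A(s, a) from the advantage stream
Q = dueling_q(V, A)             # mean(A) = 0, so Q = [2.5, 1.5, 2.0]
```

In a real dueling DQN, `V` and `A` would be the outputs of two small heads sharing a feature backbone; only the combination step is shown here.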

March 10, 2026 · 3 min · 577 words · codefrydev

Chapter 35: Actor-Critic Architectures

Learning objectives

- Sketch the architecture of a two-network actor-critic: an actor (policy \(\pi(a|s)\)) and a critic (value \(V(s)\) or \(Q(s,a)\)).
- Write pseudocode for the update steps using the TD error \(\delta = r + \gamma V(s') - V(s)\) as the advantage for the policy.
- Explain why the critic reduces variance compared to using Monte Carlo returns \(G_t\).

Concept and real-world RL

Actor-critic methods maintain two networks: the actor selects actions from \(\pi(a|s;\theta)\), and the critic estimates the value function \(V(s;w)\) (or \(Q\)). The TD error \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\) is a one-step estimate of the advantage; it is biased (because \(V\) is approximate) but has much lower variance than \(G_t\). The actor is updated with \(\nabla_\theta \log \pi(a_t|s_t;\theta)\,\delta_t\); the critic is updated to minimize \((r_t + \gamma V(s_{t+1}) - V(s_t))^2\). In robot control and game AI, actor-critic allows online, step-by-step updates instead of waiting for the end of the episode, which speeds up learning. ...
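The update steps can be sketched for a tabular problem, with a softmax policy over action preferences standing in for a network actor (an illustrative sketch; the step sizes `alpha_v` and `alpha_pi` are hypothetical, not the chapter's settings):

```python
import numpy as np

def actor_critic_step(V, theta, s, a, r, s_next, gamma=0.99,
                      alpha_v=0.1, alpha_pi=0.01, done=False):
    """One online actor-critic update (tabular sketch).

    V     : array of state values, V[s]        (the critic)
    theta : action preferences, theta[s, a]    (the actor; pi = softmax(theta[s]))
    The TD error delta = r + gamma * V(s') - V(s) serves as the advantage.
    """
    target = r if done else r + gamma * V[s_next]
    delta = target - V[s]              # one-step TD error
    V[s] += alpha_v * delta            # critic: reduce the squared TD error

    # Actor: policy-gradient step, grad log pi(a|s) = one_hot(a) - pi(.|s)
    pi = np.exp(theta[s] - theta[s].max())
    pi /= pi.sum()
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta[s] += alpha_pi * delta * grad_log_pi
    return delta

# One transition on a toy two-state problem (hypothetical numbers).
V = np.zeros(2)
theta = np.zeros((2, 2))
delta = actor_critic_step(V, theta, s=0, a=0, r=1.0, s_next=1, gamma=0.9)
# delta = 1 + 0.9 * 0 - 0 = 1.0; the taken action's preference increases.
```

Because each call uses only one transition `(s, a, r, s')`, the update can run after every environment step, which is the online property the excerpt highlights.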

March 10, 2026 · 3 min · 577 words · codefrydev