Chapter 23: Deep Q-Networks (DQN)

Learning objectives

Implement a full DQN: Q-network, target network, replay buffer, \(\epsilon\)-greedy exploration, and the TD loss (MSE to the target \(r + \gamma \max_{a'} Q_{target}(s',a')\)). Update the target network periodically (e.g. every 100 steps) by copying the online Q-network. Train on CartPole and plot reward per episode.

Concept and real-world RL

DQN combines a neural network for Q-values with experience replay (store transitions, sample random minibatches to break correlation) and a target network (a separate copy of the network used in the TD target, updated periodically, to stabilize learning). The agent acts \(\epsilon\)-greedily, stores \((s,a,r,s',\text{done})\) in the buffer, and repeatedly samples a batch, computes targets using the target network, and updates the online network by minimizing the MSE. DQN was the first major deep RL success (Atari) and is still a standard baseline for discrete-action tasks. ...
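The act–store–sample–update loop described above can be sketched end to end. This is a minimal, dependency-free NumPy illustration of the mechanics (replay buffer, \(\epsilon\)-greedy action selection, TD targets from a target copy, periodic hard sync); a tiny hand-rolled chain MDP stands in for CartPole and a tabular Q-table stands in for the Q-network, both assumptions made purely to keep the sketch self-contained:

```python
import random
from collections import deque

import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 5, 2
GAMMA, EPSILON, LR = 0.99, 0.1, 0.1
SYNC_EVERY, BATCH = 100, 32      # hard-update period, minibatch size

# Tabular Q-table stands in for the Q-network; the update mechanics are the same.
q_online = np.zeros((N_STATES, N_ACTIONS))
q_target = q_online.copy()
buffer = deque(maxlen=10_000)    # experience replay buffer

def act(s):
    """Epsilon-greedy over the online Q-values (random tie-breaking)."""
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    q = q_online[s]
    return int(rng.choice(np.flatnonzero(q == q.max())))

def env_step(s, a):
    """Toy chain MDP: action 1 moves right (reward 1 at the end), action 0 resets."""
    if a == 1:
        s2 = s + 1
        return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1
    return 0, 0.0, False

total_steps = 0
for episode in range(200):
    s, done = 0, False
    while not done:
        a = act(s)
        s2, r, done = env_step(s, a)
        buffer.append((s, a, r, s2, done))   # store (s, a, r, s', done)
        s = s2
        total_steps += 1

        if len(buffer) >= BATCH:
            # Sample a random minibatch to break temporal correlation.
            for bs, ba, br, bs2, bdone in random.sample(buffer, BATCH):
                # TD target uses the *target* table, not the online one.
                y = br + (0.0 if bdone else GAMMA * q_target[bs2].max())
                q_online[bs, ba] += LR * (y - q_online[bs, ba])

        if total_steps % SYNC_EVERY == 0:
            q_target = q_online.copy()       # hard update: full copy

print(q_online)
```

In a real run you would replace the Q-table with a small MLP trained by MSE on the same targets, and the chain MDP with `CartPole-v1`; the structure of the loop is unchanged.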

March 10, 2026 · 3 min · 545 words · codefrydev

Chapter 25: Target Networks

Learning objectives

Implement hard target updates: copy the online network's parameters to the target network every \(N\) steps. Implement soft target updates: \(\theta_{target} \leftarrow (1-\tau)\,\theta_{target} + \tau\,\theta_{online}\) each step (or each update). Compare the stability of Q-value estimates and the learning curves under both update rules.

Concept and real-world RL

The target network in DQN provides a stable TD target: we use \(Q_{target}(s',a')\) instead of \(Q(s',a')\) so that the target does not change every time we update the online network, which would cause moving targets and instability. Hard update: copy the full parameters every \(N\) steps (classic DQN). Soft update: slowly track the online network via \(\theta_{target} \leftarrow (1-\tau)\,\theta_{target} + \tau\,\theta_{online}\) with a small \(\tau\) (e.g. 0.001), so the target changes every step but only by a small amount, often yielding smoother learning. Both are used in practice (e.g. DDPG uses soft updates). ...
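The two update rules can be sketched side by side. This is a minimal NumPy illustration in which flat arrays stand in for the networks' parameters (in a real implementation you would loop over the parameter tensors of each layer):

```python
import numpy as np

def hard_update(target, online):
    """Classic DQN: overwrite the target parameters wholesale every N steps."""
    target[:] = online

def soft_update(target, online, tau=0.001):
    """Polyak averaging: target <- (1 - tau) * target + tau * online, every step."""
    target[:] = (1.0 - tau) * target + tau * online

online = np.array([1.0, 2.0, 3.0])   # stand-in for the online network's parameters
target = np.zeros_like(online)       # stand-in for the target network's parameters

soft_update(target, online, tau=0.1)
print(target)  # moved 10% of the way toward the online parameters: [0.1 0.2 0.3]

# Repeated soft updates track the online parameters geometrically.
for _ in range(50):
    soft_update(target, online, tau=0.1)
print(target)  # now close to the online parameters

hard_update(target, online)
print(target)  # exact copy: [1. 2. 3.]
```

With \(\tau = 0.001\) the target lags the online network with a time constant of roughly \(1/\tau\) updates, which is why soft updates give a smoothly moving target rather than the step changes of a hard copy every \(N\) steps.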

March 10, 2026 · 3 min · 596 words · codefrydev