Chapter 23: Deep Q-Networks (DQN)
Learning objectives
Implement full DQN: Q-network, target network, replay buffer, \(\epsilon\)-greedy exploration, and the TD loss (MSE to the target \(r + \gamma \max_{a'} Q_{\text{target}}(s',a')\)).
Update the target network periodically (e.g. every 100 steps) by copying the online Q-network.
Train on CartPole and plot reward per episode.

Concept and real-world RL
DQN combines a neural network for Q-values with two stabilizing components. Experience replay stores transitions and samples random minibatches, which breaks the correlation between consecutive samples. The target network is a separate copy of the network used only in the TD target and updated periodically, which stabilizes learning. The agent acts \(\epsilon\)-greedily, stores \((s,a,r,s',\text{done})\) in the buffer, and repeatedly samples a batch, computes targets using the target network, and updates the online network by minimizing the MSE. DQN was the first major deep RL success (Atari) and is still a standard baseline for discrete-action tasks. ...
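The components described above can be sketched in a few lines of plain Python. This is a minimal, framework-free illustration, not a full DQN: the Q-functions are assumed to be any callables that map a state to a list of action values, and the parameters are assumed to live in a plain dict. The names `ReplayBuffer`, `epsilon_greedy`, `td_targets`, `mse_loss`, and `sync_target` are hypothetical helpers chosen for this sketch. It also assumes the common convention that terminal transitions (`done` is true) use the reward alone as the target, with no bootstrap term.

```python
import copy
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_values, epsilon):
    """Random action with probability epsilon, otherwise the greedy action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_targets(batch, q_target, gamma):
    """Targets r + gamma * max_a' Q_target(s', a').

    Assumption of this sketch: no bootstrap term on terminal transitions.
    """
    targets = []
    for _s, _a, r, s_next, done in batch:
        bootstrap = 0.0 if done else gamma * max(q_target(s_next))
        targets.append(r + bootstrap)
    return targets


def mse_loss(batch, q_online, targets):
    """Mean squared error between Q_online(s)[a] and the fixed targets."""
    errors = [
        (q_online(s)[a] - y) ** 2
        for (s, a, _r, _s_next, _done), y in zip(batch, targets)
    ]
    return sum(errors) / len(errors)


def sync_target(step, online_params, target_params, every=100):
    """Copy online parameters into the target copy every `every` steps."""
    if step % every == 0:
        target_params.clear()
        target_params.update(copy.deepcopy(online_params))
        return True
    return False
```

In a real agent the two callables would be forward passes of the online and target networks, and `mse_loss` would be minimized by gradient descent with respect to the online parameters only; the targets are treated as constants.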