Chapter 25: Target Networks
Learning objectives

- Implement hard target updates: copy the online network's parameters to the target network every \(N\) steps.
- Implement soft target updates: \(\theta_{target} \leftarrow (1-\tau)\,\theta_{target} + \tau\,\theta_{online}\) each step (or each update).
- Compare the stability of Q-value estimates and learning curves under both update rules.

Concept and real-world RL

The target network in DQN provides a stable TD target: we use \(Q_{target}(s', a')\) instead of \(Q(s', a')\) so that the target does not change every time we update the online network. Without it, the target moves with every gradient step, which causes instability.

- Hard update: copy the full parameter set every \(N\) steps (classic DQN).
- Soft update: slowly track the online network, \(\theta_{target} \leftarrow (1-\tau)\,\theta_{target} + \tau\,\theta_{online}\), with a small \(\tau\) (e.g. 0.001). Soft updates change the target every step, but only by a small amount, often yielding smoother learning.

Both are used in practice (e.g. DDPG uses soft updates).

...
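The two update rules above can be sketched in a few lines. This is a minimal PyTorch sketch, not tied to any particular DQN implementation; the network (a small `nn.Linear`) and the function names `hard_update`/`soft_update` are illustrative choices, not names from the text.

```python
import copy

import torch
import torch.nn as nn


def hard_update(target: nn.Module, online: nn.Module) -> None:
    # Classic DQN: overwrite the target with the online parameters
    # (call this every N environment or gradient steps).
    target.load_state_dict(online.state_dict())


def soft_update(target: nn.Module, online: nn.Module, tau: float = 0.001) -> None:
    # Polyak averaging: theta_target <- (1 - tau) * theta_target + tau * theta_online,
    # applied every step with a small tau.
    with torch.no_grad():
        for t_param, o_param in zip(target.parameters(), online.parameters()):
            t_param.mul_(1.0 - tau).add_(o_param, alpha=tau)


if __name__ == "__main__":
    online = nn.Linear(4, 2)          # stand-in for the online Q-network
    target = copy.deepcopy(online)    # target starts as an exact copy

    # Pretend training changed the online network.
    with torch.no_grad():
        for p in online.parameters():
            p.add_(1.0)

    soft_update(target, online, tau=0.001)  # target drifts slightly toward online
    hard_update(target, online)             # target now matches online exactly
```

With small `tau`, the target lags the online network by an exponential moving average, which is why soft updates tend to produce smoother TD targets than periodic hard copies.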