Learning objectives
- Implement hard target updates: copy online network parameters to the target network every \(N\) steps.
- Implement soft target updates: \(\theta_{target} \leftarrow (1-\tau)\,\theta_{target} + \tau\,\theta_{online}\) each step (or each update).
- Compare stability of Q-value estimates and learning curves for both update rules.
Concept and real-world RL
The target network in DQN provides a stable TD target: we use \(Q_{target}(s',a')\) instead of \(Q(s',a')\) so that the target does not change every time we update the online network, which would cause moving targets and instability. Hard update: copy the full parameters every \(N\) steps (classic DQN). Soft update: slowly track the online network via \(\theta_{target} \leftarrow (1-\tau)\,\theta_{target} + \tau\,\theta_{online}\) with small \(\tau\) (e.g. 0.001). Soft updates change the target every step, but by a small amount, often yielding smoother learning. Both are used in practice (e.g. DDPG uses soft updates).
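The two rules above can be sketched in PyTorch (a minimal sketch; the small `nn.Sequential` network is a hypothetical stand-in for your Q-network):

```python
import copy
import torch
import torch.nn as nn

# Hypothetical small Q-network for illustration (state dim 4, two actions).
online = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
target = copy.deepcopy(online)

def hard_update(target: nn.Module, online: nn.Module) -> None:
    """Classic DQN: copy all online parameters into the target (every N steps)."""
    target.load_state_dict(online.state_dict())

@torch.no_grad()
def soft_update(target: nn.Module, online: nn.Module, tau: float = 0.001) -> None:
    """Polyak averaging: target <- (1 - tau) * target + tau * online (every step)."""
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(1.0 - tau).add_(tau * p_o)
```

In a training loop you would call `hard_update` every \(N\) gradient steps, or `soft_update` after every gradient step; never both.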
Illustration (Q-value stability): With soft updates, the target network changes gradually, so the mean Q over a batch often evolves more smoothly than with hard updates. (Chart: typical mean \(Q(s,a)\) over training steps with soft updates.)
Exercise: In your DQN implementation, compare the effect of hard updates (copy every N steps) vs. soft updates (\(\tau=0.001\) update at each step). Plot the Q-value estimates over time to see stability differences.
Professor’s hints
- Hard: every \(N\) steps (e.g. 100), call `target.load_state_dict(online.state_dict())`. Soft: after each gradient step, for each parameter pair `p_target, p_online`, do `p_target.data.copy_((1 - tau) * p_target.data + tau * p_online.data)` (loop over `zip(target.parameters(), online.parameters())`).
- Q-value estimates: log the mean (or max) of \(Q(s,a)\) over a fixed set of states (e.g. from a few random rollouts) or over the current batch. Plot this over training steps. With soft updates, the target changes gradually, so Q-values may evolve more smoothly.
- Run both variants for the same number of steps; plot reward per episode and (if you log it) Q-values. Soft often has less oscillation but may need tuning of \(\tau\).
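To see why soft updates tend to be smoother, here is a torch-free toy simulation (all numbers are illustrative, not from a real run): a scalar "online parameter" drifts a little each step, one target copies it every 100 steps, another tracks it by Polyak averaging, and we record the largest single-step jump of each target.

```python
# Toy illustration (no RL, no torch): compare how a hard-updated and a
# soft-updated target track a drifting online parameter.
def track(n_steps: int = 1000, n_hard: int = 100, tau: float = 0.01):
    online = 0.0
    hard_t, soft_t = 0.0, 0.0
    max_hard_jump, max_soft_jump = 0.0, 0.0
    for step in range(1, n_steps + 1):
        online += 0.1  # online network drifts a little each gradient step
        prev_h, prev_s = hard_t, soft_t
        if step % n_hard == 0:  # hard: full copy every n_hard steps
            hard_t = online
        soft_t = (1 - tau) * soft_t + tau * online  # soft: Polyak averaging
        max_hard_jump = max(max_hard_jump, abs(hard_t - prev_h))
        max_soft_jump = max(max_soft_jump, abs(soft_t - prev_s))
    return max_hard_jump, max_soft_jump

hard_jump, soft_jump = track()
print(f"largest hard-update jump: {hard_jump:.2f}")
print(f"largest soft-update jump: {soft_jump:.2f}")
```

The hard target sits still and then leaps by the full drift accumulated between copies, while the soft target moves every step by at most \(\tau\) times its lag; the TD targets built from it therefore change far less abruptly.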
Common pitfalls
- Soft update direction: \(\theta_{target} \leftarrow (1-\tau)\,\theta_{target} + \tau\,\theta_{online}\), so the target moves toward the online network. \(\tau\) close to 0 means the target changes slowly; \(\tau\) close to 1 means the target tracks the online network quickly.
- In-place vs new tensor: for the soft update, you must update the `target` parameters in place. Do not create a new network; copy data into the existing target parameters.
- Comparing fairly: use the same replay buffer, same \(\epsilon\), same total steps. Only the target update rule should differ.
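The in-place pitfall can be illustrated with plain Python floats (in PyTorch the analogous mistake is writing `p_t = ...` instead of `p_t.data.copy_(...)` or `p_t.mul_(...)`); this is only an analogy, not DQN code:

```python
def broken_soft_update(target_params, online_params, tau):
    for t, o in zip(target_params, online_params):
        # BUG: rebinds the local name 't'; the caller's list is unchanged.
        t = (1 - tau) * t + tau * o

def soft_update_in_place(target_params, online_params, tau):
    for i, (t, o) in enumerate(zip(target_params, online_params)):
        # Writes back into the container, so the caller sees the update.
        target_params[i] = (1 - tau) * t + tau * o
```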
Worked solution (warm-up: soft update after 1000 steps)
Warm-up: After 1000 steps with soft update \(\tau=0.001\), roughly how much of the target parameters comes from the initial target vs. the current online network? Answer: the target is updated as \(w_{target} \leftarrow (1-\tau)\, w_{target} + \tau\, w_{online}\), so each step multiplies the "old target" component by \(1-\tau = 0.999\). After 1000 steps, the initial target's contribution is \((0.999)^{1000} \approx 0.37\), so about 63% of the target comes from recent copies of the online network; the target lags, but tracks, the online network. With \(\tau=0.001\) the target changes slowly, which stabilizes learning.
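The warm-up's numbers can be checked directly; \((1-\tau)^{N}\) with \(\tau N = 1\) is close to \(e^{-1}\):

```python
import math

tau, steps = 0.001, 1000
# Per-step decay of the initial target's contribution under
# w_target <- (1 - tau) * w_target + tau * w_online.
initial_fraction = (1 - tau) ** steps
print(f"initial target contribution after {steps} steps: {initial_fraction:.3f}")
print(f"limit e^-1 for tau*steps = 1: {math.exp(-1):.3f}")
```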
Extra practice
- Warm-up: After 1000 steps with soft update \(\tau=0.001\), roughly how much of the target parameters come from the initial target vs. the current online? (The target is an exponential moving average; after many steps it is close to the online.)
- Coding: Implement the soft target update: for two PyTorch modules (online, target), do target ← (1-τ)*target + τ*online (param by param). Run 100 updates with τ=0.01 and print the L2 distance between online and target params.
- Challenge: Try \(\tau \in \{0.001, 0.01, 0.1\}\) for soft updates. Plot learning curves. Which \(\tau\) is most stable? Which learns fastest?
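A hedged, torch-free sketch of the coding task, with parameters as flat lists of floats (in PyTorch you would instead loop over `zip(target.parameters(), online.parameters())` and use `copy_`):

```python
import math
import random

random.seed(0)
tau = 0.01
online = [random.gauss(0.0, 1.0) for _ in range(100)]  # frozen online params
target = [0.0] * len(online)                           # target starts at zero

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

for step in range(1, 101):
    # Soft update: target <- (1 - tau) * target + tau * online, param by param.
    target = [(1 - tau) * t + tau * o for t, o in zip(target, online)]
    if step % 25 == 0:
        print(f"step {step:3d}: ||online - target||_2 = "
              f"{l2_distance(online, target):.4f}")
```

Because the online parameters are frozen here, the gap shrinks geometrically: after \(k\) updates the distance is exactly \((1-\tau)^k\) times the initial distance.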