Short problems for Volume 3. Aim for under 5 minutes per problem. All solutions are in collapsible sections.
Recall (R) — State definitions and rules
R1. Write the linear function approximation formula for V(s; w). What is φ(s)?
Answer
Linear FA: V(s; w) = w · φ(s) = Σ_i w_i φ_i(s).
φ(s) is the feature vector (also called the feature representation or basis function evaluation) for state s. Each element φ_i(s) is a feature — e.g. position, velocity, polynomial term, tile coding activation. The weights w are learned; φ(s) is fixed (hand-designed or learned separately).
The key property: V is linear in w (not necessarily in s).
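As a quick sketch (plain Python, hypothetical 3-feature state):

```python
def linear_value(w, phi):
    """V(s; w) = w . phi(s) for linear function approximation."""
    return sum(wi * pi for wi, pi in zip(w, phi))

w = [2.0, 1.0, -1.0]       # learned weights
phi_s = [1.0, 0.5, -0.3]   # hand-designed feature vector for state s
v = linear_value(w, phi_s) # ≈ 2.8
```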
R2. What is the semi-gradient TD(0) update for linear FA? How does it differ from true gradient descent?
Answer
Semi-gradient TD(0) update:
w ← w + α [R_{t+1} + γ V(S_{t+1}; w) − V(S_t; w)] · ∇_w V(S_t; w)
For linear FA: ∇_w V(S_t; w) = φ(S_t), so: w ← w + α δ_t φ(S_t).
Why semi-gradient? The TD target R_{t+1} + γ V(S_{t+1}; w) itself depends on w. True gradient descent would differentiate through the target too, giving an extra term. Semi-gradient stops the gradient through the target (treats it as a fixed label). This is biased but converges reliably for linear FA; true gradient TD exists but is more complex.
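A minimal sketch of the update for linear V (the target is computed with the current w but not differentiated):

```python
def td0_update(w, phi_s, r, phi_s_next, gamma, alpha):
    """One semi-gradient TD(0) step: w <- w + alpha * delta * phi(s)."""
    v_s = sum(wi * p for wi, p in zip(w, phi_s))
    v_next = sum(wi * p for wi, p in zip(w, phi_s_next))
    delta = r + gamma * v_next - v_s   # TD error; target treated as a fixed label
    return [wi + alpha * delta * p for wi, p in zip(w, phi_s)]

# Same numbers as exercise C2 below:
w_new = td0_update([2.0, 1.0, -1.0], [1.0, 0.5, -0.3], 0.0, [0.0, 1.0, 0.0], 0.9, 0.01)
# w_new ≈ [1.981, 0.9905, -0.9943]
```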
R3. What are the two key components that make DQN stable? What problem does each solve?
Answer
Replay buffer (experience replay): Stores (s, a, r, s’, done) tuples and samples random mini-batches for training. Solves: (1) correlated updates — consecutive transitions are highly correlated; random sampling breaks this, making updates more like i.i.d. supervised learning. (2) data efficiency — each experience can be reused many times.
Target network: A separate copy of the Q-network with weights θ⁻ that are updated less frequently (e.g. every C steps). The TD target uses θ⁻ instead of the current θ. Solves: moving target problem — without a frozen target, updating Q toward a target that also changes rapidly causes instability (the “chasing your own tail” problem).
R4. Write the DQN TD target y for a non-terminal transition (s, a, r, s’).
Answer
DQN TD target:
y = r + γ · max_{a’} Q(s’, a’; θ⁻)
where θ⁻ are the target network weights (frozen for C steps). For a terminal transition: y = r.
The network is trained to minimize (y − Q(s, a; θ))², using gradient descent only through Q(s,a;θ) — not through y (semi-gradient / stop-gradient on target).
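The target computation can be sketched in a few lines (function name and list-of-Q-values interface are illustrative):

```python
def dqn_target(r, gamma, q_target_next, done):
    """y = r + gamma * max_a' Q(s', a'; theta_minus); y = r if terminal."""
    if done:
        return r
    return r + gamma * max(q_target_next)

y = dqn_target(1.0, 0.99, [0.3, 0.7, 0.2, 0.5], done=False)  # ≈ 1.693
```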
R5. What is the Double DQN improvement? What bias does vanilla DQN have that Double DQN corrects?
Answer
Vanilla DQN bias: The target max_{a’} Q(s’, a’; θ⁻) uses the same network to both select the best action and evaluate its value. Because Q-values have estimation noise, taking the max over noisy estimates is a biased estimator — it overestimates Q-values (maximization bias).
Double DQN fix: Decouple selection and evaluation:
y = r + γ · Q(s’, argmax_{a’} Q(s’, a’; θ); θ⁻)
- Select the best action using the online network θ (current weights).
- Evaluate that action using the target network θ⁻.
This reduces overestimation bias, leading to more accurate value estimates and better policies.
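The decoupling can be sketched as (interface is illustrative; Q-values passed as plain lists):

```python
def double_dqn_target(r, gamma, q_online_next, q_target_next, done):
    """Select the action with the online net, evaluate it with the target net."""
    if done:
        return r
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[a_star]

# Same numbers as exercise C4 below:
y = double_dqn_target(1.0, 0.99, [0.4, 0.6, 0.8, 0.1], [0.3, 0.7, 0.2, 0.5], done=False)
# ≈ 1.198 (vanilla DQN would give ≈ 1.693)
```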
Compute (C) — Numerical exercises
C1. Compute V(s; w) for a state with feature vector φ(s) = [1, 0.5, -0.3] and weights w = [2.0, 1.0, -1.0].
Answer
V(s; w) = w · φ(s) = 2.0×1 + 1.0×0.5 + (−1.0)×(−0.3) = 2.0 + 0.5 + 0.3 = 2.8.
C2. Apply one semi-gradient TD(0) update. Current w = [2.0, 1.0, -1.0], φ(s) = [1, 0.5, -0.3], r = 0, φ(s’) = [0, 1, 0], γ = 0.9, α = 0.01.
Answer
V(s) = 2.8, V(s’) = 1.0. TD error δ = 0 + 0.9×1.0 − 2.8 = −1.9.
Δw = α × δ × φ(s) = 0.01 × (−1.9) × [1, 0.5, −0.3] = [−0.019, −0.0095, 0.0057].
w_new = [2.0−0.019, 1.0−0.0095, −1.0+0.0057] = [1.981, 0.9905, −0.9943].
C3. Compute the DQN target y for a non-terminal transition: r = 1, γ = 0.99. Target network Q-values for s’: [0.3, 0.7, 0.2, 0.5].
Answer
y = r + γ · max_{a’} Q(s’, a’; θ⁻) = 1 + 0.99 × max(0.3, 0.7, 0.2, 0.5) = 1 + 0.99 × 0.7 = 1.693.
C4. Compute the Double DQN target for the same transition. Online network Q-values for s’: [0.4, 0.6, 0.8, 0.1]. Target network Q-values for s’: [0.3, 0.7, 0.2, 0.5].
Answer
Online network selects action 2 (Q=0.8). Target network evaluates it: Q_target[2] = 0.2.
Double DQN y = 1 + 0.99 × 0.2 = 1.198 vs. vanilla DQN y = 1.693.
The vanilla DQN overestimates: it picked the action with the highest target Q-value, which is inflated by noise. Double DQN is less optimistic.
C5. Huber loss vs MSE for a TD error δ = 3.0. Compute both (use δ_clip = 1.0 for Huber).
Answer
MSE = 0.5 × 3² = 4.5. Huber (δ_clip=1) = 1 × (3 − 0.5) = 2.5.
Huber is less sensitive to large TD errors. For δ=3, MSE gives a gradient of 3 while Huber gives a gradient of 1 (clipped). This makes Huber more robust to outlier transitions in the replay buffer.
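Both losses in code (using the ½-factor convention from the answer above):

```python
def half_mse(delta):
    """Squared-error loss with the 1/2 factor: gradient is delta itself."""
    return 0.5 * delta * delta

def huber(delta, clip=1.0):
    """Huber loss: quadratic inside |delta| <= clip, linear outside."""
    if abs(delta) <= clip:
        return 0.5 * delta * delta
    return clip * (abs(delta) - 0.5 * clip)

# half_mse(3.0) == 4.5, huber(3.0) == 2.5
```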
Code (K) — Implementation
K1. Implement a replay buffer as a fixed-size deque that supports push and sample.
Solution
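One possible implementation (the original code block was lost; class and method names follow the problem statement):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of (s, a, r, s', done) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random mini-batch without replacement.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```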
deque(maxlen=capacity) automatically discards the oldest entry when the buffer is full — no manual eviction needed.
K2. Implement an epsilon decay schedule: start at ε_start=1.0, end at ε_end=0.05, decay over decay_steps steps using exponential decay.
Solution
At step 0: ε = 1.0. At step 10000 (= decay_steps): ε = 0.05. After that: clipped at ε_end.
Debug (D) — Find and fix the bug
D1. This DQN training loop has a critical bug. Find and fix it.
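The original snippet did not survive extraction; a minimal PyTorch sketch of the kind of loop in question (network sizes and variable names are hypothetical):

```python
import torch
import torch.nn as nn

# Hypothetical setup: 4-dim states, 2 actions, tiny linear Q-networks.
q_net = nn.Linear(4, 2)
target_net = nn.Linear(4, 2)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.SGD(q_net.parameters(), lr=0.01)
gamma = 0.99

# A sampled mini-batch of transitions (random stand-ins).
states = torch.randn(8, 4)
actions = torch.randint(0, 2, (8, 1))
rewards = torch.randn(8)
next_states = torch.randn(8, 4)

q_sa = q_net(states).gather(1, actions).squeeze(1)
next_q = target_net(next_states).max(1).values
targets = rewards + gamma * next_q
loss = ((targets - q_sa) ** 2).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```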
Answer
The bug: target_net(next_states) computes next_q but the target values (targets) are still part of the computational graph — gradients flow back through the target network during loss.backward(). This is wrong for two reasons: (1) it updates the target network (defeating its purpose); (2) it makes training unstable.
Fix: Use torch.no_grad() when computing the target:
The torch.no_grad() context stops gradient computation entirely for the target. Alternatively use .detach() on next_q.
D2. Find the bug in this target network update code:
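The original snippet was lost; a pure-Python stand-in exhibiting the behaviour in question (weights as plain lists, `train_step` is a dummy TD update):

```python
import random

def train_step(online):
    """Stand-in for one TD update: nudge one online weight."""
    a = random.randrange(len(online))
    online[a][0] += 0.01

online_weights = [[0.0] * 9 for _ in range(4)]   # one weight vector per action
target_weights = [row[:] for row in online_weights]

for step in range(1000):
    train_step(online_weights)
    target_weights = [row[:] for row in online_weights]  # copy every step
```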
Answer
The bug: the target network is updated every single step (no if guard). The whole point of the target network is to be a slowly-changing copy of the online network. Updating it every step makes it identical to the online network — no stability benefit.
Fix:
Now the target network is a frozen snapshot of the online network, updated every update_every steps (e.g. every 1000 steps), providing a stable regression target.
Challenge (X)
X1. Implement a minimal DQN training loop (without PyTorch — using pure Python linear networks) on the 3×3 gridworld. Use: replay buffer (capacity 1000), target network (update every 50 steps), ε-greedy (exponential decay), mini-batch size 32. Train for 2000 steps and report final policy.
Hint
- Network: weights = [[0.0]*9 for _ in range(4)] (one weight vector per action).
- Forward: Q(s,a) = dot(weights[a], state_features(s)).
- Update: delta = Q(s,a) - y; weights[a] = [w - lr * delta * phi for w, phi in zip(weights[a], phi_s)].
- ReplayBuffer: deque(maxlen=1000), push (s,a,r,s_next,done), sample random.
- Target network: separate copy of weights, updated every 50 steps by copying.
- Training loop: reset env, ε-greedy action, step, push to buffer, sample batch, compute targets, update weights, decay ε.
Expected: policy converges to arrows toward (2,2) within ~1000 steps.