Open drills notebook (interactive)

Short problems for Volume 3. Aim for under 5 minutes per problem. All solutions are in collapsible sections.


Recall (R) — State definitions and rules

R1. Write the linear function approximation formula for V(s; w). What is φ(s)?

Answer

Linear FA: V(s; w) = w · φ(s) = Σ_i w_i φ_i(s).

φ(s) is the feature vector (also called the feature representation or basis function evaluation) for state s. Each element φ_i(s) is a feature — e.g. position, velocity, polynomial term, tile coding activation. The weights w are learned; φ(s) is fixed (hand-designed or learned separately).

The key property: V is linear in w (not necessarily in s).
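As a quick sketch (using the hand-picked numbers from drill C1 below), linear FA is just a dot product:

```python
# Minimal sketch of linear function approximation: V(s; w) = w . phi(s).
# phi(s) is fixed (hand-designed); only w is learned.
def linear_value(w, phi_s):
    return sum(wi * pi for wi, pi in zip(w, phi_s))

w = [2.0, 1.0, -1.0]       # learned weights
phi_s = [1.0, 0.5, -0.3]   # feature vector for state s
print(linear_value(w, phi_s))  # -> 2.8
```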


R2. What is the semi-gradient TD(0) update for linear FA? How does it differ from true gradient descent?

Answer

Semi-gradient TD(0) update:

w ← w + α [R_{t+1} + γ V(S_{t+1}; w) − V(S_t; w)] · ∇_w V(S_t; w)

For linear FA: ∇_w V(S_t; w) = φ(S_t), so: w ← w + α δ_t φ(S_t).

Why semi-gradient? The TD target R_{t+1} + γ V(S_{t+1}; w) itself depends on w. True gradient descent would differentiate through the target too, giving an extra term. Semi-gradient stops the gradient through the target (treats it as a fixed label). This introduces bias but converges reliably for on-policy linear FA; true-gradient TD methods (e.g. GTD/TDC) exist but are more complex.
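A minimal sketch of the update rule, with toy numbers chosen for illustration (the function name and arguments are ours, not a standard API):

```python
# Semi-gradient TD(0) with linear FA. The target r + gamma * V(s')
# is computed with the current w but treated as a constant -- no
# gradient flows through it (that is the "semi" in semi-gradient).
def td0_update(w, phi_s, r, phi_s_next, gamma, alpha):
    v = lambda phi: sum(wi * p for wi, p in zip(w, phi))
    delta = r + gamma * v(phi_s_next) - v(phi_s)   # TD error
    return [wi + alpha * delta * p for wi, p in zip(w, phi_s)]

# Toy example: w = 0, reward 1, so only the feature active in s moves.
print(td0_update([0.0, 0.0], [1.0, 0.0], 1.0, [0.0, 1.0], 0.9, 0.1))  # -> [0.1, 0.0]
```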


R3. What are the two key components that make DQN stable? What problem does each solve?

Answer
  1. Replay buffer (experience replay): Stores (s, a, r, s’, done) tuples and samples random mini-batches for training. Solves: (1) correlated updates — consecutive transitions are highly correlated; random sampling breaks this, making updates more like i.i.d. supervised learning. (2) data efficiency — each experience can be reused many times.

  2. Target network: A separate copy of the Q-network with weights θ⁻ that are updated less frequently (e.g. every C steps). The TD target uses θ⁻ instead of the current θ. Solves: moving target problem — without a frozen target, updating Q toward a target that also changes rapidly causes instability (the “chasing your own tail” problem).


R4. Write the DQN TD target y for a non-terminal transition (s, a, r, s’).

Answer

DQN TD target:

y = r + γ · max_{a’} Q(s’, a’; θ⁻)

where θ⁻ are the target network weights (frozen for C steps). For a terminal transition: y = r.

The network is trained to minimize (y − Q(s, a; θ))², using gradient descent only through Q(s,a;θ) — not through y (semi-gradient / stop-gradient on target).
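A sketch of the target computation for a single transition, assuming `q_target_next` is the target network's output Q(s', ·; θ⁻) as a plain list:

```python
# DQN TD target for one transition. Terminal states have no bootstrap
# term; non-terminal states bootstrap from the frozen target network.
def dqn_target(r, q_target_next, gamma, done):
    if done:
        return r
    return r + gamma * max(q_target_next)

print(dqn_target(1.0, [0.3, 0.7, 0.2, 0.5], 0.99, done=False))  # ≈ 1.693
print(dqn_target(1.0, [0.3, 0.7, 0.2, 0.5], 0.99, done=True))   # -> 1.0
```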


R5. What is the Double DQN improvement? What bias does vanilla DQN have that Double DQN corrects?

Answer

Vanilla DQN bias: The target max_{a’} Q(s’, a’; θ⁻) uses the same network to both select the best action and evaluate its value. Because Q-values have estimation noise, taking the max over noisy estimates is a biased estimator — it overestimates Q-values (maximization bias).

Double DQN fix: Decouple selection and evaluation:

y = r + γ · Q(s’, argmax_{a’} Q(s’, a’; θ); θ⁻)

  • Select the best action using the online network θ (current weights).
  • Evaluate that action using the target network θ⁻.

This reduces overestimation bias, leading to more accurate value estimates and better policies.
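The decoupling is easy to see in code — a sketch, assuming the two networks' Q-values for s' are given as lists:

```python
# Double DQN target: the online network SELECTS the action,
# the target network EVALUATES it.
def double_dqn_target(r, q_online_next, q_target_next, gamma, done):
    if done:
        return r
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[a_star]

# Numbers from drill C4: online picks action 2, target says Q = 0.2.
print(double_dqn_target(1.0, [0.4, 0.6, 0.8, 0.1], [0.3, 0.7, 0.2, 0.5], 0.99, False))  # ≈ 1.198
```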


Compute (C) — Numerical exercises

C1. Compute V(s; w) for a state with feature vector φ(s) = [1, 0.5, -0.3] and weights w = [2.0, 1.0, -1.0].

Try it — edit and run (Shift+Enter)
Answer
V(s; w) = 2.0×1 + 1.0×0.5 + (−1.0)×(−0.3) = 2.0 + 0.5 + 0.3 = 2.8.

C2. Apply one semi-gradient TD(0) update. Current w = [2.0, 1.0, -1.0], φ(s) = [1, 0.5, -0.3], r = 0, φ(s’) = [0, 1, 0], γ = 0.9, α = 0.01.

Try it — edit and run (Shift+Enter)
Answer

V(s) = 2.8, V(s’) = 1.0. TD error δ = 0 + 0.9×1.0 − 2.8 = −1.9.

Δw = α × δ × φ(s) = 0.01 × (−1.9) × [1, 0.5, −0.3] = [−0.019, −0.0095, 0.0057].

w_new = [2.0−0.019, 1.0−0.0095, −1.0+0.0057] = [1.981, 0.9905, −0.9943].
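A one-off numeric check of this drill, spelling out the semi-gradient rule w ← w + α δ φ(s) step by step:

```python
# Verify C2: one semi-gradient TD(0) update with the given numbers.
w = [2.0, 1.0, -1.0]
phi_s, phi_s_next = [1.0, 0.5, -0.3], [0.0, 1.0, 0.0]
r, gamma, alpha = 0.0, 0.9, 0.01

v_s = sum(wi * p for wi, p in zip(w, phi_s))          # 2.8
v_next = sum(wi * p for wi, p in zip(w, phi_s_next))  # 1.0
delta = r + gamma * v_next - v_s                      # -1.9
w_new = [wi + alpha * delta * p for wi, p in zip(w, phi_s)]
print([round(x, 4) for x in w_new])  # -> [1.981, 0.9905, -0.9943]
```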


C3. Compute the DQN target y for a non-terminal transition: r = 1, γ = 0.99. Target network Q-values for s’: [0.3, 0.7, 0.2, 0.5].

Try it — edit and run (Shift+Enter)
Answer
y = 1 + 0.99 × max([0.3, 0.7, 0.2, 0.5]) = 1 + 0.99 × 0.7 = 1.693.

C4. Compute the Double DQN target for the same transition. Online network Q-values for s’: [0.4, 0.6, 0.8, 0.1]. Target network Q-values for s’: [0.3, 0.7, 0.2, 0.5].

Try it — edit and run (Shift+Enter)
Answer

Online network selects action 2 (Q=0.8). Target network evaluates it: Q_target[2] = 0.2.

Double DQN y = 1 + 0.99 × 0.2 = 1.198 vs. vanilla DQN y = 1.693.

The vanilla DQN overestimates: it picked the action with the highest target Q-value, which is inflated by noise. Double DQN is less optimistic.


C5. Huber loss vs MSE for a TD error δ = 3.0. Compute both (use δ_clip = 1.0 for Huber).

Try it — edit and run (Shift+Enter)
Answer

MSE (with the conventional ½ factor) = 0.5 × 3² = 4.5. Huber (δ_clip = 1): since |δ| > δ_clip, loss = δ_clip × (|δ| − 0.5 δ_clip) = 1 × (3 − 0.5) = 2.5.

Huber is less sensitive to large TD errors. For δ=3, MSE gives a gradient of 3 while Huber gives a gradient of 1 (clipped). This makes Huber more robust to outlier transitions in the replay buffer.
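Both losses in code (a sketch using the usual piecewise Huber form: quadratic inside δ_clip, linear outside):

```python
# Half-MSE vs Huber loss for a single TD error delta.
def half_mse(delta):
    return 0.5 * delta**2

def huber(delta, clip=1.0):
    if abs(delta) <= clip:
        return 0.5 * delta**2           # quadratic region
    return clip * (abs(delta) - 0.5 * clip)  # linear region

print(half_mse(3.0))  # -> 4.5
print(huber(3.0))     # -> 2.5
```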


Code (K) — Implementation

K1. Implement a replay buffer as a fixed-size deque that supports push and sample.

Try it — edit and run (Shift+Enter)
Solution
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

deque(maxlen=capacity) automatically discards the oldest entry when the buffer is full — no manual eviction needed.
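A quick demo of the eviction behavior (the class is redeclared here so the cell runs standalone):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
    def push(self, transition):
        self.buffer.append(transition)
    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.push((i, 0, 0.0, i + 1, False))  # (s, a, r, s_next, done)

print(len(buf))                     # -> 3 (capacity caps the size)
print([t[0] for t in buf.buffer])   # -> [2, 3, 4] (0 and 1 were evicted)
```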


K2. Implement an epsilon decay schedule: start at ε_start=1.0, end at ε_end=0.05, decay over decay_steps steps using exponential decay.

Try it — edit and run (Shift+Enter)
Solution
import math

def get_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10000):
    decay = math.exp(math.log(eps_end / eps_start) / decay_steps)
    return max(eps_end, eps_start * decay**step)

At step 0: ε=1.0. At step 10000: ε=0.05. After that: clipped at ε_end.


Debug (D) — Find and fix the bug

D1. This DQN training loop has a critical bug. Find and fix it.

def train_step(online_net, target_net, batch, gamma=0.99, optimizer=None):
    states, actions, rewards, next_states, dones = zip(*batch)

    # Compute target
    next_q = target_net(next_states)
    targets = [r + gamma * max(nq) if not d else r
               for r, nq, d in zip(rewards, next_q, dones)]

    # Compute prediction
    pred_q = online_net(states)
    pred = [pred_q[i][a] for i, a in enumerate(actions)]

    loss = mse_loss(pred, targets)
    optimizer.zero_grad()
    loss.backward()   # Bug: gradients flow through targets too!
    optimizer.step()
Answer

The bug: target_net(next_states) computes next_q but the target values (targets) are still part of the computational graph — gradients flow back through the target network during loss.backward(). This is wrong for two reasons: (1) it updates the target network (defeating its purpose); (2) it makes training unstable.

Fix: Use torch.no_grad() when computing the target:

with torch.no_grad():
    next_q = target_net(next_states)
    targets = [r + gamma * nq.max().item() if not d else r
               for r, nq, d in zip(rewards, next_q, dones)]

The torch.no_grad() context stops gradient computation entirely for the target. Alternatively use .detach() on next_q.


D2. Find the bug in this target network update code:

class DQNAgent:
    def __init__(self):
        self.online_net = QNetwork()
        self.target_net = QNetwork()
        self.step_count = 0
        self.update_every = 1000  # update target every 1000 steps

    def learn(self, batch):
        # ... compute loss, do gradient step ...
        self.step_count += 1
        # Update target network
        self.target_net.load_state_dict(self.online_net.state_dict())  # Bug!
Answer

The bug: the target network is updated every single step (no if guard). The whole point of the target network is to be a slowly-changing copy of the online network. Updating it every step makes it identical to the online network — no stability benefit.

Fix:

if self.step_count % self.update_every == 0:
    self.target_net.load_state_dict(self.online_net.state_dict())

Now the target network is a frozen snapshot of the online network, updated every update_every steps (e.g. every 1000 steps), providing a stable regression target.


Challenge (X)

X1. Implement a minimal DQN training loop (without PyTorch — using pure Python linear networks) on the 3×3 gridworld. Use: replay buffer (capacity 1000), target network (update every 50 steps), ε-greedy (exponential decay), mini-batch size 32. Train for 2000 steps and report final policy.

Try it — edit and run (Shift+Enter)
Hint
  1. Network: weights = [[0.0]*9 for _ in range(4)] (one weight vector per action).
  2. Forward: Q(s,a) = dot(weights[a], state_features(s)).
  3. Update: delta = Q(s,a) - y; weights[a] = [w - lr * delta * phi for w, phi in zip(weights[a], phi_s)].
  4. ReplayBuffer: deque(maxlen=1000), push (s,a,r,s_next,done), sample random.
  5. Target network: separate copy of weights, updated every 50 steps by copying.
  6. Training loop: reset env, ε-greedy action, step, push to buffer, sample batch, compute targets, update weights, decay ε.

Expected: policy converges to arrows toward (2,2) within ~1000 steps.