Learning objectives
- State the backpropagation algorithm as repeated application of the chain rule.
- Implement forward and backward passes for a 2-layer MLP with MSE loss in NumPy.
- Verify computed gradients against numerical finite differences.
Concept and real-world motivation
Backpropagation is the algorithm that computes how the loss changes with respect to every weight in the network. It is just the chain rule of calculus, applied efficiently from output to input. Without backpropagation, we could not train deep networks — computing gradients by finite differences would be prohibitively slow for networks with millions of parameters.
For a 2-layer network with MSE loss \(L = \frac{1}{n}\|\hat{y} - y\|^2\), the gradients are:
Output layer: \[\delta_2 = \frac{2}{n}(\hat{y} - y)\] \[\frac{\partial L}{\partial W_2} = \delta_2 \, h_1^T \quad \text{(outer product)}\] \[\frac{\partial L}{\partial b_2} = \delta_2\]
Hidden layer (propagating error back through ReLU): \[\delta_1 = (W_2^T \delta_2) \odot \text{ReLU}'(z_1)\] \[\frac{\partial L}{\partial W_1} = \delta_1 \, x^T \quad \text{(outer product)}\] \[\frac{\partial L}{\partial b_1} = \delta_1\]
where \(\text{ReLU}'(z) = \mathbb{1}[z > 0]\) (1 where the pre-activation was positive, 0 elsewhere).
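The indicator \(\mathbb{1}[z > 0]\) translates directly into a NumPy boolean mask. A minimal sketch (the sample values are arbitrary):

```python
import numpy as np

# ReLU derivative as an elementwise mask: 1 where z > 0, 0 elsewhere
# (including exactly z == 0, per the subgradient convention above).
z = np.array([-1.5, 0.0, 2.0])
relu_grad = (z > 0).astype(float)
print(relu_grad)  # [0. 0. 1.]
```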
This is exactly how DQN learns. The squared TD error \((r + \gamma \max_{a'} Q(s', a') - Q(s, a))^2\) is the loss, and backprop computes gradients of this loss with respect to every weight \(\theta\) in the Q-network. The optimizer (usually Adam) then updates \(\theta \leftarrow \theta - \alpha \nabla_\theta L\).
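The update rule \(\theta \leftarrow \theta - \alpha \nabla_\theta L\) is just a vector subtraction; a tiny sketch with made-up numbers (Adam would additionally rescale each coordinate's step):

```python
import numpy as np

theta = np.array([1.0, -2.0])   # hypothetical parameters
grad = np.array([0.5, -0.5])    # stand-in for the gradient of L w.r.t. theta
alpha = 0.1                     # learning rate

# Plain gradient descent step: move opposite the gradient.
theta = theta - alpha * grad
print(theta)  # each parameter moved against its gradient component
```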
Exercise: Implement the full forward and backward pass for a 2-layer MLP with MSE loss. Then run a numerical gradient check to verify your backprop implementation is correct.
Professor’s hints
- `delta2 = (2/n) * (y_hat - y)` — the factor of 2 comes from differentiating \((\hat{y}-y)^2\).
- `dW2 = delta2.T @ h1` — the outer product: `delta2` is (1,2), `h1` is (1,4), so `delta2.T @ h1` gives (2,4). ✓
- ReLU gradient: `relu_grad = (z1 > 0).astype(float)` — this is 1 where `z1` was positive during the forward pass.
- `delta1 = (delta2 @ W2) * relu_grad` — propagate the error back through `W2`, then mask with the ReLU gradient.
- `dW1 = delta1.T @ x` — `delta1` is (1,4), `x` is (1,3), so `delta1.T @ x` gives (4,3). ✓
Common pitfalls
- Wrong ReLU gradient condition: Using `z1 >= 0` instead of `z1 > 0`. Mathematically, at z = 0 the ReLU derivative is undefined (we use the subgradient 0 at exactly zero — this is fine in practice because floating point rarely hits exactly 0).
- Forgetting to average over the batch: If processing n examples at once, divide by n. For n = 1 this doesn't matter, but for mini-batches it does.
- Shape errors in outer products: `delta.T @ h` gives the weight gradient. Make sure the dimensions work out to `(n_out, n_in)`, matching the weight matrix shape.
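A cheap guard against the shape pitfall is to assert that each weight gradient matches its weight matrix. A sketch using the dimensions from the hints (input 3, hidden 4):

```python
import numpy as np

delta1 = np.zeros((1, 4))   # error signal at the hidden layer
x = np.zeros((1, 3))        # input row vector
W1 = np.zeros((4, 3))       # hidden-layer weights, shape (n_out, n_in)

dW1 = delta1.T @ x          # (4, 1) @ (1, 3) -> (4, 3)
assert dW1.shape == W1.shape, f"gradient shape {dW1.shape} != weight shape {W1.shape}"
```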
Worked solution
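The solution code did not survive in this copy. Below is a minimal NumPy sketch consistent with the hints above; the dimensions (input 3, hidden 4, output 2) come from the hints, while the random seed and initialization scale are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2
x = rng.normal(size=(1, n_in))
y = rng.normal(size=(1, n_out))
W1 = rng.normal(scale=0.5, size=(n_hid, n_in)); b1 = np.zeros(n_hid)
W2 = rng.normal(scale=0.5, size=(n_out, n_hid)); b2 = np.zeros(n_out)

def forward(x):
    z1 = x @ W1.T + b1           # (1, n_hid) pre-activation
    h1 = np.maximum(z1, 0.0)     # ReLU
    y_hat = h1 @ W2.T + b2       # (1, n_out)
    return z1, h1, y_hat

def loss(y_hat):
    return np.mean((y_hat - y) ** 2)   # MSE averaged over output components

# Forward, then backward pass (chain rule, output -> input).
z1, h1, y_hat = forward(x)
n = y.size
delta2 = (2.0 / n) * (y_hat - y)            # (1, n_out)
dW2 = delta2.T @ h1                          # (n_out, n_hid) outer product
db2 = delta2.sum(axis=0)
relu_grad = (z1 > 0).astype(float)           # 1 where pre-activation was positive
delta1 = (delta2 @ W2) * relu_grad           # (1, n_hid)
dW1 = delta1.T @ x                           # (n_hid, n_in) outer product
db1 = delta1.sum(axis=0)

# Numerical gradient check on W1[0, 0] via central differences.
eps = 1e-6
W1[0, 0] += eps
L_plus = loss(forward(x)[2])
W1[0, 0] -= 2 * eps
L_minus = loss(forward(x)[2])
W1[0, 0] += eps                              # restore the weight
num_grad = (L_plus - L_minus) / (2 * eps)
print(abs(num_grad - dW1[0, 0]))             # should be very small
```

With a central difference the discrepancy should be on the order of \(\varepsilon^2\) plus floating-point noise; a mismatch larger than about 1e-4 usually signals a backprop bug.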
Extra practice
- Warm-up: Implement a numerical gradient check for `W1[0,0]`. Perturb the weight by ±ε, compute loss for each, and estimate the gradient as \((L_+ - L_-) / (2\varepsilon)\). Compare to your backprop gradient.
- Coding: Implement a full training loop: run forward pass, backward pass, and gradient descent update (`W -= lr * dW`) for 100 iterations. Print the loss every 20 steps and verify it decreases.
- Challenge: Extend the backprop implementation to a 3-layer network (add a second hidden layer). Write out the gradient formulas for the new layer before implementing.
- Variant: Change the loss from MSE to mean absolute error (MAE): \(L = \frac{1}{n}\sum|y - \hat{y}|\). How does the gradient \(\delta_2\) change? Implement and compare convergence speed.
- Debug: The backprop below uses the wrong condition for the ReLU gradient — it uses `>= 0` instead of the subgradient convention `> 0`. While both work in practice, the comment in the code incorrectly claims `>= 0` is always correct. More importantly, there's a sign error in `delta2`. Find and fix both issues.
- Conceptual: Why can't we simply define the ReLU derivative to be 0 everywhere (not just at z = 0, but at every z)? This is the "dead neuron" argument — explain what happens to the weights when the ReLU gradient is always 0. The term "subgradient" refers to the convention we use at z = 0.
- Recall: Write the four backpropagation equations for a 2-layer MSE network from memory: \(\delta_2\), \(\partial L/\partial W_2\), \(\delta_1\), and \(\partial L/\partial W_1\). What does each equation compute, and why does the order (output → input) matter?