Learning objectives
- State the backpropagation algorithm as repeated application of the chain rule.
- Implement forward and backward passes for a 2-layer MLP with MSE loss in NumPy.
- Verify computed gradients against numerical finite differences.
Concept and real-world motivation
Backpropagation is the algorithm that computes how the loss changes with respect to every weight in the network. It is just the chain rule of calculus, applied efficiently from output to input. Without backpropagation, we could not train deep networks — computing gradients by finite differences would be prohibitively slow for networks with millions of parameters.
For a 2-layer network with MSE loss \(L = \frac{1}{n}\|\hat{y} - y\|^2\), the gradients are:
Output layer: \[\delta_2 = \frac{2}{n}(\hat{y} - y)\] \[\frac{\partial L}{\partial W_2} = \delta_2 \, h_1^T \quad \text{(outer product)}\] \[\frac{\partial L}{\partial b_2} = \delta_2\]
Hidden layer (propagating error back through ReLU): \[\delta_1 = (W_2^T \delta_2) \odot \text{ReLU}'(z_1)\] \[\frac{\partial L}{\partial W_1} = \delta_1 \, x^T \quad \text{(outer product)}\] \[\frac{\partial L}{\partial b_1} = \delta_1\]
where \(\text{ReLU}'(z) = \mathbb{1}[z > 0]\) (1 where the pre-activation was positive, 0 elsewhere).
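The indicator \(\mathbb{1}[z > 0]\) translates directly into a NumPy boolean mask. A minimal sketch (the sample values are arbitrary):

```python
import numpy as np

# ReLU derivative as an elementwise mask: 1 where z > 0, 0 elsewhere
# (including exactly z == 0, per the subgradient convention above).
z = np.array([-1.5, 0.0, 2.0])
relu_grad = (z > 0).astype(float)
print(relu_grad)  # [0. 0. 1.]
```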
This is exactly how DQN learns. The squared TD error \((r + \gamma \max_{a'} Q(s', a') - Q(s, a))^2\) is the loss, and backprop computes gradients of this loss with respect to every weight \(\theta\) in the Q-network. The optimizer (usually Adam) then updates \(\theta \leftarrow \theta - \alpha \nabla_\theta L\).
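The update rule \(\theta \leftarrow \theta - \alpha \nabla_\theta L\) is just a vector subtraction; a tiny sketch with made-up numbers (Adam would additionally rescale each coordinate's step):

```python
import numpy as np

theta = np.array([1.0, -2.0])   # hypothetical parameters
grad = np.array([0.5, -0.5])    # stand-in for the gradient of L w.r.t. theta
alpha = 0.1                     # learning rate

# Plain gradient descent step: move opposite the gradient.
theta = theta - alpha * grad
print(theta)  # each parameter moved against its gradient component
```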
Exercise: Implement the full forward and backward pass for a 2-layer MLP with MSE loss. Then run a numerical gradient check to verify your backprop implementation is correct.
Professor’s hints
- `delta2 = (2/n) * (y_hat - y)` — the factor of 2 comes from differentiating \((\hat{y}-y)^2\).
- `dW2 = delta2.T @ h1` — the outer product: `delta2` is (1,2), `h1` is (1,4), so `delta2.T @ h1` gives (2,4). ✓
- ReLU gradient: `relu_grad = (z1 > 0).astype(float)` — this is 1 where `z1` was positive during the forward pass.
- `delta1 = (delta2 @ W2) * relu_grad` — propagate the error back through `W2`, then mask with the ReLU gradient.
- `dW1 = delta1.T @ x` — `delta1` is (1,4), `x` is (1,3), so `delta1.T @ x` gives (4,3). ✓
Common pitfalls
- Wrong ReLU gradient condition: Using `z1 >= 0` instead of `z1 > 0`. Mathematically, at z = 0 the ReLU derivative is undefined (we use the subgradient 0 at exactly zero — this is fine in practice because floating point rarely hits exactly 0).
- Forgetting to average over the batch: If processing n examples at once, divide by n. For n = 1 this doesn't matter, but for mini-batches it does.
- Shape errors in outer products: `delta.T @ h` gives the weight gradient. Make sure the dimensions work out to `(n_out, n_in)`, matching the weight matrix shape.
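A cheap guard against the shape pitfall is to assert that each weight gradient matches its weight matrix. A sketch using the dimensions from the hints (input 3, hidden 4):

```python
import numpy as np

delta1 = np.zeros((1, 4))   # error signal at the hidden layer
x = np.zeros((1, 3))        # input row vector
W1 = np.zeros((4, 3))       # hidden-layer weights, shape (n_out, n_in)

dW1 = delta1.T @ x          # (4, 1) @ (1, 3) -> (4, 3)
assert dW1.shape == W1.shape, f"gradient shape {dW1.shape} != weight shape {W1.shape}"
```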
Worked solution
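The solution code did not survive in this copy. Below is a minimal NumPy sketch consistent with the hints above; the dimensions (input 3, hidden 4, output 2) come from the hints, while the random seed and initialization scale are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2
x = rng.normal(size=(1, n_in))
y = rng.normal(size=(1, n_out))
W1 = rng.normal(scale=0.5, size=(n_hid, n_in)); b1 = np.zeros(n_hid)
W2 = rng.normal(scale=0.5, size=(n_out, n_hid)); b2 = np.zeros(n_out)

def forward(x):
    z1 = x @ W1.T + b1           # (1, n_hid) pre-activation
    h1 = np.maximum(z1, 0.0)     # ReLU
    y_hat = h1 @ W2.T + b2       # (1, n_out)
    return z1, h1, y_hat

def loss(y_hat):
    return np.mean((y_hat - y) ** 2)   # MSE averaged over output components

# Forward, then backward pass (chain rule, output -> input).
z1, h1, y_hat = forward(x)
n = y.size
delta2 = (2.0 / n) * (y_hat - y)            # (1, n_out)
dW2 = delta2.T @ h1                          # (n_out, n_hid) outer product
db2 = delta2.sum(axis=0)
relu_grad = (z1 > 0).astype(float)           # 1 where pre-activation was positive
delta1 = (delta2 @ W2) * relu_grad           # (1, n_hid)
dW1 = delta1.T @ x                           # (n_hid, n_in) outer product
db1 = delta1.sum(axis=0)

# Numerical gradient check on W1[0, 0] via central differences.
eps = 1e-6
W1[0, 0] += eps
L_plus = loss(forward(x)[2])
W1[0, 0] -= 2 * eps
L_minus = loss(forward(x)[2])
W1[0, 0] += eps                              # restore the weight
num_grad = (L_plus - L_minus) / (2 * eps)
print(abs(num_grad - dW1[0, 0]))             # should be very small
```

With a central difference the discrepancy should be on the order of \(\varepsilon^2\) plus floating-point noise; a mismatch larger than about 1e-4 usually signals a backprop bug.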
Extra practice
- Warm-up: Implement a numerical gradient check for `W1[0,0]`. Perturb the weight by ±ε, compute loss for each, and estimate the gradient as \((L_+ - L_-) / (2\varepsilon)\). Compare to your backprop gradient.
- Coding: Implement a full training loop: run forward pass, backward pass, and gradient descent update (`W -= lr * dW`) for 100 iterations. Print the loss every 20 steps and verify it decreases.
- Challenge: Extend the backprop implementation to a 3-layer network (add a second hidden layer). Write out the gradient formulas for the new layer before implementing.
- Variant: Change the loss from MSE to mean absolute error (MAE): \(L = \frac{1}{n}\sum|y - \hat{y}|\). How does the gradient \(\delta_2\) change? Implement and compare convergence speed.
- Debug: The backprop below uses the wrong condition for the ReLU gradient — it uses `>= 0` instead of the subgradient convention `> 0`. While both work in practice, the comment in the code incorrectly claims `>= 0` is always correct. More importantly, there's a sign error in `delta2`. Find and fix both issues.
- Conceptual: Why can't we simply define the ReLU derivative to be 0 everywhere (not just at z = 0, but at every z)? This is the "dead neuron" argument — explain what happens to the weights when the ReLU gradient is always 0. The term "subgradient" refers to the convention we use at z = 0.
- Recall: Write the four backpropagation equations for a 2-layer MSE network from memory: \(\delta_2\), \(\partial L/\partial W_2\), \(\delta_1\), and \(\partial L/\partial W_1\). What does each equation compute, and why does the order (output → input) matter?