Learning objectives
- Implement the full forward pass of a 2-layer MLP in NumPy, computing and storing every intermediate value.
- Extend the forward pass to a batch of inputs using matrix operations.
- Identify the shapes of all intermediate tensors in a forward pass and explain why storing them is necessary for backpropagation.
Concept and real-world motivation
Forward propagation is the computation that turns an input into a network output. It is the inference step: given the current network weights, what is the predicted output for this input? Every time you call a trained model to make a prediction, a forward pass runs. Every training step runs forward propagation first, then backpropagation.
For a 2-layer MLP: \[z_1 = W_1 x + b_1\] \[h_1 = \text{ReLU}(z_1)\] \[z_2 = W_2 h_1 + b_2\] \[\hat{y} = \text{softmax}(z_2)\]
Each \(z_\ell\) is called the pre-activation; the output-layer pre-activations \(z_2\) are also called logits. Each \(h_\ell\) is the post-activation (or hidden representation). The pre-activations must be stored during the forward pass because backpropagation needs them to compute gradients — specifically, the ReLU derivative \(\text{ReLU}'(z) = \mathbb{1}[z > 0]\) requires knowing the sign of each pre-activation.
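As a quick illustration of why the pre-activations are stored, the ReLU derivative is just a boolean mask computed from the stored \(z\) (a minimal sketch with made-up example values):

```python
import numpy as np

z1 = np.array([-1.5, 0.3, 0.0, 2.0])   # example pre-activations (illustrative)
h1 = np.maximum(0, z1)                  # forward: ReLU

# Backward: ReLU'(z) = 1[z > 0], computed from the stored pre-activation
relu_grad = (z1 > 0).astype(float)
print(relu_grad)                        # [0. 1. 0. 1.]
```

Note that the element at exactly \(z = 0\) gets gradient 0 under the convention \(\mathbb{1}[z > 0]\).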
During inference in DQN, the forward pass computes Q-values for all actions given the current state: \[Q(s, \cdot ; \theta) = W_3 \cdot \text{ReLU}(W_2 \cdot \text{ReLU}(W_1 s + b_1) + b_2) + b_3\] The agent then picks the action with the highest Q-value: \(a^* = \arg\max_a Q(s, a)\). This entire computation is a single forward pass.
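The DQN forward pass above can be sketched with randomly initialized, untrained weights (the state and layer sizes here are illustrative assumptions, not fixed by the text):

```python
import numpy as np

np.random.seed(0)
state_dim, hidden, n_actions = 4, 16, 2    # illustrative sizes (assumed)

# Two hidden layers with ReLU, linear output layer, matching the formula above
W1 = np.random.randn(hidden, state_dim) * 0.1; b1 = np.zeros(hidden)
W2 = np.random.randn(hidden, hidden) * 0.1;    b2 = np.zeros(hidden)
W3 = np.random.randn(n_actions, hidden) * 0.1; b3 = np.zeros(n_actions)

s = np.random.randn(state_dim)             # current state

h1 = np.maximum(0, W1 @ s + b1)
h2 = np.maximum(0, W2 @ h1 + b2)
q = W3 @ h2 + b3                           # Q(s, a) for every action a

a_star = int(np.argmax(q))                 # greedy action
print(q, a_star)
```

With untrained weights the Q-values are meaningless; the point is that one matrix chain produces all action values at once.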
Exercise: Implement the full forward pass for a 2-layer MLP. Print each intermediate value with its shape to see how the representation evolves layer by layer.
Professor’s hints
- `W1 @ x` does matrix-vector multiplication (same as `np.dot(W1, x)` for a 2D × 1D product).
- After layer 1: `z1 = W1 @ x + b1` has shape `(4,)`. After ReLU, `h1` also has shape `(4,)`, but some elements may be zeroed.
- The shapes chain: (3,) → (4,) → (4,) → (2,) → (2,). Each arrow is one layer’s operation.
- Store all intermediate values — you’ll need them in backpropagation.
Common pitfalls
- Broadcasting error with bias: if `b1` has shape `(4, 1)` instead of `(4,)`, adding it to `W1 @ x` (shape `(4,)`) silently broadcasts to a `(4, 4)` matrix instead of raising an error. Keep biases as 1D arrays.
- Not storing intermediates: in training mode you need `z1` and `z2` for backprop. A clean implementation stores them in a dict or as named variables.
- Applying softmax to hidden layers: softmax belongs on the output layer only (for classification). Applied to a hidden layer, it forces every activation into (0, 1) and makes them sum to 1, collapsing the hidden representation.
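The bias pitfall is easy to reproduce. A minimal sketch of the silent broadcast (weights are placeholder values):

```python
import numpy as np

W1 = np.ones((4, 3))
x = np.ones(3)

b1_good = np.zeros(4)          # shape (4,)
b1_bad = np.zeros((4, 1))      # shape (4, 1) -- the common mistake

z_good = W1 @ x + b1_good      # shape (4,), as intended
z_bad = W1 @ x + b1_bad        # (4,) + (4, 1) broadcasts to (4, 4)!

print(z_good.shape)  # (4,)
print(z_bad.shape)   # (4, 4)
```

No exception is raised, which is exactly why this bug is dangerous: the wrong shape propagates through later layers.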
Worked solution
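A minimal sketch of one possible solution, assuming small random weights and the 3 → 4 → 2 architecture from the hints (the specific values are illustrative):

```python
import numpy as np

np.random.seed(0)              # reproducible illustrative weights (assumed)

# Network: input dim 3 -> hidden dim 4 -> output dim 2
W1 = np.random.randn(4, 3) * 0.1
b1 = np.zeros(4)               # keep biases 1D, shape (4,)
W2 = np.random.randn(2, 4) * 0.1
b2 = np.zeros(2)

x = np.array([1.0, -2.0, 0.5])            # one input, shape (3,)

cache = {}                                 # store intermediates for backprop
cache["z1"] = W1 @ x + b1                  # pre-activation, shape (4,)
cache["h1"] = np.maximum(0, cache["z1"])   # ReLU, shape (4,)
cache["z2"] = W2 @ cache["h1"] + b2        # logits, shape (2,)

# Numerically stable softmax: subtract the max before exponentiating
exp = np.exp(cache["z2"] - cache["z2"].max())
y_hat = exp / exp.sum()                    # shape (2,), sums to 1

for name, value in cache.items():
    print(name, value.shape, value)
print("y_hat", y_hat.shape, y_hat)
```

The `cache` dict keeps every intermediate value, which is exactly what backpropagation will read later.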
Extra practice
- Warm-up: Extend the forward pass to process 5 inputs at once (batch forward pass). Change `x` to a `(5, 3)` matrix where each row is one input, and update the math to `z1 = x @ W1.T + b1`.
- Coding: Add a third hidden layer to the forward pass. Architecture: 3 → 4 → 4 → 3 → 2. Initialize `W3` and `b3`. Print all intermediate shapes.
- Challenge: Why must we store intermediate values \(z_1, h_1\) during the forward pass? Work through the backpropagation formulas for this network and identify which intermediate values are needed for each gradient computation.
- Variant: Implement the forward pass for a DQN-style network: input is a state vector of dimension 8 (like CartPole with full state), two hidden layers of 64 neurons each (ReLU), output is 2 Q-values (one per action). Use `np.random.seed(0)`. Print the Q-values and the greedy action.
- Debug: The forward pass below passes the bias as a 2D column vector instead of a 1D vector, so it silently broadcasts to the wrong shape. Find and fix the bug.
- Conceptual: In a trained DQN, the hidden layer activations \(h_1\) and \(h_2\) represent learned features of the state. What might these features represent for a game like Pong? How does this differ from the raw pixel input?
- Recall: Write the four equations for a 2-layer forward pass from memory. Name every variable and give its shape for a network with input dimension 3, hidden dimension 4, and output dimension 2.
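For the warm-up, the batch version replaces matrix-vector products with matrix-matrix products; a minimal sketch with illustrative random weights:

```python
import numpy as np

np.random.seed(0)
W1 = np.random.randn(4, 3) * 0.1; b1 = np.zeros(4)
W2 = np.random.randn(2, 4) * 0.1; b2 = np.zeros(2)

x = np.random.randn(5, 3)       # batch of 5 inputs, one per row

z1 = x @ W1.T + b1              # (5, 3) @ (3, 4) -> (5, 4); b1 broadcasts per row
h1 = np.maximum(0, z1)          # (5, 4)
z2 = h1 @ W2.T + b2             # (5, 2)

# Row-wise softmax, stabilized by subtracting each row's max
exp = np.exp(z2 - z2.max(axis=1, keepdims=True))
y_hat = exp / exp.sum(axis=1, keepdims=True)   # each row sums to 1

print(z1.shape, h1.shape, z2.shape, y_hat.shape)
```

Note the transposes: with one input per row, each layer is `x @ W.T` rather than `W @ x`, and the 1D bias broadcasts across the batch dimension for free.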