Learning objectives
- Describe the architecture of a multi-layer perceptron and name each component.
- Count the total number of trainable parameters in an MLP given its layer sizes.
- Implement the forward pass of a small MLP in NumPy using pre-given weights and verify it solves XOR.
Concept and real-world motivation
A single perceptron can only draw a straight line through the input space — it can solve AND and OR, but not XOR. The solution is to stack multiple layers: each layer transforms the input into a new representation, and successive non-linear transformations can carve out arbitrarily complex decision boundaries.
A multi-layer perceptron (MLP) consists of:
- An input layer (no computation, just the raw features \(x\))
- One or more hidden layers, each applying \(h = f(Wx + b)\) where \(W\) is a weight matrix, \(b\) is a bias vector, and \(f\) is an activation function
- An output layer that produces the network’s prediction
Each layer \(\ell\) has a weight matrix \(W_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}\) and a bias vector \(b_\ell \in \mathbb{R}^{n_\ell}\). The total parameter count for one layer connecting \(n_{in}\) neurons to \(n_{out}\) neurons is \(n_{out} \times n_{in} + n_{out}\) (weights plus biases).
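The counting formula can be sketched as a short helper (the layer sizes in the example call are illustrative):

```python
# Count trainable parameters of an MLP from its layer sizes.
# sizes[0] is the input width; each following entry is a layer's output width.
def mlp_param_count(sizes):
    total = 0
    for n_in, n_out in zip(sizes, sizes[1:]):
        total += n_out * n_in + n_out  # weights plus biases
    return total

# Illustrative sizes: 3 inputs -> 5 hidden -> 2 outputs
print(mlp_param_count([3, 5, 2]))  # (5*3 + 5) + (2*5 + 2) = 20 + 12 = 32
```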
The Q-network in DQN is typically a 3-layer MLP: input layer = state representation, two hidden layers with 64 or 512 ReLU neurons, output layer = one Q-value per action. When Atari DQN processes pixel frames, a convolutional front-end precedes the MLP, but the final layers are exactly this structure.
Architecture: 4 inputs → hidden layer of 8 (ReLU) → hidden layer of 4 (ReLU) → 2 outputs.
Exercise: Count the parameters in a 3-layer MLP, then initialize the weight matrices and bias vectors with np.random.randn. Verify the shapes are correct.
Professor’s hints
- Layer 1: \(8 \times 4 = 32\) weights + 8 biases = 40 parameters.
- Layer 2: \(4 \times 8 = 32\) weights + 4 biases = 36 parameters.
- Layer 3: \(2 \times 4 = 8\) weights + 2 biases = 10 parameters.
- Total: 40 + 36 + 10 = 86 parameters.
- Weight matrix shape for layer \(\ell\): `(n_out, n_in)` — rows are output neurons, columns are input neurons.
Common pitfalls
- Confusing weight matrix orientation: \(W \in \mathbb{R}^{n_{out} \times n_{in}}\) so the operation is \(Wx\) with \(x \in \mathbb{R}^{n_{in}}\). If you use \(W \in \mathbb{R}^{n_{in} \times n_{out}}\) you need to transpose: \(W^T x\). Pick one convention and stick to it.
- Forgetting bias parameters: Each layer has both a weight matrix AND a bias vector. Don’t forget to count the biases.
- Using the wrong shape for np.random.randn: `np.random.randn(n_out, n_in)` creates a matrix, `np.random.randn(n_out)` creates a vector. Use the correct form for each.
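A small shape check illustrating both pitfalls (the sizes here are arbitrary):

```python
import numpy as np

n_in, n_out = 8, 4
W = np.random.randn(n_out, n_in)   # matrix: one row per output neuron
b = np.random.randn(n_out)         # vector: one bias per output neuron
x = np.random.randn(n_in)

h = W @ x + b                      # (4, 8) @ (8,) -> (4,): shapes line up
assert h.shape == (n_out,)

W_wrong = np.random.randn(n_in, n_out)   # transposed convention
# W_wrong @ x raises a shape error; W_wrong.T @ x recovers the result
assert (W_wrong.T @ x).shape == (n_out,)
```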
Worked solution
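A sketch of one possible solution, assuming the layer sizes from the hints (4 inputs → 8 → 4 → 2 outputs):

```python
import numpy as np

sizes = [4, 8, 4, 2]  # input width followed by each layer's output width

weights, biases = [], []
for n_in, n_out in zip(sizes, sizes[1:]):
    weights.append(np.random.randn(n_out, n_in))  # (n_out, n_in) convention
    biases.append(np.random.randn(n_out))

total = 0
for W, b in zip(weights, biases):
    print(f"W: {W.shape}, b: {b.shape}")
    total += W.size + b.size

print("total parameters:", total)  # 40 + 36 + 10 = 86
```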
Extra practice
- Warm-up: A network solving XOR. Use these pre-given weights to do a forward pass on all 4 XOR inputs and verify the correct outputs.
- Coding: Calculate the parameter count for the DQN network used in Atari: input=84×84×4 pixels (=28224 features), hidden1=512, hidden2=256, output=18 actions. How many parameters does this have?
- Challenge: The universal approximation theorem states that a single-hidden-layer MLP can approximate any continuous function to arbitrary precision. But in practice, deeper networks are used instead. Why? What are the practical advantages of depth over width?
- Variant: A 5-layer MLP with layers [2, 4, 8, 4, 2, 1]. Count the total parameters. Initialize all weights as NumPy arrays and print shapes.
- Debug: The MLP below has a transposed weight matrix in the second layer, causing a shape mismatch. Find and fix the bug.
- Conceptual: Why do we need non-linear activations between layers? Show algebraically that two consecutive linear layers without activation collapse to a single linear layer. Let \(h = W_2(W_1 x + b_1) + b_2\) and simplify.
- Recall: Name the three types of layers in an MLP and what computation each performs. State the weight matrix shape convention (rows = ?, columns = ?).
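The warm-up's pre-given weights are not reproduced here; as a self-check, the hand-chosen weights below (an assumption of this sketch, not the course's set) form a 2-2-1 ReLU network that solves XOR:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

# Hand-chosen weights for a 2-2-1 ReLU network computing XOR.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])
b2 = 0.0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = relu(W1 @ np.array(x, dtype=float) + b1)  # hidden representation
    y = W2 @ h + b2                               # scalar output
    print(x, "->", y)  # outputs 0, 1, 1, 0
```

The first hidden unit fires on "at least one input active", the second on "both inputs active"; the output layer subtracts twice the second from the first, which is exactly XOR.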