Quick-fire practice for DL Foundations. Work through these after completing the main pages. Answers in the collapses; pyrepl blocks for coding problems.
Recall (R)
R1. What is the vanishing gradient problem? When does it occur?
Answer
The vanishing gradient problem occurs when gradients become extremely small as they propagate backward through many layers. Each layer multiplies the gradient by the derivative of its activation function; for sigmoid that derivative is at most 0.25 and for tanh at most 1, so stacking many layers shrinks the gradient toward zero. Earlier layers receive almost no learning signal.
It occurs most severely with sigmoid or tanh activations in deep networks (many layers). ReLU largely solves it because ReLU’s derivative is 1 for positive inputs (no squashing).
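A quick numerical sanity check (a sketch in NumPy, assuming a chain of sigmoid layers all operating near \(z=0\), where the derivative peaks):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Backprop multiplies in one sigmoid' factor per layer. Even at z = 0,
# where sigmoid' is at its maximum of 0.25, the product collapses fast.
grad = 1.0
for _ in range(10):
    grad *= sigmoid_deriv(0.0)  # exactly 0.25 per layer

print(grad)  # 0.25**10, roughly 1e-6
```

After only 10 layers the upstream gradient is about a millionth of its original size.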
R2. Write the ReLU function and its derivative.
Answer
\(\text{ReLU}(z) = \max(0, z)\)
Derivative: \(\frac{d}{dz}\text{ReLU}(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}\)
In NumPy: `relu = lambda z: np.maximum(0, z)` and `relu_backward = lambda dz, z: dz * (z > 0)`.
R3. What is the purpose of bias terms in a neural network?
Answer
Bias terms shift each neuron's pre-activation: \(z = Wx + b\). They let a neuron activate (or stay off) independently of the input, so decision boundaries can sit anywhere in input space. Without biases, every layer's hyperplane must pass through the origin, which restricts the functions the network can represent.
R4. What is the difference between a batch and an epoch?
Answer
- Batch (mini-batch): A subset of the training data processed in one forward-backward pass. Typical sizes: 32, 64, 128 samples.
- Epoch: One complete pass through the entire training dataset. Multiple epochs = the data is seen many times.
One epoch = (dataset size / batch size) gradient update steps. If dataset has 1000 samples and batch size is 100, one epoch = 10 gradient steps.
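The arithmetic above as a one-liner (a trivial sketch; `//` assumes the batch size divides the dataset evenly, otherwise a final smaller batch adds one more step):

```python
dataset_size, batch_size = 1000, 100
steps_per_epoch = dataset_size // batch_size
print(steps_per_epoch)  # 10
```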
R5. Why do we need activation functions? What happens without them?
Answer
Activation functions introduce non-linearity. Without them, a stack of linear layers collapses to a single linear transformation: \(W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2) = W'x + b'\). No matter how many layers you add, you can only represent linear functions — XOR and most real-world problems are non-linear.
Activation functions (ReLU, sigmoid, tanh) allow the network to learn non-linear decision boundaries.
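A numerical illustration of the collapse (a sketch with random NumPy matrices; the layer sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two linear layers with no activation in between...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...equal one linear layer with W' = W2 W1 and b' = W2 b1 + b2.
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(two_layers, one_layer))  # True
```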
Compute (C)
C1. Forward pass: \(x=[1,0]\), \(W=\begin{bmatrix}1&2\\3&4\end{bmatrix}\), \(b=[0,0]\), activation = ReLU. Compute \(h = \text{ReLU}(Wx+b)\).
Answer
\(Wx + b = \begin{bmatrix}1&2\\3&4\end{bmatrix}\begin{bmatrix}1\\0\end{bmatrix} + \begin{bmatrix}0\\0\end{bmatrix} = \begin{bmatrix}1\\3\end{bmatrix}\)
\(h = \text{ReLU}\begin{bmatrix}1\\3\end{bmatrix} = \begin{bmatrix}1\\3\end{bmatrix}\) (both positive, unchanged)
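The same computation checked in NumPy (a minimal sketch):

```python
import numpy as np

x = np.array([1.0, 0.0])
W = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([0.0, 0.0])

h = np.maximum(0, W @ x + b)  # ReLU(Wx + b)
print(h)  # [1. 3.]
```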
C2. MSE loss for predictions \([0.8, 0.2]\) and true values \([1, 0]\).
Answer
\(\text{MSE} = \frac{1}{2}\left[(0.8-1)^2 + (0.2-0)^2\right] = \frac{1}{2}(0.04 + 0.04) = 0.04\)
C3. Softmax for logits \([2, 1, 0.5]\). Show each step.
Answer
Step 1 — exponentiate: \(e^2 \approx 7.389\), \(e^1 \approx 2.718\), \(e^{0.5} \approx 1.649\).
Step 2 — sum: \(7.389 + 2.718 + 1.649 \approx 11.756\).
Step 3 — divide: \([7.389/11.756, 2.718/11.756, 1.649/11.756] \approx [0.629, 0.231, 0.140]\).
Check: \(0.629 + 0.231 + 0.140 = 1.000\) ✓
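The three steps in NumPy (a sketch; subtracting the max before exponentiating is the standard numerical-stability trick and does not change the result):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5])
shifted = logits - logits.max()                  # stability shift
probs = np.exp(shifted) / np.exp(shifted).sum()  # exponentiate, then normalize

print(np.round(probs, 3))  # approximately [0.629, 0.231, 0.140]
```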
C4. SGD update: \(w=1.5\), gradient=\(0.3\), \(lr=0.1\). What is \(w_{new}\)?
Answer
\(w_{new} = w - lr \cdot \text{gradient} = 1.5 - 0.1 \times 0.3 = 1.5 - 0.03 = 1.47\)
C5. L2 penalty for \(w=[2, -1, 0.5]\) with \(\lambda=0.01\).
Answer
\(\lambda \sum_i w_i^2 = 0.01 \times (2^2 + (-1)^2 + 0.5^2) = 0.01 \times 5.25 = 0.0525\). (With the \(\frac{\lambda}{2}\) convention, halve this to 0.02625.)
Code (K)
K1. Implement relu_backward(dz, z) — gradient of ReLU.
Answer
The gradient of ReLU is 1 where the input was positive, 0 where it was negative (or zero). Multiply element-wise by the incoming gradient dz (chain rule).
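A minimal implementation along those lines (a sketch in NumPy):

```python
import numpy as np

def relu_backward(dz, z):
    """Gradient of ReLU: pass dz through where z > 0, zero elsewhere."""
    return dz * (z > 0)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
dz = np.ones_like(z)
print(relu_backward(dz, z))  # [0. 0. 0. 1. 1.]
```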
K2. Implement one_hot(y, n_classes) that converts integer labels to one-hot vectors.
Answer
np.arange(len(y)) selects each row, y indexes the column — this places a 1 in the correct position for each sample simultaneously.
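A sketch of the paired fancy-indexing version described above:

```python
import numpy as np

def one_hot(y, n_classes):
    """Integer labels -> one-hot rows, via paired fancy indexing."""
    out = np.zeros((len(y), n_classes))
    out[np.arange(len(y)), y] = 1.0  # row i, column y[i], all at once
    return out

print(one_hot(np.array([0, 2, 1]), 3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```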
Debug (D)
D1. Backprop bug: gradients for W2 computed before gradients for the output layer.
Answer
In backprop, you must process layers in reverse order: output layer first, then hidden layers. Computing dW2 requires delta2 (the gradient at the output), and computing dW1 requires delta1, which in turn depends on delta2 and W2. If the code tries to compute dW2 before delta2 (or dW1 before delta1), it references a quantity that hasn't been computed yet, and the chain rule is broken.
Fix: always follow the chain rule order — compute output layer gradients first, then propagate backward:
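A sketch of the corrected ordering for a 2-layer network (the cached names x, a1, y_hat and the MSE loss are assumptions; delta2/delta1 follow the naming above):

```python
import numpy as np

def backward(x, a1, y_hat, y, W2):
    """Backward pass in the correct order: output layer first."""
    n = x.shape[0]
    # 1. Output layer: gradient of MSE w.r.t. the output pre-activations.
    delta2 = (y_hat - y) / n
    dW2 = a1.T @ delta2          # needs delta2, so it must come after it
    db2 = delta2.sum(axis=0)
    # 2. Only now propagate back through W2 and the hidden ReLU.
    delta1 = (delta2 @ W2.T) * (a1 > 0)
    dW1 = x.T @ delta1
    db1 = delta1.sum(axis=0)
    return dW1, db1, dW2, db2
```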
D2. Adam bug: bias correction terms missing (\(\hat{m}\) and \(\hat{v}\) not computed).
Answer
Without bias correction, Adam's early updates are mis-scaled. Both \(m\) and \(v\) start at zero, so at step \(t=1\) with \(\beta_1=0.9\), \(\beta_2=0.999\): \(m = 0.1\,g\) and \(v = 0.001\,g^2\). The raw ratio is \(m/\sqrt{v} \approx 3.16\,\text{sign}(g)\) instead of the intended \(\hat{m}/\sqrt{\hat{v}} = \text{sign}(g)\), so the first step is roughly 3× too large, and the distortion decays over subsequent steps. The corrections \(\hat{m} = m/(1-\beta_1^t)\) and \(\hat{v} = v/(1-\beta_2^t)\) restore the intended scale: at \(t=1\), \(\hat{m} = g\) and \(\hat{v} = g^2\).
Fix:
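One Adam step with the correction terms restored (a sketch; the hyperparameters are the usual defaults, and the step count t must start at 1):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Single Adam update with bias correction."""
    m = beta1 * m + (1 - beta1) * g          # first moment
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment
    m_hat = m / (1 - beta1 ** t)             # undo the zero-init shrinkage
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

At \(t=1\) this gives \(\hat{m} = g\) and \(\hat{v} = g^2\), so the first step has magnitude \(\approx lr\), as intended.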
Challenge (X)
X1. Implement a 3-layer MLP with Adam optimizer. Train on XOR data (4 samples) for 1000 epochs. XOR: inputs [(0,0),(0,1),(1,0),(1,1)], outputs [0,1,1,0].
Answer
XOR requires at least one hidden layer — it’s not linearly separable. After ~1000 epochs with Adam and 3 layers, the network converges: outputs ≈ [0, 1, 1, 0].
Key insight: if the loss doesn't decrease, try a larger weight-initialization scale (multiply the random init by 1.0 instead of 0.5) or a higher learning rate.
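A self-contained sketch of one such network (the 2→8→8→1 architecture, tanh hidden units, sigmoid output, binary cross-entropy loss, and all hyperparameters here are assumptions; many similar configurations also converge):

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR data: 4 samples, not linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Three weight layers: 2 -> 8 -> 8 -> 1 (tanh hidden, sigmoid output).
sizes = [2, 8, 8, 1]
Ws = [rng.normal(size=(a, b)) * 0.5 for a, b in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]

# Adam state: first and second moments per parameter array.
mW = [np.zeros_like(W) for W in Ws]; vW = [np.zeros_like(W) for W in Ws]
mb = [np.zeros_like(b) for b in bs]; vb = [np.zeros_like(b) for b in bs]
lr, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    # Forward pass, caching activations for backprop.
    a = [X]
    for i in range(len(Ws)):
        z = a[-1] @ Ws[i] + bs[i]
        a.append(sigmoid(z) if i == len(Ws) - 1 else np.tanh(z))
    y_hat = a[-1]

    # Backward pass: binary cross-entropy + sigmoid gives a simple delta.
    delta = (y_hat - y) / len(X)
    for i in reversed(range(len(Ws))):
        dW, db = a[i].T @ delta, delta.sum(axis=0)
        if i > 0:
            delta = (delta @ Ws[i].T) * (1 - a[i] ** 2)  # tanh derivative
        # Adam update with bias correction for each parameter of layer i.
        for p, g, m, v in ((Ws, dW, mW, vW), (bs, db, mb, vb)):
            m[i] = b1 * m[i] + (1 - b1) * g
            v[i] = b2 * v[i] + (1 - b2) * g ** 2
            p[i] = p[i] - lr * (m[i] / (1 - b1 ** t)) / (np.sqrt(v[i] / (1 - b2 ** t)) + eps)

print(np.round(y_hat.ravel(), 2))  # approaches [0, 1, 1, 0]
```

Note that delta is propagated through the pre-update weights, and the output layer is processed first, mirroring the D1 fix above.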