Learning objectives
- Implement ReLU, sigmoid, tanh, and softmax in NumPy and state their formulas.
- Explain why activation functions are necessary — without them, all layers collapse to one linear transformation.
- Identify which activation to use for hidden layers versus output layers in RL networks.
Concept and real-world motivation
Without activation functions, a neural network is just a sequence of matrix multiplications — which collapses to a single matrix multiplication no matter how many layers you stack. Mathematically: \(W_2(W_1 x + b_1) + b_2 = (W_2 W_1) x + (W_2 b_1 + b_2)\), which is just \(Ax + c\) for some matrix \(A\) and vector \(c\). A ten-layer linear network is no more expressive than a one-layer linear network. Activation functions break this linearity, allowing networks to represent complex, non-linear functions.
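This collapse can be verified numerically in a few lines; the shapes and random weights below are chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked linear layers...
two_layer = W2 @ (W1 @ x + b1) + b2

# ...equal a single linear layer with A = W2 W1 and c = W2 b1 + b2.
A, c = W2 @ W1, W2 @ b1 + b2
one_layer = A @ x + c

print(np.allclose(two_layer, one_layer))  # True
```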
The four most important activations:
- ReLU \(f(z) = \max(0, z)\): the workhorse of deep learning. Fast, sparse, avoids vanishing gradients. Default for hidden layers.
- Sigmoid \(\sigma(z) = \frac{1}{1+e^{-z}}\): squashes output to (0,1). Used for binary classification outputs. Prone to vanishing gradients in deep networks.
- Tanh \(\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\): squashes to (−1, 1). Zero-centered, which is better than sigmoid for hidden layers. Still suffers from vanishing gradients for very large |z|.
- Softmax \(\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}\): converts a vector of scores into probabilities (sum to 1). Used for multi-class classification outputs and policy networks in RL.
In RL networks: ReLU is the standard activation for hidden layers. The output layer depends on the task: for DQN, the output is unbounded Q-values (no activation, or linear output). For policy gradient methods, the output is action probabilities — softmax over actions. The critic network outputs a scalar value estimate — also linear output.
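As a sketch of these conventions (layer sizes and weights here are made up for illustration), a one-hidden-layer policy network uses ReLU in the hidden layer and softmax at the output; a DQN would stop at the unbounded logits instead:

```python
import numpy as np

rng = np.random.default_rng(1)
state = rng.normal(size=4)                         # e.g., a 4-dimensional observation
W_h, b_h = rng.normal(size=(16, 4)), np.zeros(16)  # hidden layer parameters
W_o, b_o = rng.normal(size=(3, 16)), np.zeros(3)   # 3 discrete actions

hidden = np.maximum(0, W_h @ state + b_h)  # ReLU hidden layer
logits = W_o @ hidden + b_o                # unbounded scores (a DQN would output these as Q-values)

# Softmax output layer: action probabilities for a policy-gradient method.
exp_l = np.exp(logits - np.max(logits))    # subtract max for numerical stability
probs = exp_l / exp_l.sum()
print(probs, probs.sum())                  # probabilities summing to 1
```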
Exercise: Implement all four activation functions in NumPy. Compute and print ReLU, sigmoid, and tanh on an array of values, then compute softmax on a logit vector.
Professor’s hints
- ReLU: `np.maximum(0, z)` (not `np.max`). `np.maximum` is element-wise; `np.max` returns a single value.
- Softmax: `exp_z = np.exp(z); return exp_z / np.sum(exp_z)`.
- Tanh is available as `np.tanh(z)`; use the manual formula to understand it, then verify against `np.tanh`.
- The sigmoid output at z=0 is exactly 0.5. The tanh output at z=0 is exactly 0.
Common pitfalls
- `np.max` vs `np.maximum` in ReLU: `np.maximum(0, z)` returns an array of the same shape as z; `np.max(z)` returns a single scalar.
- Softmax numerical instability: computing `np.exp(z)` for large z (e.g., z=1000) overflows to `inf`. The fix is to subtract the maximum first: `exp_z = np.exp(z - np.max(z))`.
- Softmax for a scalar: softmax is defined for vectors. Applying it to a single number always returns 1.0, which is trivially correct but useless.
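Both pitfalls can be checked directly; the input values below are illustrative:

```python
import numpy as np

z = np.array([-2.0, 0.5, 3.0])

# Pitfall 1: np.max collapses to a scalar; np.maximum is element-wise.
scalar = np.max(z)          # 3.0, a single number: wrong for ReLU
relu_z = np.maximum(0, z)   # [0., 0.5, 3.], same shape as z: correct
print(scalar, relu_z)

# Pitfall 2: the naive softmax overflows for large inputs.
big = np.array([1000.0, 1001.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(big) / np.sum(np.exp(big))   # [nan, nan]
exp_b = np.exp(big - np.max(big))               # subtract the max first
stable = exp_b / exp_b.sum()
print(naive, stable)
```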
Worked solution
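One possible implementation of the exercise (a sketch, not the only valid solution):

```python
import numpy as np

def relu(z):
    # Element-wise max with 0 (np.maximum, not np.max).
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Manual formula; np.tanh(z) gives the same result.
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def softmax(z):
    # Subtract the max for numerical stability; the shift cancels out.
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print("relu:   ", relu(z))
print("sigmoid:", sigmoid(z))
print("tanh:   ", tanh(z))

logits = np.array([1.0, 2.0, 3.0])
print("softmax:", softmax(logits))  # sums to 1
```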
Extra practice
- Warm-up: Compute softmax by hand for `[1, 2, 3]`. Step 1: compute \(e^1, e^2, e^3\). Step 2: sum them. Step 3: divide each by the sum. Check the result against your implementation.
- Coding: The “dead ReLU” problem: if z is always negative, ReLU always outputs 0 and the gradient is always 0, so the neuron never learns. Demonstrate this with `z = np.array([-5.0, -3.0, -2.0, -1.0])`. Compute ReLU and its gradient (1 where z > 0, else 0).
- Challenge: Softmax is sensitive to the scale of its inputs. Compute `softmax([1, 2, 3])`, `softmax([10, 20, 30])`, and `softmax([0.1, 0.2, 0.3])`. Describe how scaling changes the “peakiness” of the output distribution. How does this relate to the temperature parameter in RL exploration?
- Variant: Leaky ReLU: \(f(z) = z\) if \(z > 0\), else \(0.01 z\). This avoids dead neurons. Implement it and compare its output to ReLU for `z = [-3, -1, 0, 1, 3]`.
- Debug: The softmax below forgets to subtract the maximum before exponentiating, causing overflow for large inputs. Fix the numerical stability bug.
- Conceptual: Why is tanh often preferred over sigmoid for hidden layers? Think about the output range (sigmoid: 0 to 1, tanh: −1 to 1) and what this means for the gradient flow during backpropagation.
- Recall: Name all four activation functions covered in this page, write their formulas from memory, and state one use case for each (where in a neural network and for what task).
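The debug exercise above refers to a snippet that is not reproduced on this page; a version exhibiting the described bug might look like the following (hypothetical code; fixing it is the exercise):

```python
import numpy as np

def softmax_buggy(z):
    # Bug: exponentiates z directly, so np.exp overflows for large inputs
    # (e.g., z = 1000 gives inf, and inf / inf is nan).
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

print(softmax_buggy(np.array([1.0, 2.0, 3.0])))  # fine for small inputs
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_buggy(np.array([1000.0, 1001.0, 1002.0])))  # [nan nan nan]
```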