Learning objectives
- Implement ReLU, sigmoid, tanh, and softmax in NumPy and state their formulas.
- Explain why activation functions are necessary — without them, all layers collapse to one linear transformation.
- Identify which activation to use for hidden layers versus output layers in RL networks.
Concept and real-world motivation
Without activation functions, a neural network is just a sequence of matrix multiplications — which collapses to a single matrix multiplication no matter how many layers you stack. Mathematically: \(W_2(W_1 x + b_1) + b_2 = (W_2 W_1) x + (W_2 b_1 + b_2)\), which is just \(Ax + c\) for some matrix \(A\) and vector \(c\). A ten-layer linear network is no more expressive than a one-layer linear network. Activation functions break this linearity, allowing networks to represent complex, non-linear functions.
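This collapse can be verified numerically in a few lines; the shapes and random weights below are chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked linear layers...
two_layer = W2 @ (W1 @ x + b1) + b2

# ...equal a single linear layer with A = W2 W1 and c = W2 b1 + b2.
A, c = W2 @ W1, W2 @ b1 + b2
one_layer = A @ x + c

print(np.allclose(two_layer, one_layer))  # True
```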
The four most important activations:
- ReLU \(f(z) = \max(0, z)\): the workhorse of deep learning. Fast, sparse, avoids vanishing gradients. Default for hidden layers.
- Sigmoid \(\sigma(z) = \frac{1}{1+e^{-z}}\): squashes output to (0,1). Used for binary classification outputs. Prone to vanishing gradients in deep networks.
- Tanh \(\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\): squashes to (−1, 1). Zero-centered, which is better than sigmoid for hidden layers. Still suffers from vanishing gradients for very large |z|.
- Softmax \(\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}\): converts a vector of scores into probabilities (sum to 1). Used for multi-class classification outputs and policy networks in RL.
In RL networks: ReLU is the standard activation for hidden layers. The output layer depends on the task: for DQN, the output is unbounded Q-values (no activation, or linear output). For policy gradient methods, the output is action probabilities — softmax over actions. The critic network outputs a scalar value estimate — also linear output.
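As a sketch of these conventions (layer sizes and weights here are made up for illustration), a one-hidden-layer policy network uses ReLU in the hidden layer and softmax at the output; a DQN would stop at the unbounded logits instead:

```python
import numpy as np

rng = np.random.default_rng(1)
state = rng.normal(size=4)                         # e.g., a 4-dimensional observation
W_h, b_h = rng.normal(size=(16, 4)), np.zeros(16)  # hidden layer parameters
W_o, b_o = rng.normal(size=(3, 16)), np.zeros(3)   # 3 discrete actions

hidden = np.maximum(0, W_h @ state + b_h)  # ReLU hidden layer
logits = W_o @ hidden + b_o                # unbounded scores (a DQN would output these as Q-values)

# Softmax output layer: action probabilities for a policy-gradient method.
exp_l = np.exp(logits - np.max(logits))    # subtract max for numerical stability
probs = exp_l / exp_l.sum()
print(probs, probs.sum())                  # probabilities summing to 1
```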
Exercise: Implement all four activation functions in NumPy. Compute and print ReLU, sigmoid, and tanh on an array of values, then compute softmax on a logit vector.
Professor’s hints
- ReLU: `np.maximum(0, z)` (not `np.max`). `np.maximum` is element-wise; `np.max` returns a single value.
- Softmax: `exp_z = np.exp(z); return exp_z / np.sum(exp_z)`.
- Tanh is available as `np.tanh(z)`; use the manual formula to understand it, then verify against `np.tanh`.
- The sigmoid output at z=0 is exactly 0.5. The tanh output at z=0 is exactly 0.
Common pitfalls
- `np.max` vs `np.maximum` in ReLU: `np.maximum(0, z)` returns an array of the same shape as z; `np.max(z)` returns a single scalar.
- Softmax numerical instability: computing `np.exp(z)` for large z (e.g., z=1000) overflows to `inf`. The fix is to subtract the maximum first: `exp_z = np.exp(z - np.max(z))`.
- Softmax for a scalar: softmax is defined for vectors. Applying it to a single number always returns 1.0, which is trivially correct but useless.
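Both pitfalls can be checked directly; the input values below are illustrative:

```python
import numpy as np

z = np.array([-2.0, 0.5, 3.0])

# Pitfall 1: np.max collapses to a scalar; np.maximum is element-wise.
scalar = np.max(z)          # 3.0, a single number: wrong for ReLU
relu_z = np.maximum(0, z)   # [0., 0.5, 3.], same shape as z: correct
print(scalar, relu_z)

# Pitfall 2: the naive softmax overflows for large inputs.
big = np.array([1000.0, 1001.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(big) / np.sum(np.exp(big))   # [nan, nan]
exp_b = np.exp(big - np.max(big))               # subtract the max first
stable = exp_b / exp_b.sum()
print(naive, stable)
```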
Worked solution
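One possible implementation of the exercise (a sketch, not the only valid solution):

```python
import numpy as np

def relu(z):
    # Element-wise max with 0 (np.maximum, not np.max).
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Manual formula; np.tanh(z) gives the same result.
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def softmax(z):
    # Subtract the max for numerical stability; the shift cancels out.
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print("relu:   ", relu(z))
print("sigmoid:", sigmoid(z))
print("tanh:   ", tanh(z))

logits = np.array([1.0, 2.0, 3.0])
print("softmax:", softmax(logits))  # sums to 1
```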
Extra practice
- Warm-up: Compute softmax by hand for `[1, 2, 3]`. Step 1: compute \(e^1, e^2, e^3\). Step 2: sum them. Step 3: divide each by the sum. Check the result against your implementation.
- Coding: The “dead ReLU” problem: if z is always negative, ReLU always outputs 0 and the gradient is always 0, so the neuron never learns. Demonstrate this with `z = np.array([-5.0, -3.0, -2.0, -1.0])`. Compute ReLU and its gradient (1 where z > 0, else 0).
- Challenge: Softmax is sensitive to the scale of its inputs. Compute `softmax([1, 2, 3])`, `softmax([10, 20, 30])`, and `softmax([0.1, 0.2, 0.3])`. Describe how scaling changes the “peakiness” of the output distribution. How does this relate to the temperature parameter in RL exploration?
- Variant: Leaky ReLU: \(f(z) = z\) if \(z > 0\), else \(0.01 z\). This avoids dead neurons. Implement it and compare its output to ReLU for `z = [-3, -1, 0, 1, 3]`.
- Debug: The softmax below forgets to subtract the maximum before exponentiating, causing overflow for large inputs. Fix the numerical stability bug.
- Conceptual: Why is tanh often preferred over sigmoid for hidden layers? Think about the output range (sigmoid: 0 to 1, tanh: −1 to 1) and what this means for the gradient flow during backpropagation.
- Recall: Name all four activation functions covered in this page, write their formulas from memory, and state one use case for each (where in a neural network and for what task).
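The debug exercise above refers to a snippet that is not reproduced on this page; a version exhibiting the described bug might look like the following (hypothetical code; fixing it is the exercise):

```python
import numpy as np

def softmax_buggy(z):
    # Bug: exponentiates z directly, so np.exp overflows for large inputs
    # (e.g., z = 1000 gives inf, and inf / inf is nan).
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

print(softmax_buggy(np.array([1.0, 2.0, 3.0])))  # fine for small inputs
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_buggy(np.array([1000.0, 1001.0, 1002.0])))  # [nan nan nan]
```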