Learning objectives
- Implement mean squared error (MSE) and binary cross-entropy loss in NumPy.
- Explain when to use each loss function and connect them to their RL equivalents.
- Identify the numerical stability issue in cross-entropy and fix it with an epsilon clamp.
Concept and real-world motivation
A neural network learns by minimizing a loss function — a scalar that measures how wrong the current predictions are. The loss function is the signal that backpropagation differentiates to compute gradients. Choose the wrong loss for your task, and the network will optimize for the wrong thing.
Mean Squared Error (MSE) is the standard loss for regression tasks — when the output is a continuous number: \[L_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]
Cross-Entropy (CE) is the standard loss for classification tasks — when the output is a probability distribution: \[L_{\text{CE}} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c} y_{ic} \log(\hat{p}_{ic})\]
Binary Cross-Entropy (BCE) is a special case for binary (two-class) problems: \[L_{\text{BCE}} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]\]
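As a quick illustration, the MSE and BCE formulas above can be evaluated directly in NumPy (the sample values below are made up for demonstration):

```python
import numpy as np

# Regression example: MSE is the mean of squared residuals
y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5, 0.0, 2.1])
mse = np.mean((y_true - y_pred) ** 2)

# Binary classification example: labels in {0, 1}, predicted probabilities in (0, 1)
labels = np.array([1.0, 0.0, 1.0])
probs = np.array([0.9, 0.2, 0.6])
bce = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

print(mse, bce)
```

Note that each BCE term only "reads" the probability assigned to the true class: for a label of 1 the first term is active, for a label of 0 the second.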
In RL:
- DQN uses MSE between the predicted Q-value and the Bellman target: \(L = (r + \gamma \max_{a'} Q(s', a') - Q(s, a))^2\). This is regression — predicting a scalar value.
- Policy gradient methods use a variant of cross-entropy: the loss \(L = -\log \pi(a|s) \cdot G_t\) increases the log probability of actions in proportion to the return \(G_t\) they earned.
The choice of loss is not arbitrary — it must match the output distribution. Q-values are unbounded real numbers → MSE. Action probabilities are distributions over discrete actions → cross-entropy.
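The DQN case above can be sketched as a plain regression step, assuming illustrative random Q-values and a discount factor of 0.99 (all values here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99

# Batch of 4 transitions: rewards plus Q-values for current and next states
rewards = np.array([1.0, 0.0, -1.0, 0.5])
q_next = rng.normal(size=(4, 2))   # Q(s', a') over 2 actions
q_pred = rng.normal(size=4)        # Q(s, a) for the action actually taken

# Bellman target plays the role of y_true; the network treats it as a constant
target = rewards + gamma * q_next.max(axis=1)

# MSE between prediction and target = mean squared TD error
td_loss = np.mean((target - q_pred) ** 2)
print(td_loss)
```

In a real DQN the target comes from a separate, frozen target network; here random arrays stand in just to show the loss shape.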
Exercise: Implement MSE and binary cross-entropy in NumPy. Verify that cross-entropy is lower when predictions are more confident and correct.
Professor’s hints
- MSE: `np.mean((y_true - y_pred)**2)`. Simple.
- Binary CE: `np.mean(-(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))`.
- If `y_pred` is exactly 0 or 1, `np.log(0)` is `-inf`. Always clip: `np.clip(y_pred, 1e-7, 1 - 1e-7)`.
- Cross-entropy is always ≥ 0. MSE can theoretically be 0 if predictions are perfect.
Common pitfalls
- Taking log of 0: `np.log(0) = -inf` and `-inf * 0 = nan`. Always add a small epsilon or use `np.clip`.
- Using MSE for classification: paired with a sigmoid or softmax output, MSE yields vanishing gradients when the output saturates near 0 or 1, so training stalls; use cross-entropy instead.
- Forgetting the minus sign in CE: Cross-entropy has a leading negative sign because \(\log p < 0\) for \(p \in (0,1)\). Without it, you’d be maximizing the loss.
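The first pitfall is easy to reproduce and fix; the 1e-7 epsilon matches the hint above:

```python
import numpy as np

y_true = np.array([1.0, 0.0])
y_pred = np.array([1.0, 0.0])  # overconfident: exact 0 and 1

# Naive BCE: log(0) -> -inf, and 0 * -inf -> nan
with np.errstate(divide="ignore", invalid="ignore"):
    naive = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Clamped BCE: predictions pushed strictly inside (0, 1) before the log
p = np.clip(y_pred, 1e-7, 1 - 1e-7)
safe = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(naive, safe)  # nan vs. a tiny finite number
```

The `np.errstate` context only silences NumPy's runtime warnings so the `nan` is visible in the output; it does not change the arithmetic.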
Worked solution
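A minimal sketch of the exercise (function names are illustrative; the epsilon clamp follows the hints above):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error for regression targets."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy with predictions clipped away from 0 and 1."""
    y_true = np.asarray(y_true)
    p = np.clip(np.asarray(y_pred), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

labels = np.array([1.0, 0.0, 1.0, 0.0])
confident = np.array([0.95, 0.05, 0.9, 0.1])  # confident and correct
hesitant = np.array([0.6, 0.4, 0.55, 0.45])   # correct but unsure

# Confident, correct predictions should give a lower cross-entropy
print(binary_cross_entropy(labels, confident) < binary_cross_entropy(labels, hesitant))  # True
```

This verifies the claim in the exercise: as correct predictions grow more confident, each \(-\log \hat{p}\) term shrinks toward 0, so the loss falls.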
Extra practice
- Warm-up: Compute MSE by hand for predictions `[2.0, 4.0]` vs true values `[3.0, 3.0]`. Then verify with NumPy.
- Coding: Implement multi-class cross-entropy for a 3-class problem. Given `y_true = [[1,0,0],[0,1,0],[0,0,1]]` (one-hot) and `y_pred = [[0.7,0.2,0.1],[0.1,0.8,0.1],[0.2,0.2,0.6]]`, compute the loss.
- Challenge: The DQN loss is \((r + \gamma \max_{a'} Q(s',a') - Q(s,a))^2\). This looks like MSE with `y_true = r + γ * max Q(s',a')` and `y_pred = Q(s,a)`. Implement this "TD error" loss for a batch of 4 transitions with random Q-values.
- Variant: Huber loss combines MSE and MAE: it is quadratic for small errors and linear for large errors, making it more robust to outliers. Implement it and compare to MSE on predictions `[5.0, 0.1]` vs true `[0.0, 0.0]`.
- Debug: A cross-entropy implementation that takes the log of 0 will produce `nan`. Fix it by clipping predictions to a small epsilon.
- Conceptual: Why does cross-entropy work better than MSE for classification? Consider what happens to the gradient of MSE when the sigmoid output is near 0 or 1 — the gradient becomes very small. Cross-entropy with sigmoid has a much cleaner gradient. Explain this in one paragraph.
- Recall: Write the MSE and binary cross-entropy formulas from memory. State the task type (regression or classification) each is used for. Name the RL algorithm that uses each.