Learning objectives
- Implement mean squared error (MSE) and binary cross-entropy loss in NumPy.
- Explain when to use each loss function and connect them to their RL equivalents.
- Identify the numerical stability issue in cross-entropy and fix it with an epsilon clamp.
Concept and real-world motivation
A neural network learns by minimizing a loss function — a scalar that measures how wrong the current predictions are. The loss function is the signal that backpropagation differentiates to compute gradients. Choose the wrong loss for your task, and the network will optimize for the wrong thing.
Mean Squared Error (MSE) is the standard loss for regression tasks — when the output is a continuous number: \[L_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]
Cross-Entropy (CE) is the standard loss for classification tasks — when the output is a probability distribution: \[L_{\text{CE}} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c} y_{ic} \log(\hat{p}_{ic})\]
Binary Cross-Entropy (BCE) is a special case for binary (two-class) problems: \[L_{\text{BCE}} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]\]
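As a quick illustration, the MSE and BCE formulas above can be evaluated directly in NumPy (the sample values below are made up for demonstration):

```python
import numpy as np

# Regression example: MSE is the mean of squared residuals
y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5, 0.0, 2.1])
mse = np.mean((y_true - y_pred) ** 2)

# Binary classification example: labels in {0, 1}, predicted probabilities in (0, 1)
labels = np.array([1.0, 0.0, 1.0])
probs = np.array([0.9, 0.2, 0.6])
bce = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

print(mse, bce)
```

Note that each BCE term only "reads" the probability assigned to the true class: for a label of 1 the first term is active, for a label of 0 the second.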
In RL:
- DQN uses MSE between the predicted Q-value and the Bellman target: \(L = (r + \gamma \max_{a'} Q(s', a') - Q(s, a))^2\). This is regression — predicting a scalar value.
- Policy gradient methods use a variant of cross-entropy: the loss \(L = -\log \pi(a|s) \cdot G_t\) increases the log probability of actions in proportion to the return \(G_t\) they earned.
The choice of loss is not arbitrary — it must match the output distribution. Q-values are unbounded real numbers → MSE. Action probabilities are distributions over discrete actions → cross-entropy.
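The DQN case above can be sketched as a plain regression step, assuming illustrative random Q-values and a discount factor of 0.99 (all values here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99

# Batch of 4 transitions: rewards plus Q-values for current and next states
rewards = np.array([1.0, 0.0, -1.0, 0.5])
q_next = rng.normal(size=(4, 2))   # Q(s', a') over 2 actions
q_pred = rng.normal(size=4)        # Q(s, a) for the action actually taken

# Bellman target plays the role of y_true; the network treats it as a constant
target = rewards + gamma * q_next.max(axis=1)

# MSE between prediction and target = mean squared TD error
td_loss = np.mean((target - q_pred) ** 2)
print(td_loss)
```

In a real DQN the target comes from a separate, frozen target network; here random arrays stand in just to show the loss shape.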
Exercise: Implement MSE and binary cross-entropy in NumPy. Verify that cross-entropy is lower when predictions are more confident and correct.
Professor’s hints
- MSE: `np.mean((y_true - y_pred)**2)`. Simple.
- Binary CE: `np.mean(-(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))`.
- If `y_pred` is exactly 0 or 1, `np.log(0)` is `-inf`. Always clip: `np.clip(y_pred, 1e-7, 1 - 1e-7)`.
- Cross-entropy is always ≥ 0. MSE can theoretically be 0 if predictions are perfect.
Common pitfalls
- Taking log of 0: `np.log(0) = -inf` and `-inf * 0 = nan`. Always add a small epsilon or use `np.clip`.
- Using MSE for classification: paired with a sigmoid or softmax output, MSE yields vanishing gradients when the output saturates near 0 or 1, so training stalls; use cross-entropy instead.
- Forgetting the minus sign in CE: Cross-entropy has a leading negative sign because \(\log p < 0\) for \(p \in (0,1)\). Without it, you’d be maximizing the loss.
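The first pitfall is easy to reproduce and fix; the 1e-7 epsilon matches the hint above:

```python
import numpy as np

y_true = np.array([1.0, 0.0])
y_pred = np.array([1.0, 0.0])  # overconfident: exact 0 and 1

# Naive BCE: log(0) -> -inf, and 0 * -inf -> nan
with np.errstate(divide="ignore", invalid="ignore"):
    naive = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Clamped BCE: predictions pushed strictly inside (0, 1) before the log
p = np.clip(y_pred, 1e-7, 1 - 1e-7)
safe = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(naive, safe)  # nan vs. a tiny finite number
```

The `np.errstate` context only silences NumPy's runtime warnings so the `nan` is visible in the output; it does not change the arithmetic.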
Worked solution
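A minimal sketch of the exercise (function names are illustrative; the epsilon clamp follows the hints above):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error for regression targets."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy with predictions clipped away from 0 and 1."""
    y_true = np.asarray(y_true)
    p = np.clip(np.asarray(y_pred), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

labels = np.array([1.0, 0.0, 1.0, 0.0])
confident = np.array([0.95, 0.05, 0.9, 0.1])  # confident and correct
hesitant = np.array([0.6, 0.4, 0.55, 0.45])   # correct but unsure

# Confident, correct predictions should give a lower cross-entropy
print(binary_cross_entropy(labels, confident) < binary_cross_entropy(labels, hesitant))  # True
```

This verifies the claim in the exercise: as correct predictions grow more confident, each \(-\log \hat{p}\) term shrinks toward 0, so the loss falls.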
Extra practice
- Warm-up: Compute MSE by hand for predictions `[2.0, 4.0]` vs true values `[3.0, 3.0]`. Then verify with NumPy.
- Coding: Implement multi-class cross-entropy for a 3-class problem. Given `y_true = [[1,0,0],[0,1,0],[0,0,1]]` (one-hot) and `y_pred = [[0.7,0.2,0.1],[0.1,0.8,0.1],[0.2,0.2,0.6]]`, compute the loss.
- Challenge: The DQN loss is \((r + \gamma \max_{a'} Q(s',a') - Q(s,a))^2\). This looks like MSE with `y_true = r + γ * max Q(s',a')` and `y_pred = Q(s,a)`. Implement this "TD error" loss for a batch of 4 transitions with random Q-values.
- Variant: Huber loss combines MSE and MAE: it is quadratic for small errors and linear for large errors, making it more robust to outliers. Implement it and compare to MSE on predictions `[5.0, 0.1]` vs true `[0.0, 0.0]`.
- Debug: A cross-entropy implementation that takes the log of 0 will produce `nan`. Fix it by clipping predictions to a small epsilon.
- Conceptual: Why does cross-entropy work better than MSE for classification? Consider what happens to the gradient of MSE when the sigmoid output is near 0 or 1 — the gradient becomes very small. Cross-entropy with sigmoid has a much cleaner gradient. Explain this in one paragraph.
- Recall: Write the MSE and binary cross-entropy formulas from memory. State the task type (regression or classification) each is used for. Name the RL algorithm that uses each.