Learning objectives
- Derive the cross-entropy loss for binary classification and explain why it is preferred over MSE for classifiers.
- Compute the gradient of cross-entropy with respect to \(w\) in matrix form.
- Implement logistic regression training from scratch in NumPy and observe the loss decreasing over iterations.
Concept and real-world motivation
Logistic regression combines three components you already know: (1) the linear model \(z = Xw + b\), (2) the sigmoid \(p = \sigma(z)\), and (3) a new loss function designed for probabilities. Using MSE for classification would make the loss surface very flat near 0 and 1, making gradients vanishingly small. The right loss for probabilities is cross-entropy:
\[L = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i) \right]\]
This loss is large when the model is confident and wrong, and near zero when the model is correct. The elegant fact about logistic regression is that its gradient has a beautifully simple form:
\[\nabla_w L = \frac{1}{n} X^T (\hat{p} - y)\]
This is the same structure as linear regression’s gradient — the only difference is the “residual” is \(\hat{p} - y\) instead of \(\hat{y} - y\).
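The compact gradient formula is worth a sanity check before trusting it in a training loop. Below is a sketch that compares \(\frac{1}{n} X^T (\hat{p} - y)\) against a central finite-difference estimate; the random dataset, the omitted bias, and the step size `h` are illustrative assumptions, not part of the lesson's setup.

```python
import numpy as np

# Sketch: verify grad_w = (1/n) X^T (p - y) against finite differences.
# The random dataset and step size h are illustrative assumptions.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
w = rng.normal(size=3)

def loss(w):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigmoid of Xw (bias omitted)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

p = 1.0 / (1.0 + np.exp(-(X @ w)))
grad_analytic = X.T @ (p - y) / len(y)   # (1/n) X^T (p - y)

h = 1e-6
grad_numeric = np.array([
    (loss(w + h * np.eye(3)[j]) - loss(w - h * np.eye(3)[j])) / (2 * h)
    for j in range(3)
])
print(np.max(np.abs(grad_analytic - grad_numeric)))  # difference should be tiny
```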
RL connection: The softmax policy in RL is logistic regression generalised to multiple actions. The policy network computes \(z = W^T s\) (one value per action), passes it through softmax, and outputs action probabilities \(\pi(a \mid s)\). The REINFORCE policy gradient objective is the expected log probability of chosen actions — cross-entropy in disguise. Mastering logistic regression here means policy gradient is just the same math with a different dataset.
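To make the connection concrete, here is a minimal sketch of a softmax policy. The state dimension (4) and number of actions (3) are illustrative assumptions; the structure is exactly the logistic-regression forward pass with one score per action.

```python
import numpy as np

# Sketch of a softmax policy: logistic regression generalised to 3 actions.
# The state dimension (4) and action count (3) are illustrative choices.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))        # one weight column per action
s = rng.normal(size=4)             # a single state vector

z = W.T @ s                        # one score per action, shape (3,)
z = z - z.max()                    # subtract max for numerical stability
pi = np.exp(z) / np.exp(z).sum()   # action probabilities pi(a|s)
print(pi, pi.sum())                # probabilities are positive and sum to 1
```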
Illustration: The cross-entropy loss drops as logistic regression training progresses.
Before implementing training, verify the loss formula by hand:
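For instance (the numbers here are an illustrative check, separate from the exercise that follows): with \(y = [1, 1, 0]\) and \(\hat{p} = [0.8, 0.6, 0.3]\), the loss is \(-\frac{1}{3}\left[\ln 0.8 + \ln 0.6 + \ln 0.7\right] \approx 0.364\). A quick NumPy check confirms the hand computation:

```python
import numpy as np

# Hand check of the cross-entropy formula on three illustrative examples.
y = np.array([1.0, 1.0, 0.0])
p = np.array([0.8, 0.6, 0.3])
by_hand = -(np.log(0.8) + np.log(0.6) + np.log(1 - 0.3)) / 3
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(loss)   # ≈ 0.3635, matching the by-hand sum
```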
Exercise: Implement full logistic regression training on a toy dataset. Forward pass, cross-entropy loss, gradient computation, and weight update — all in NumPy.
Professor’s hints
- Forward: `z = X @ w + b` (shape `(n,)`), then `p = sigmoid(z)` (shape `(n,)`).
- Cross-entropy: `loss = -np.mean(y * np.log(p) + (1-y) * np.log(1-p))`. Add a small epsilon (e.g. `1e-9`) inside `log` if you hit numerical warnings.
- Gradient w.r.t. `w`: `(1/n) * X.T @ (p - y)`, shape `(d,)`.
- Gradient w.r.t. `b`: `(1/n) * np.sum(p - y)`, a scalar.
- Update: `w = w - lr * grad_w`. Note: subtract (we minimise the loss), not add.
Common pitfalls
- Adding the gradient instead of subtracting: `w = w + lr * grad_w` is gradient ascent. It maximises the loss, which is the wrong direction for a classifier.
- Taking the log of zero: When the model is very confident and wrong (\(\hat{p} \approx 0\) but \(y=1\)), \(\log(0) = -\infty\). Add a small epsilon: `np.log(p + 1e-9)`.
- Forgetting the \((1-y)\log(1-\hat{p})\) term: The cross-entropy formula has two terms, one for positive examples (\(y=1\)) and one for negative (\(y=0\)). Missing either term silently breaks the loss.
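The log-of-zero pitfall is easy to demonstrate directly. A small sketch (the values are illustrative): a single confidently wrong prediction makes the naive loss blow up, while the epsilon version stays finite.

```python
import numpy as np

# Demonstrating the log-of-zero pitfall: a confident wrong prediction.
y = np.array([1.0])
p = np.array([0.0])   # model says "definitely class 0", but the truth is 1

with np.errstate(divide='ignore', invalid='ignore'):
    naive = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))  # infinite

eps = 1e-9
safe = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
print(naive, safe)    # naive is inf; safe equals -log(1e-9), about 20.7
```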
Worked solution
Complete logistic regression from scratch:
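A minimal sketch of the full training loop follows; the toy dataset (two Gaussian blobs), the seed, and the learning rate are illustrative choices, and the exact loss values depend on them.

```python
import numpy as np

# Sketch of the worked solution; the toy dataset, seed, and learning
# rate are illustrative choices.
rng = np.random.default_rng(0)

# Toy dataset: two 2-D Gaussian blobs, 50 points per class.
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
n, d = X.shape

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(d), 0.0, 0.1
losses = []

for step in range(101):
    z = X @ w + b                     # linear model, shape (n,)
    p = sigmoid(z)                    # probabilities, shape (n,)
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    losses.append(loss)
    grad_w = X.T @ (p - y) / n        # gradient w.r.t. w, shape (d,)
    grad_b = np.sum(p - y) / n        # gradient w.r.t. b, scalar
    w -= lr * grad_w                  # subtract: gradient descent
    b -= lr * grad_b
    if step % 20 == 0:
        print(f"step {step:3d}  loss {loss:.4f}")
```

With zero-initialised weights, every \(\hat{p}_i\) starts at 0.5, so the first loss is \(\ln 2 \approx 0.6931\) regardless of the data.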
Expected output: loss decreases from ~0.6931 at step 0 to ~0.20 by step 80.
Extra practice
- Warm-up: For labels `y = [1, 0]` and predicted probabilities `p = [0.9, 0.1]`, compute the cross-entropy by hand. Then try `p = [0.5, 0.5]`. Which has higher loss and why?
- Coding: Use `sklearn.linear_model.LogisticRegression` on the same toy dataset. Compare `model.coef_` and `model.intercept_` to the weights your NumPy implementation converges to after 500 steps.
- Challenge: Extend the implementation to multi-class classification using softmax instead of sigmoid. For 3 classes, replace \(w \in \mathbb{R}^d\) with \(W \in \mathbb{R}^{d \times 3}\) and compute \(\text{softmax}(XW)\). Use the multi-class cross-entropy loss.
- Variant: Try learning rates `lr = 0.01, 0.1, 1.0`. Plot the loss curve for each. What happens with `lr = 1.0`: does it converge, oscillate, or diverge?
- Debug: The update `w = w + lr * grad_w` adds the gradient instead of subtracting it, causing the loss to increase instead of decrease. Fix it.
- Conceptual: Logistic regression is a linear model — the decision boundary is always a straight line (or hyperplane). When would this be a problem? Name a dataset where logistic regression would fail and a model that could succeed.
- Recall: Write the cross-entropy loss formula for binary classification from memory. Then write the gradient \(\nabla_w L\) in matrix form.