Learning objectives

  • Derive the cross-entropy loss for binary classification and explain why it is preferred over MSE for classifiers.
  • Compute the gradient of cross-entropy with respect to \(w\) in matrix form.
  • Implement logistic regression training from scratch in NumPy and observe the loss decreasing over iterations.

Concept and real-world motivation

Logistic regression combines three components you already know: (1) the linear model \(z = Xw + b\), (2) the sigmoid \(p = \sigma(z)\), and (3) a new loss function designed for probabilities. MSE is a poor fit for classification: wherever the sigmoid saturates near 0 or 1, the loss surface becomes very flat, gradients become vanishingly small, and the surface is no longer convex. The right loss for probabilities is cross-entropy:

\[L = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i) \right]\]
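As a quick sanity check of the formula, here is a short sketch on hand-picked probabilities (the values are illustrative, not from the exercise dataset):

```python
import numpy as np

def cross_entropy(y, p):
    """Binary cross-entropy, averaged over samples."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0])
p_good = np.array([0.9, 0.1])   # confident and correct
p_bad  = np.array([0.1, 0.9])   # confident and wrong

print(cross_entropy(y, p_good))  # ~0.105
print(cross_entropy(y, p_bad))   # ~2.303
```

Note how the confidently wrong predictions are penalised roughly 20 times harder than the confidently correct ones.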

This loss is large when the model is confident and wrong, and near zero when the model is confident and correct. The elegant fact about logistic regression is that its gradient has a beautifully simple form:

\[\nabla_w L = \frac{1}{n} X^T (\hat{p} - y)\]

This is the same structure as linear regression’s gradient — the only difference is the “residual” is \(\hat{p} - y\) instead of \(\hat{y} - y\).
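You can verify this gradient formula numerically with a finite-difference check (a sketch on random data; the bias term is omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss_fn(w, X, y):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1., 0., 1., 1., 0.])
w = rng.normal(size=3)

# Analytic gradient: (1/n) X^T (p - y)
p = sigmoid(X @ w)
analytic = X.T @ (p - y) / len(y)

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    dw = np.zeros_like(w)
    dw[j] = eps
    numeric[j] = (loss_fn(w + dw, X, y) - loss_fn(w - dw, X, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny, on the order of 1e-9 or less
```

If the two gradients disagree, the bug is almost always in the analytic formula, which is exactly why this check is worth running once for any hand-derived gradient.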

RL connection: The softmax policy in RL is logistic regression generalised to multiple actions. The policy network computes \(z = W^T s\) (one score per action), passes it through softmax, and outputs action probabilities \(\pi(a \mid s)\). The REINFORCE policy gradient takes the gradient of the log probability of the chosen action, weighted by the return: a return-weighted cross-entropy in disguise. Master logistic regression here, and policy gradient becomes the same math applied to a different dataset.
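To make that connection concrete, here is a minimal sketch of a softmax policy over 3 actions (the state features and weights are made up for illustration):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)        # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

s = np.array([0.2, -1.0, 0.5, 0.1])                # state features, d = 4
W = np.random.default_rng(1).normal(size=(4, 3))   # one column per action

z = W.T @ s          # one score per action, analogous to Xw per class
pi = softmax(z)      # action probabilities pi(a | s)
print(pi, pi.sum())  # valid distribution: entries positive, sum is 1
```

With 2 actions, softmax reduces exactly to the sigmoid, which is why binary logistic regression is the special case.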

Illustration: The cross-entropy loss drops as logistic regression training progresses.

Before implementing training, verify the loss formula by hand:

Try it — edit and run (Shift+Enter)

Exercise: Implement full logistic regression training on a toy dataset. Forward pass, cross-entropy loss, gradient computation, and weight update — all in NumPy.

Try it — edit and run (Shift+Enter)

Professor’s hints

  • Forward: z = X @ w + b (shape (n,)), then p = sigmoid(z) (shape (n,)).
  • Cross-entropy: loss = -np.mean(y * np.log(p) + (1-y) * np.log(1-p)) — add a small epsilon (e.g. 1e-9) inside log if you hit numerical warnings.
  • Gradient w.r.t. w: (1/n) * X.T @ (p - y) — shape (d,).
  • Gradient w.r.t. b: (1/n) * np.sum(p - y) — scalar.
  • Update: w = w - lr * grad_w. Note: subtract (we minimise loss), not add.

Common pitfalls

  • Adding the gradient instead of subtracting: w = w + lr * grad_w is gradient ascent — it maximises the loss, which is the wrong direction for a classifier.
  • Taking log of zero: When the model is very confident and wrong (\(\hat{p} \approx 0\) but \(y=1\)), \(\log(0) = -\infty\). Add a small epsilon: np.log(p + 1e-9).
  • Forgetting to include the \((1-y)\log(1-\hat{p})\) term: The cross-entropy formula has two terms — one for positive examples (\(y=1\)) and one for negative (\(y=0\)). Missing either term silently breaks the loss.
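The log-of-zero pitfall is easy to reproduce and fix in a few lines (a sketch; np.clip is an alternative to adding an epsilon inside the log):

```python
import numpy as np

y = np.array([1.0])
p = np.array([0.0])   # model is maximally confident and wrong

# Naive loss: log(0) produces -inf, so the loss blows up to inf
with np.errstate(divide='ignore'):
    naive = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Clipping p away from 0 and 1 keeps the loss large but finite
p_safe = np.clip(p, 1e-9, 1 - 1e-9)
safe = -np.mean(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe))

print(naive, safe)   # inf vs ~20.7
```

The clipped loss still heavily punishes the wrong prediction, but the gradient stays finite so training can recover.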

Worked solution

Complete logistic regression from scratch:

import numpy as np

np.random.seed(42)
X = np.array([[1.,2.,0.5],[2.,1.,1.5],[0.5,3.,0.2],[3.,0.5,2.],[1.5,1.5,1.]])
y = np.array([1,1,0,1,0], dtype=float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w  = np.zeros(3)
b  = 0.0
lr = 0.1
n  = len(y)

for step in range(100):
    # Forward
    z    = X @ w + b
    p    = sigmoid(z)
    # Loss
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    # Gradients
    grad_w = (1/n) * X.T @ (p - y)
    grad_b = (1/n) * np.sum(p - y)
    # Update
    w = w - lr * grad_w
    b = b - lr * grad_b
    if step % 20 == 0:
        print(f'step {step:3d}: loss={loss:.4f}')

Expected output: loss decreases from ~0.6931 at step 0 to ~0.20 by step 80.

Extra practice

  1. Warm-up: For labels y=[1, 0] and predicted probabilities p=[0.9, 0.1], compute the cross-entropy by hand. Then try p=[0.5, 0.5]. Which has higher loss and why?
  2. Coding: Use sklearn.linear_model.LogisticRegression on the same toy dataset. Compare model.coef_ and model.intercept_ to the weights your NumPy implementation converges to after 500 steps.
  3. Challenge: Extend the implementation to multi-class classification using softmax instead of sigmoid. For 3 classes, replace \(w \in \mathbb{R}^d\) with \(W \in \mathbb{R}^{d \times 3}\) and compute \(\text{softmax}(XW)\) row-wise. Use the categorical cross-entropy over the 3 classes.
  4. Variant: Try learning rates lr = 0.01, 0.1, 1.0. Plot the loss curve for each. What happens with lr=1.0 — does it converge, oscillate, or diverge?
  5. Debug: The gradient update below adds the gradient instead of subtracting it, causing the loss to increase instead of decrease. Fix it.
Try it — edit and run (Shift+Enter)
  6. Conceptual: Logistic regression is a linear model — the decision boundary is always a straight line (or hyperplane). When would this be a problem? Name a dataset where logistic regression would fail and a model that could succeed.
  7. Recall: Write the cross-entropy loss formula for binary classification from memory. Then write the gradient \(\nabla_w L\) in matrix form.