Learning objectives
- Write the matrix form of linear regression \(\hat{y} = Xw + b\) and identify the shape of each term.
- Compute predictions, MSE, and the gradient \(\nabla_w MSE\) using NumPy matrix operations.
- Perform one vectorized gradient descent step on a multi-feature dataset.
Concept and real-world motivation
Real datasets have many input features, not just one. A house’s price depends on size, number of rooms, age, location score, and more. When we have \(d\) features, the linear model becomes:
\[\hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b = w^T x + b\]
For \(n\) samples at once, we pack all inputs into a matrix \(X \in \mathbb{R}^{n \times d}\) and all weights into a vector \(w \in \mathbb{R}^d\):
\[\hat{y} = Xw + b\]
The gradient of MSE with respect to \(w\) in matrix form is:
\[\nabla_w MSE = \frac{-2}{n} X^T (y - \hat{y})\]
This single formula replaces the per-feature loop: one matrix multiply computes the gradient for all \(d\) features simultaneously. This is vectorization, and it is central to how modern ML code achieves speed.
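As a quick sanity check, the gradient formula above can be evaluated directly in NumPy and compared against the per-feature loop it replaces. The data values here are illustrative, not from the exercise:

```python
import numpy as np

# Tiny illustrative dataset: n = 4 samples, d = 3 features.
X = np.array([[1.0, 2.0, 0.5],
              [0.0, 1.0, 1.5],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([3.0, 2.5, 4.0, 3.5])
w = np.zeros(3)
b = 0.0

y_hat = X @ w + b                              # shape (4,): one prediction per sample
residuals = y - y_hat                          # shape (4,)
mse = np.mean(residuals ** 2)                  # scalar
grad_w = (-2.0 / len(y)) * (X.T @ residuals)   # shape (3,): the matrix formula

# Per-feature loop for comparison — same numbers, d separate dot products:
grad_loop = np.array([(-2.0 / len(y)) * (X[:, j] @ residuals)
                      for j in range(3)])
assert np.allclose(grad_w, grad_loop)
```

Both versions agree entry for entry; the matrix form simply performs all \(d\) dot products in one call.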
RL connection: RL agents often operate in continuous state spaces where the state \(s\) is a vector of measurements: position \((x, y)\), velocity \((\dot{x}, \dot{y})\), angle \(\theta\), angular velocity \(\dot\theta\), and more. Linear value function approximation computes \(V(s) = w^T s\) — which is exactly \(\hat{y} = Xw\) (one sample at a time). Even when we move to neural networks, the first and last layers are matrix multiplications of this exact form.
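The correspondence can be sketched in a few lines. The state layout and weight values below are hypothetical, loosely modelled on a cart-pole task:

```python
import numpy as np

# Hypothetical cart-pole-style state: [x, x_dot, theta, theta_dot]
s = np.array([0.1, -0.5, 0.05, 0.2])
w = np.array([0.0, -1.0, -10.0, -2.0])  # illustrative weights, not learned

V = w @ s          # linear value estimate V(s) = w^T s, one sample at a time

# A batch of states stacked into a matrix is exactly the y_hat = X w form:
S = np.stack([s, 2 * s])   # shape (2, 4): two states
V_batch = S @ w            # shape (2,): one value per state
```

Swapping in a batch of states turns the single dot product into the same matrix-vector multiply used for regression predictions.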
Illustration: Multiple features all contribute to a single prediction through a weighted sum.
Let us first verify matrix multiply by hand for a tiny 2×2 case:
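One possible hand check, with values chosen arbitrarily — each output entry is a row-times-column dot product:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# By hand:
# (A @ B)[0, 0] = 1*5 + 2*7 = 19    (A @ B)[0, 1] = 1*6 + 2*8 = 22
# (A @ B)[1, 0] = 3*5 + 4*7 = 43    (A @ B)[1, 1] = 3*6 + 4*8 = 50
C = A @ B
print(C)   # [[19 22]
           #  [43 50]]
```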
Exercise: Given a 5-sample, 3-feature dataset, compute predictions, MSE, the gradient \(\nabla_w MSE\), and perform one gradient step — all using NumPy matrix operations.
Professor’s hints
- `X @ w` performs the matrix-vector multiply — NumPy handles all \(n\) samples at once.
- `X.T` transposes `X` — shape goes from `(n, d)` to `(d, n)`.
- `X.T @ (y - y_hat)` is a dot product of each feature column with the residuals — shape `(d,)`, one gradient entry per weight.
- Check shapes at every step: `y_hat.shape` should be `(5,)`, `grad_w.shape` should be `(3,)`.
- The learning rate `1e-7` is small because the features (sizes in sq metres, prices in dollars) have very different scales.
Common pitfalls
- Shape mismatch: If `w` has shape `(3, 1)` instead of `(3,)`, `X @ w` returns shape `(5, 1)` instead of `(5,)`, and subsequent broadcasting silently produces wrong shapes. Always use 1-D weight vectors: `w = np.array([...])`, not `w = np.array([[...]])`.
- Forgetting to transpose: `X.T @ residuals` (shape `(d,)`) is correct. `X @ residuals` is wrong — its shapes `(n, d)` and `(n,)` only align when `n = d`, and even then it mixes feature columns together rather than producing one gradient entry per weight.
- Using a loop instead of matrix ops: Writing `for j in range(d): grad_w[j] = ...` works but is typically 10–100× slower than `X.T @ residuals`. Always prefer matrix operations.
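To see the speed difference the last pitfall refers to, one can time a pure-Python accumulation loop against the single matrix multiply. The array sizes here are arbitrary, and the measured ratio will vary by machine:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 20
X = rng.normal(size=(n, d))
residuals = rng.normal(size=n)

# Pure-Python double loop: one multiply-add per element.
t0 = time.perf_counter()
grad_loop = np.empty(d)
for j in range(d):
    total = 0.0
    for i in range(n):
        total += X[i, j] * residuals[i]
    grad_loop[j] = (-2.0 / n) * total
t_loop = time.perf_counter() - t0

# Vectorized: one matrix multiply does all d dot products at once.
t0 = time.perf_counter()
grad_vec = (-2.0 / n) * (X.T @ residuals)
t_vec = time.perf_counter() - t0

assert np.allclose(grad_loop, grad_vec)
print(f"loop: {t_loop:.4f}s   vectorized: {t_vec:.6f}s")
```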
Worked solution
Fully vectorized multiple regression step:
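The worked-solution code block appears to have been lost in extraction. Below is a sketch consistent with the exercise's description (5 samples, 3 features, learning rate `1e-7`); sample 1 matches the warm-up exercise, while the remaining rows of `X` and the targets `y` are assumed here for illustration:

```python
import numpy as np

# Dataset: 5 houses, 3 features (size in sq m, rooms, age in years).
# Row 1 matches the warm-up; the other values are illustrative assumptions.
X = np.array([[ 80.0, 3.0, 10.0],
              [120.0, 4.0,  5.0],
              [ 60.0, 2.0, 30.0],
              [150.0, 5.0,  2.0],
              [ 95.0, 3.0, 15.0]])
y = np.array([250_000.0, 340_000.0, 180_000.0, 420_000.0, 280_000.0])  # prices ($)

n, d = X.shape
w = np.zeros(d)
b = 0.0
lr = 1e-7   # small, because the raw features have very different scales

# One vectorized gradient descent step:
y_hat = X @ w + b                        # predictions, shape (5,)
residuals = y - y_hat                    # shape (5,)
mse = np.mean(residuals ** 2)            # scalar loss
grad_w = (-2.0 / n) * (X.T @ residuals)  # shape (3,), one entry per weight
grad_b = (-2.0 / n) * residuals.sum()    # scalar

w = w - lr * grad_w
b = b - lr * grad_b

print("MSE before step:", mse)
print("updated w:", w, "updated b:", b)
```

After the update, recomputing the loss with the new `w` and `b` should give a slightly smaller MSE, confirming the step moved downhill.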
Extra practice
- Warm-up: Manually compute \(\hat{y}\) for sample 1 (size=80, rooms=3, age=10) using weights \(w=[2000, 15000, -500], b=50000\). Show each multiplication. Verify with NumPy.
- Coding: Use `sklearn.linear_model.LinearRegression` to fit the same `X`, `y` data. Compare `model.coef_` to the weights your gradient descent converges to after 10,000 steps.
- Challenge: Add a fourth feature to `X`: distance from city centre (in km) = [5, 2, 8, 1, 4]. Re-run gradient descent. Does adding this feature lower the final MSE?
- Variant: Normalise each feature column in X to have mean 0 and std 1 before training. How does this affect the learning rate you need and the convergence speed?
- Debug: The code below has a shape mismatch bug — `w` is defined as a column vector, causing `X @ w` to have shape `(5, 1)` instead of `(5,)`. Fix it.
- Conceptual: Why is the matrix form \(\hat{y} = Xw + b\) more useful than writing a separate equation for each sample? What would happen if \(n = 1{,}000{,}000\)?
- Recall: Write the matrix gradient formula \(\nabla_w MSE\) from memory. State the shape of \(X\), \(w\), and the gradient.
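The snippet referenced by the Debug exercise above appears to be missing. As a stand-in, here is a minimal sketch with illustrative data that reproduces the described bug, followed by the fix:

```python
import numpy as np

# Buggy version matching the Debug exercise's description (data is illustrative):
X = np.arange(15.0).reshape(5, 3)          # 5 samples, 3 features
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

w_bad = np.array([[0.1], [0.2], [0.3]])    # BUG: column vector, shape (3, 1)
y_hat_bad = X @ w_bad                      # shape (5, 1), not (5,)
residuals_bad = y - y_hat_bad              # broadcasts to (5, 5) — silently wrong

# Fixed version: define w as a 1-D vector.
w = np.array([0.1, 0.2, 0.3])              # shape (3,)
y_hat = X @ w                              # shape (5,)
residuals = y - y_hat                      # shape (5,) — as intended

print(y_hat_bad.shape, residuals_bad.shape)  # (5, 1) (5, 5)
print(y_hat.shape, residuals.shape)          # (5,) (5,)
```

The bug is silent: no error is raised, but `(5,) - (5, 1)` broadcasts into a `(5, 5)` matrix, so the loss and gradient computed from it are meaningless.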