Short drills for the full ML Foundations section. Work through these after completing pages 1–13 to consolidate your understanding before the review.

Recall (R) — State definitions and rules

R1. What is the difference between supervised and unsupervised learning? Give one example of each.

R1 answer

Supervised learning: Each training example has a label \(y\). The model learns a mapping \(f: X \to y\). Example: predicting house price from features (regression) or classifying emails as spam/not-spam (classification).

Unsupervised learning: No labels. The algorithm finds structure in the data itself. Example: K-Means clustering of customer purchase history to identify segments.

R2. What does MSE stand for? Write the formula.

R2 answer
Mean Squared Error. For N predictions: \[\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2\] It penalizes large errors more than small ones (due to squaring) and is always non-negative.

R3. Why do we use train/test split instead of evaluating on training data?

R3 answer
Evaluating on training data measures memorization, not generalization. A model that overfits achieves near-zero training error but fails on new data. A held-out test set provides an unbiased estimate of how the model performs on unseen examples — the quantity we actually care about in practice.
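The split itself is mechanical — shuffle, then hold out a slice. A minimal sketch in plain NumPy (the 80/20 ratio, seed, and synthetic data are illustrative):

```python
import numpy as np

# Synthetic data stands in for a real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

idx = rng.permutation(len(X))      # shuffle so the split is random
split = int(0.8 * len(X))          # hold out the last 20% for testing
train_idx, test_idx = idx[:split], idx[split:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```

The key property is that the two index sets are disjoint: no test example ever influences training.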

R4. What is the connection between Markov decision processes (reinforcement learning) and supervised learning?

R4 answer
In RL, the value function \(V(s)\) and action-value function \(Q(s,a)\) are learned from experience — effectively a supervised regression problem where targets are bootstrapped returns. The policy \(\pi(a|s)\) can be viewed as a classifier that maps states to action distributions. Both use gradient descent to minimize a loss, the same optimization algorithm that drives supervised learning.
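The "bootstrapped regression" view can be made concrete with a toy tabular Q-learning update, where the TD target plays the role of a supervised label (the state/action counts, transition, and reward here are made-up numbers, not from the source):

```python
import numpy as np

gamma, alpha = 0.9, 0.1
Q = np.zeros((2, 2))              # 2 states x 2 actions (illustrative)
s, a, r, s_next = 0, 1, 1.0, 1    # one observed transition

# The bootstrapped return is the regression "label" for Q(s, a) ...
td_target = r + gamma * Q[s_next].max()
# ... and the update is a gradient step on the squared error (td_target - Q[s, a])^2.
Q[s, a] += alpha * (td_target - Q[s, a])
```

Structurally this is the same move as fitting a regressor to (input, target) pairs — the twist is that the target itself depends on the current estimates.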

R5. State the update rule for gradient descent.

R5 answer
\[w \leftarrow w - \alpha \frac{\partial \mathcal{L}}{\partial w}\] where \(\alpha\) is the learning rate and \(\frac{\partial \mathcal{L}}{\partial w}\) is the gradient of the loss with respect to the parameter \(w\). Move in the opposite direction of the gradient to reduce loss.

Compute (C) — Numerical exercises

C1. Compute MSE for predictions \([2, 4, 6]\) and true values \([2.5, 3.5, 6.5]\).

C1 answer

Errors: \(-0.5, +0.5, -0.5\). Squared: \(0.25, 0.25, 0.25\). Mean = \(0.25\).

import numpy as np
preds = np.array([2.0, 4.0, 6.0])
true = np.array([2.5, 3.5, 6.5])
mse = np.mean((preds - true)**2)  # 0.25

C2. For logistic regression with \(z = 1.5\), compute \(\sigma(z) = \frac{1}{1 + e^{-z}}\).

C2 answer

\(\sigma(1.5) = \frac{1}{1 + e^{-1.5}} = \frac{1}{1 + 0.2231} \approx 0.8176\).

import numpy as np
sigma = 1 / (1 + np.exp(-1.5))  # 0.8176

C3. For TP=4, FP=1, FN=2: compute precision and recall.

C3 answer

Precision = \(\frac{4}{4+1} = 0.80\). Recall = \(\frac{4}{4+2} \approx 0.6667\).

TP, FP, FN = 4, 1, 2
precision = TP / (TP + FP)  # 0.80
recall    = TP / (TP + FN)  # 0.6667

C4. Gradient step: \(w = 3\), gradient \(= 0.5\), learning rate \(\alpha = 0.1\). Compute \(w_{\text{new}}\).

C4 answer
\(w_{\text{new}} = w - \alpha \cdot \nabla = 3 - 0.1 \times 0.5 = 3 - 0.05 = 2.95\).
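Like the other compute drills, this one fits in a one-line cell (values taken from the problem):

```python
w, grad, alpha = 3.0, 0.5, 0.1
w_new = w - alpha * grad  # 2.95
```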

C5. Compute entropy in bits for a dataset with 3 positives and 3 negatives.

C5 answer

\(H = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = -0.5 \times (-1) - 0.5 \times (-1) = 1.0\) bit. Maximum entropy for a binary distribution.

import numpy as np
p = 0.5  # fraction of positive examples
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)  # 1.0

Code (K) — Implementation

K1. Write linear_regression_predict(X, w, b) that returns \(Xw + b\).

K1 answer
import numpy as np

def linear_regression_predict(X, w, b):
    return X @ w + b

# Example: X = [[1, 2], [3, 4], [5, 6]], w = [0.5, 0.5], b = 1
# X @ w = [1.5, 3.5, 5.5]; adding b gives [2.5, 4.5, 6.5]

K2. Write accuracy(y_true, y_pred) that returns the fraction of correct predictions.

K2 answer
import numpy as np

def accuracy(y_true, y_pred):
    return np.mean(y_true == y_pred)
# np.mean of a boolean array = fraction of True values

Debug (D) — Find and fix the bug

D1. The cross-entropy loss uses log(1 - p) for positive examples. Fix it.

D1 answer

The terms log(1 - p) and log(p) are swapped. The correct cross-entropy is:

\[\mathcal{L} = -\frac{1}{N} \sum_i [y_i \log p_i + (1-y_i) \log(1-p_i)]\]

import numpy as np

def fixed_cross_entropy(y_true, y_pred_prob):
    loss = -np.mean(
        y_true * np.log(y_pred_prob) +
        (1 - y_true) * np.log(1 - y_pred_prob)
    )
    return loss

D2. The gradient descent adds the gradient instead of subtracting it. Fix it.

D2 answer

Gradient descent moves opposite to the gradient: \(w \leftarrow w - \alpha \nabla\). Adding the gradient would climb the loss surface (gradient ascent), not descend it.

def fixed_gradient_step(w, gradient, lr=0.1):
    return w - lr * gradient  # 3.0 - 0.1*0.5 = 2.95

Challenge (X)

X1. Implement K-Means from scratch on a dataset of 20 2D points. Run for 10 iterations and plot the cluster assignments after each step. Use K=3 and random seed 42.

X1 solution
import numpy as np
np.random.seed(42)
X = np.vstack([
    np.random.randn(7, 2) + [0, 0],
    np.random.randn(7, 2) + [5, 5],
    np.random.randn(6, 2) + [0, 5],
])
K = 3
centroids = X[np.random.choice(len(X), K, replace=False)]
for i in range(10):
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    assignments = np.argmin(dists, axis=1)
    for k in range(K):
        if (assignments == k).sum() > 0:
            centroids[k] = X[assignments == k].mean(axis=0)
# Per-iteration plotting is left to the notebook; as a text check,
# report the final cluster sizes instead:
for k in range(K):
    print(f'Cluster {k}: {(assignments == k).sum()} points')