Use this self-check after completing ML Foundations. Pass: 9 out of 12. If you score below 9, review the topics you missed before continuing to Phase 5 (DL Foundations).
1. Predict the output
Which category does each problem belong to: supervised learning, unsupervised learning, or reinforcement learning?
- (a) Predicting house prices from square footage and location.
- (b) Grouping news articles by topic without any pre-defined categories.
- (c) Teaching a robot to walk by giving it +1 for each step it takes without falling.
Answer
(a) Supervised — you have labeled examples (house price = label, features = input).
(b) Unsupervised — no labels; the algorithm discovers clusters.
(c) Reinforcement learning — the agent receives a reward signal and must learn from interaction.
2. Write a function
Implement MSE (mean squared error) for two arrays.
Answer
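The original snippet was lost in extraction; a minimal NumPy sketch (the function name `mse` is assumed):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)
```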
MSE = average of squared differences. It penalizes large errors more heavily than small ones because the differences are squared. Used in linear regression and in DQN’s TD loss.
3. Find the bug
This gradient descent loop is supposed to minimize w toward the minimum of f(w) = (w-5)^2, but w diverges instead of converging.
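The original code block was lost; a reconstruction of the buggy loop, with variable names (`lr`, `gradient`) and the update rule taken from the answer below:

```python
# Buggy gradient descent for f(w) = (w - 5)**2
w = 0.0
lr = 0.1
for _ in range(50):
    gradient = 2 * (w - 5)   # df/dw
    w = w + lr * gradient    # BUG: adds the gradient instead of subtracting

# w moves away from 5 and blows up
```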
Answer
Bug: w = w + lr * gradient adds the gradient instead of subtracting it. Gradient descent moves opposite to the gradient to minimize the loss.
Fix: w = w - lr * gradient
With the fix, \(w\) converges to 5 (the minimum of \((w-5)^2\)). With the bug, \(w\) moves away from 5 and diverges.
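With the sign flipped, the same loop converges (step count and learning rate are illustrative):

```python
# Corrected gradient descent for f(w) = (w - 5)**2
w = 0.0
lr = 0.1
for _ in range(100):
    gradient = 2 * (w - 5)
    w = w - lr * gradient    # move against the gradient to minimize

# w is now very close to 5, the minimum
```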
4. Predict the output
For a logistic regression model, what are sigmoid(0), sigmoid(1), and sigmoid(-1)?
Answer
- sigmoid(0) = 0.5 (the decision boundary: equal probability for both classes)
- sigmoid(1) ≈ 0.731 (more likely class 1)
- sigmoid(-1) ≈ 0.269 (more likely class 0)
These outputs are interpreted as probabilities. If sigmoid(z) > 0.5, predict class 1; otherwise predict class 0.
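These values can be checked directly (a quick sketch):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to (0, 1)."""
    return 1 / (1 + math.exp(-z))

print(sigmoid(0))             # 0.5
print(round(sigmoid(1), 3))   # 0.731
print(round(sigmoid(-1), 3))  # 0.269
```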
5. Write a function
Implement accuracy: fraction of predictions matching true labels.
Answer
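The original snippet was lost; a minimal sketch (function name `accuracy` assumed):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```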
== returns a boolean array (True/False = 1/0), and np.mean computes the fraction of True values.
6. Find the bug
This preprocessing pipeline has a data leakage bug. Find and fix it.
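The original pipeline code was lost; a reconstruction using a minimal stand-in for sklearn's `StandardScaler` (the data and split sizes are illustrative):

```python
import numpy as np

class StandardScaler:
    """Minimal stand-in for sklearn.preprocessing.StandardScaler."""
    def fit(self, X):
        self.mean_, self.scale_ = X.mean(axis=0), X.std(axis=0)
        return self
    def transform(self, X):
        return (X - self.mean_) / self.scale_

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(100, 3))

scaler = StandardScaler()
scaler.fit(X)                                   # BUG: fitted on ALL rows, test rows included
X_scaled = scaler.transform(X)
X_train, X_test = X_scaled[:80], X_scaled[80:]  # split happens too late
```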
Answer
Bug: scaler.fit(X) fits on the entire dataset including test samples. This leaks test information (mean and std of test data) into the preprocessing, giving optimistic results that won’t generalize to truly unseen data.
Fix:
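A corrected sketch (same minimal `StandardScaler` stand-in; the key change is splitting before fitting):

```python
import numpy as np

class StandardScaler:
    """Minimal stand-in for sklearn.preprocessing.StandardScaler."""
    def fit(self, X):
        self.mean_, self.scale_ = X.mean(axis=0), X.std(axis=0)
        return self
    def transform(self, X):
        return (X - self.mean_) / self.scale_

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(100, 3))

X_train_raw, X_test_raw = X[:80], X[80:]    # split FIRST
scaler = StandardScaler().fit(X_train_raw)  # statistics from training rows only
X_train = scaler.transform(X_train_raw)
X_test = scaler.transform(X_test_raw)       # test scaled with train statistics
```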
7. Conceptual
What is overfitting? How does cross-validation help detect and prevent it?
Answer
Overfitting occurs when a model memorizes the training data — it performs well on training examples but poorly on new, unseen data. Signs: training accuracy ≫ validation accuracy; training loss ≪ validation loss.
Cross-validation (e.g. k-fold) helps by: (1) detecting overfitting — if the CV score is well below the training score, the model is overfitting; (2) providing a more reliable estimate of generalization performance by testing on multiple non-overlapping validation sets; (3) guiding hyperparameter selection without touching the held-out test set.
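The k-fold splitting idea can be sketched with index sets (illustrative only; in practice use sklearn's `KFold`):

```python
import numpy as np

def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs: k non-overlapping validation folds."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx
```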
8. Predict the output
K-nearest neighbors (K=1) with training points: A=(1,1,class 0), B=(3,3,class 1), C=(2,2,class 0). For new point P=(2.5, 2.5), what does K=1 KNN predict?
Answer
Compute distances from P=(2.5,2.5) to each training point:
- d(P,A) = sqrt((2.5-1)²+(2.5-1)²) = sqrt(4.5) ≈ 2.12
- d(P,B) = sqrt((2.5-3)²+(2.5-3)²) = sqrt(0.5) ≈ 0.71
- d(P,C) = sqrt((2.5-2)²+(2.5-2)²) = sqrt(0.5) ≈ 0.71
B and C are tied at distance √0.5 ≈ 0.707, so the K=1 prediction depends on the tie-breaking rule. scikit-learn breaks ties by training order, so B (class 1) wins here; an implementation that preferred C would predict class 0. The honest answer: class 1 under scikit-learn's convention, but the result is tie-break dependent.
In practice, use K>1 to avoid sensitivity to single-point ties.
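The distances can be verified directly (a sketch; the tie-breaking itself is an implementation detail):

```python
import numpy as np

# (coordinates, class label) for each training point
points = {"A": ((1, 1), 0), "B": ((3, 3), 1), "C": ((2, 2), 0)}
P = np.array([2.5, 2.5])

distances = {name: float(np.linalg.norm(P - np.array(xy)))
             for name, (xy, _) in points.items()}
# A ≈ 2.121, B ≈ 0.707, C ≈ 0.707 — B and C tie for nearest
```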
9. Write a function
Compute precision given TP, FP, FN counts.
Answer
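The original snippet was lost; a minimal sketch (function name and the zero-denominator convention are assumptions):

```python
def precision(tp, fp, fn):
    """Precision = TP / (TP + FP); FN is unused here (it belongs to recall)."""
    if tp + fp == 0:
        return 0.0   # convention: no positive predictions -> precision 0
    return tp / (tp + fp)
```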
Precision = TP / (TP + FP) = “of all predicted positives, what fraction are actually positive?” FN is not used for precision (it’s used for recall = TP / (TP + FN)).
10. Find the bug
This entropy formula is used in a decision tree. Find the bug.
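The original code block was lost; a reconstruction of the buggy binary-entropy function, assuming the `np.log` usage described in the answer below:

```python
import numpy as np

def entropy(p):
    """Binary entropy — BUG: np.log is base e (nats), not base 2 (bits)."""
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

# entropy(0.5) comes out ≈ 0.693, not the expected 1.0 bit
```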
Answer
Bug: np.log computes the natural log (base \(e\)). Decision tree information gain uses log base 2, so entropy is measured in bits.
Fix: replace np.log with np.log2:
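A corrected sketch with np.log2:

```python
import numpy as np

def entropy(p):
    """Binary entropy in bits (log base 2)."""
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
```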
With log2, entropy(0.5) = 1.0 bit (maximum uncertainty for binary). With log, entropy(0.5) ≈ 0.693 (nats, not bits).
11. Predict the output
After training a logistic regression classifier, what does model.predict_proba(X) return? What is the shape of the output for 5 samples and 3 classes?
Answer
predict_proba returns a 2D array of shape (n_samples, n_classes) — for 5 samples and 3 classes, shape is (5, 3).
Each row contains the predicted probability for each class: values are in [0,1] and each row sums to 1. Row \(i\) gives \([P(\text{class 0}|x_i), P(\text{class 1}|x_i), P(\text{class 2}|x_i)]\).
To get hard predictions: np.argmax(model.predict_proba(X), axis=1) (equivalent to model.predict(X)).
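The shape and row-sum properties can be illustrated with a softmax over random logits (a stand-in for the trained model's output, not sklearn itself):

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 3))                     # 5 samples, 3 classes

# numerically stable softmax per row, mimicking predict_proba's output shape
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
proba = exp / exp.sum(axis=1, keepdims=True)

hard = np.argmax(proba, axis=1)                      # like model.predict
```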
12. Conceptual
Why does linear regression fail on XOR? What do we need instead?
Answer
XOR is not linearly separable: no single straight line can separate the two classes (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0. Linear regression / logistic regression can only learn linear decision boundaries.
To solve XOR, we need non-linear models: a neural network with at least one hidden layer (even 2 hidden neurons suffice), or kernelized SVMs, or decision trees. The hidden layer learns non-linear feature combinations that make the problem linearly separable in the transformed space.
In RL context: tabular Q-learning stores one value per discrete state-action pair and cannot generalize; for complex state spaces (e.g. continuous or high-dimensional), we need non-linear function approximators (neural networks), as in DQN. XOR is the classic demonstration of why linear models are not enough.
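Two hidden neurons really do suffice; a hand-weighted sketch (weights chosen by hand rather than learned, using a step activation for clarity):

```python
import numpy as np

def step(z):
    """Threshold activation: 1 if z > 0, else 0."""
    return (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

h1 = step(X.sum(axis=1) - 0.5)   # hidden neuron 1: fires on OR
h2 = step(X.sum(axis=1) - 1.5)   # hidden neuron 2: fires on AND
out = step(h1 - h2 - 0.5)        # output: OR AND NOT AND = XOR

print(out)  # [0 1 1 0]
```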
Score: 9–12: Ready for Phase 5 (DL Foundations). 7–8: Review the specific topics you missed before continuing. Below 7: Complete ML Foundations and return to this assessment.