R1. What is the difference between supervised and unsupervised learning? Give one example of each.
R1 answer
Supervised learning: Each training example has a label \(y\). The model learns a mapping \(f: X \to Y\). Example: predicting house price from features (regression) or classifying emails as spam/not-spam (classification).
Unsupervised learning: No labels. The algorithm finds structure in the data itself. Example: K-Means clustering of customer purchase history to identify segments.
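A minimal sketch of the two settings, using toy data invented for illustration (the house-price and email examples above are not reproduced here): regression uses labels, while the clustering step has none and must infer groups from the points alone.

```python
import numpy as np

# Supervised: labeled pairs (x, y) -> fit a line (regression).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])          # labels provided with each example
slope, intercept = np.polyfit(x, y, deg=1)   # learns f: x -> y

# Unsupervised: unlabeled points -> find structure.
# Here, a single nearest-centroid assignment as a toy 2-cluster illustration.
pts = np.array([0.1, 0.2, 0.15, 5.0, 5.2, 4.9])
centroids = np.array([0.0, 5.0])
labels = np.argmin(np.abs(pts[:, None] - centroids[None, :]), axis=1)
# labels groups the first three points together and the last three together.
```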
R2. What does MSE stand for? Write the formula.
R2 answer
Mean Squared Error. For N predictions: \[\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2\] It penalizes large errors more than small ones (due to squaring) and is always non-negative.
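The formula translates directly to code. This short sketch also illustrates the squaring effect: one error of 3 costs more than three errors of 1.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: average of squared residuals."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(mse([1, 2, 3], [1, 2, 6]))  # one error of 3: (0 + 0 + 9) / 3 = 3.0
print(mse([1, 2, 3], [2, 3, 4]))  # three errors of 1: (1 + 1 + 1) / 3 = 1.0
```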
R3. Why do we use train/test split instead of evaluating on training data?
R3 answer
Evaluating on training data measures memorization, not generalization. A model that overfits achieves near-zero training error but fails on new data. A held-out test set provides an unbiased estimate of how the model performs on unseen examples — the quantity we actually care about in practice.
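A small demonstration of the memorization-vs-generalization gap, using invented data (the linear relation, noise level, and split are assumptions for illustration): a high-degree polynomial interpolates the training points almost exactly, yet fails badly on the held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 12)
y = 2 * x + rng.normal(0, 0.1, size=12)   # true relation is linear plus noise

# Hold out the last 4 points as a test set.
x_tr, y_tr = x[:8], y[:8]
x_te, y_te = x[8:], y[8:]

# A degree-7 polynomial can pass through all 8 training points:
# training error is near zero, i.e. pure memorization.
coeffs = np.polyfit(x_tr, y_tr, deg=7)
train_mse = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
test_mse = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
# train_mse is tiny; test_mse is much larger -- the held-out set reveals
# that the model does not generalize.
```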
R4. What is the connection between the Markov decision process and supervised learning?
R4 answer
In RL, the value function \(V(s)\) and action-value function \(Q(s,a)\) are learned from experience, effectively a supervised regression problem in which the targets are bootstrapped returns (e.g. the TD target \(r + \gamma V(s')\)). One caveat: unlike true labels, bootstrapped targets shift as the estimates improve, so each update only locally resembles a regression step. The policy \(\pi(a|s)\) can be viewed as a classifier that maps states to action distributions. Both are typically trained by gradient descent on a loss, the same optimization algorithm that drives supervised learning.
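A toy sketch of this correspondence, on a hypothetical 2-state chain (s0 -> s1 -> terminal, rewards 0 then 1, \(\gamma = 1\)) invented for illustration: the TD(0) target plays the role of the regression label, and each update is one step toward it.

```python
import numpy as np

V = np.zeros(2)          # value estimates for s0, s1
alpha, gamma = 0.5, 1.0  # learning rate and discount

for _ in range(50):
    # TD(0): the bootstrapped target r + gamma * V(s') acts as the
    # regression "label" y; the update is a gradient step on the
    # squared error (target - V(s))^2 with the target held fixed.
    target_s0 = 0 + gamma * V[1]     # reward 0, then state s1
    V[0] += alpha * (target_s0 - V[0])
    target_s1 = 1 + gamma * 0.0      # reward 1, then terminal (value 0)
    V[1] += alpha * (target_s1 - V[1])

# V converges to the true values [1, 1] for this chain.
```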
R5. State the update rule for gradient descent.
R5 answer
\[w \leftarrow w - \alpha \frac{\partial \mathcal{L}}{\partial w}\] where \(\alpha\) is the learning rate and \(\frac{\partial \mathcal{L}}{\partial w}\) is the gradient of the loss with respect to the parameter \(w\). Move in the opposite direction of the gradient to reduce loss.
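The update rule in code, on a toy loss \(\mathcal{L}(w) = (w - 3)^2\) chosen here for illustration (its gradient is \(2(w - 3)\), and its minimizer is \(w^* = 3\)):

```python
# Gradient descent on L(w) = (w - 3)^2.
w = 0.0          # initial parameter
alpha = 0.1      # learning rate

for _ in range(100):
    grad = 2 * (w - 3)      # dL/dw at the current w
    w = w - alpha * grad    # step opposite the gradient

# w has converged to the minimizer w* = 3.
```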
D2. The gradient descent code adds the gradient instead of subtracting it. Fix it.
D2 answer
Gradient descent moves opposite to the gradient: \(w \leftarrow w - \alpha \nabla\). Adding the gradient would climb the loss surface (gradient ascent), not descend it.
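The original buggy snippet is not shown here, so this is a reconstruction of the sign bug on the same toy loss \(\mathcal{L}(w) = (w - 3)^2\): with `+` the loss grows every step (gradient ascent), with `-` it shrinks.

```python
def loss(w):
    return (w - 3) ** 2

def grad(w):
    return 2 * (w - 3)

w_bad = w_good = 0.0
alpha = 0.1
for _ in range(10):
    w_bad = w_bad + alpha * grad(w_bad)     # bug: '+' climbs the loss surface
    w_good = w_good - alpha * grad(w_good)  # fix: '-' descends it

# loss(w_bad) has grown well past the starting loss of 9;
# loss(w_good) has shrunk toward 0.
```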
X1. Implement K-Means from scratch on a dataset of 20 2D points. Run for 10 iterations and plot the cluster assignments after each step. Use K=3 and random seed 42.
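One possible sketch of this exercise. The exercise does not specify how the 20 points are generated, so the data below (points scattered around three loose centers) is an assumption; the per-step scatter plot is left as a comment to keep the sketch dependency-free.

```python
import numpy as np

rng = np.random.default_rng(42)
# Assumed data: 20 2D points around 3 hypothetical centers.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(0, 0.6, size=(7, 2)) for c in centers])[:20]

K = 3
centroids = X[rng.choice(len(X), K, replace=False)]  # random initialization

for step in range(10):
    # Assignment step: each point joins its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid moves to the mean of its assigned points.
    for k in range(K):
        if np.any(labels == k):
            centroids[k] = X[labels == k].mean(axis=0)
    # Plot after each step, e.g.:
    # plt.scatter(X[:, 0], X[:, 1], c=labels); plt.show()
```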