Learning objectives
- Explain why a single train/test split gives a noisy estimate of model quality.
- Implement K-fold cross-validation from scratch using array slicing.
- Describe the bias-variance tradeoff and identify overfitting and underfitting from error curves.
Concept and real-world motivation
A single train/test split is like grading a student on one exam. If that exam happens to cover topics they know well (or badly), the grade is misleading. K-fold cross-validation fixes this: divide data into K equal parts (“folds”), then train on K-1 folds and test on the remaining fold, rotating K times. Average the K test scores for a much more stable estimate.
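The rotation described above can be sketched directly with index slicing. This is an illustrative sketch assuming 20 samples and K=5, matching the exercise setup below:

```python
# Sketch of K-fold index rotation: K non-overlapping contiguous test blocks.
n_samples, k = 20, 5
fold_size = n_samples // k  # 4 samples per fold

for fold in range(k):
    # Test fold: one contiguous, non-overlapping block of indices.
    test_idx = list(range(fold * fold_size, (fold + 1) * fold_size))
    # Training set: all remaining indices.
    train_idx = [i for i in range(n_samples) if i not in test_idx]
    print(f"fold {fold}: test={test_idx}, train size={len(train_idx)}")
```

Each index appears in exactly one test fold, so every sample is used for evaluation exactly once across the K rotations.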
The deeper issue K-fold guards against is overfitting: a model that memorizes training data but fails on new data. The opposite — a model too simple to capture the pattern — is underfitting. The bias-variance tradeoff names this tension: simple models have high bias (systematic error), complex models have high variance (sensitivity to training data). The sweet spot minimizes total error on held-out data. In RL, we evaluate agents across multiple random seeds — exactly the K-fold idea applied to policy evaluation.
Illustration: Test error shows a U-shape as model complexity grows — low at the sweet spot.
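A minimal sketch of how such a curve can be produced, using numpy's `polyfit` on a synthetic quadratic dataset (the seed, degrees, and data-generating function here are illustrative assumptions, not part of the original exercise):

```python
import numpy as np

rng = np.random.default_rng(0)                      # fixed seed for reproducibility
x = np.linspace(-1, 1, 20)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.2, size=x.shape)  # quadratic + noise

x_train, y_train = x[::2], y[::2]                   # even indices: training
x_test, y_test = x[1::2], y[1::2]                   # odd indices: held out

test_errors = {}
for degree in (1, 2, 5):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    def mse(xs, ys, c=coeffs):
        return float(np.mean((np.polyval(c, xs) - ys) ** 2))
    test_errors[degree] = mse(x_test, y_test)
    print(f"degree {degree}: train MSE={mse(x_train, y_train):.3f}, "
          f"test MSE={test_errors[degree]:.3f}")
```

Degree 1 underfits the quadratic pattern (high bias), so its held-out error stays well above that of degree 2, the true complexity of the data.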
Exercise: Implement 5-fold cross-validation manually on a synthetic dataset, without using sklearn’s cross_val_score.
Professor’s hints
- `train_idx` = all indices in `range(20)` that are NOT in `test_idx`. Use a list comprehension: `[i for i in range(len(X)) if i not in test_idx]`.
- You can use any classifier here. A majority-vote baseline is fine for the exercise structure. Try replacing it with a real model from sklearn if you have it available.
- The variance of CV scores tells you how stable the estimate is. High variance across folds means you need more data or a simpler model.
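The stability check from the last hint amounts to a mean and a standard deviation over the fold scores. A small sketch with hypothetical per-fold accuracies (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical per-fold accuracies from one 5-fold CV run.
fold_scores = np.array([0.80, 0.85, 0.75, 0.90, 0.80])

mean_score = fold_scores.mean()
std_score = fold_scores.std(ddof=1)   # sample standard deviation across folds
print(f"CV score: {mean_score:.3f} +/- {std_score:.3f}")
```

A small standard deviation relative to the mean suggests the estimate is stable; a large one suggests more data or a simpler model is needed.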
Common pitfalls
- Overlapping folds: Ensure test indices are non-overlapping across iterations. Off-by-one errors (`range(fold * fold_size, (fold+1) * fold_size)` vs. adding `+1` to the end) cause overlap and data leakage.
- Leaking test data into preprocessing: If you normalize with statistics computed on all data before splitting, information from the test fold contaminates training. Always fit scalers only on the training fold.
- Using K=len(data) (leave-one-out) blindly: LOOCV is unbiased but has high variance and is slow on large datasets. K=5 or K=10 is usually the right default.
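The preprocessing pitfall above is worth seeing in code. A sketch of the correct pattern for one fold, assuming a toy 20x3 feature matrix (the data and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(5.0, 2.0, size=(20, 3))   # toy feature matrix, 20 samples x 3 features

fold_size = 4
test_idx = list(range(0, fold_size))                       # first fold as the test block
train_idx = [i for i in range(len(X)) if i not in test_idx]

# Correct: compute normalization statistics on the training fold only...
mu = X[train_idx].mean(axis=0)
sigma = X[train_idx].std(axis=0)

# ...then apply those same statistics to both folds.
X_train = (X[train_idx] - mu) / sigma
X_test = (X[test_idx] - mu) / sigma    # the test fold never influences mu or sigma
```

Computing `mu` and `sigma` on all 20 rows before splitting would let the test fold's values leak into the training features, inflating the CV score.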
Worked solution
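A minimal sketch of one possible worked solution, consistent with the hints above (20 synthetic samples, majority-vote baseline, manual index slicing); details beyond those hints are assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 2))                  # synthetic features (unused by the baseline)
y = (rng.random(20) > 0.3).astype(int)        # synthetic binary labels

k = 5
fold_size = len(X) // k                       # 4 samples per fold
scores = []

for fold in range(k):
    # Non-overlapping contiguous test block for this fold.
    test_idx = list(range(fold * fold_size, (fold + 1) * fold_size))
    train_idx = [i for i in range(len(X)) if i not in test_idx]

    # Majority-vote baseline: predict the most common label in the training fold.
    values, counts = np.unique(y[train_idx], return_counts=True)
    prediction = values[np.argmax(counts)]

    accuracy = float(np.mean(y[test_idx] == prediction))
    scores.append(accuracy)

print(f"fold accuracies: {scores}")
print(f"CV score: {np.mean(scores):.3f} +/- {np.std(scores, ddof=1):.3f}")
```

Swapping the majority-vote block for a real classifier (fit on `X[train_idx], y[train_idx]`, predict on `X[test_idx]`) leaves the cross-validation scaffolding unchanged.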
Extra practice
- Warm-up: For a dataset of 10 samples, list the exact train and test indices for each fold when K=5 (fold_size=2). Write them out as lists.
- Coding: Compare 5-fold CV scores for a degree-1 polynomial (underfitting) and a degree-5 polynomial (overfitting) on the same 20-sample dataset. Which has lower test error?
- Challenge: Implement stratified K-fold from scratch: ensure each fold has approximately the same class balance as the full dataset. Compare CV scores with and without stratification on a heavily imbalanced dataset (90% class 0, 10% class 1).
- Variant: Re-run the main exercise with K=3 and K=10. How do the mean CV score and its standard deviation change? Note that larger K means each model trains on more data (larger training folds) but is tested on fewer samples per fold.
- Debug: The code below has a bug — fold indices overlap because the range end is wrong. Find and fix it.
- Conceptual: Why is the harmonic mean (used in F1) more appropriate than the arithmetic mean when averaging precision and recall? Does the same logic apply to averaging CV fold scores?
- Recall: From memory, describe the K-fold algorithm in 4 steps. Then write the formula for the CV score as the average of fold test scores.
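The buggy snippet referenced in the Debug item did not survive into this version; a hypothetical reconstruction matching its description (the range end is one too large, so consecutive folds overlap) might look like this:

```python
# Debug exercise: fold indices overlap because the range end is wrong.
n_samples, k = 20, 5
fold_size = n_samples // k   # 4

folds = []
for fold in range(k):
    # Intended: a non-overlapping block of fold_size test indices per iteration.
    test_idx = list(range(fold * fold_size, (fold + 1) * fold_size + 1))
    folds.append(test_idx)

print(folds[0])  # [0, 1, 2, 3, 4]
print(folds[1])  # [4, 5, 6, 7, 8]
```

Index 4 appears in both of the printed folds, and the final fold even reaches index 20, one past the end of a 20-sample array.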