Learning objectives

  • Explain why a single train/test split gives a noisy estimate of model quality.
  • Implement K-fold cross-validation from scratch using array slicing.
  • Describe the bias-variance tradeoff and identify overfitting and underfitting from error curves.

Concept and real-world motivation

A single train/test split is like grading a student on one exam. If that exam happens to cover topics they know well (or badly), the grade is misleading. K-fold cross-validation fixes this: divide data into K equal parts (“folds”), then train on K-1 folds and test on the remaining fold, rotating K times. Average the K test scores for a much more stable estimate.
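
The rotation described above can be sketched with plain index arithmetic (the values of n and K below are just for illustration):

```python
# Sketch of the K-fold rotation over n samples (illustrative values).
n, K = 20, 5
fold_size = n // K

for fold in range(K):
    # Each fold holds out one contiguous slice of indices as the test set...
    test_idx = list(range(fold * fold_size, (fold + 1) * fold_size))
    # ...and trains on everything else.
    train_idx = [i for i in range(n) if i not in test_idx]
    print(f'fold {fold}: test {test_idx}, train size {len(train_idx)}')
```

Each index lands in exactly one test fold, so every sample is scored exactly once across the K rounds.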

The deeper issue K-fold guards against is overfitting: a model that memorizes training data but fails on new data. The opposite — a model too simple to capture the pattern — is underfitting. The bias-variance tradeoff names this tension: simple models have high bias (systematic error), complex models have high variance (sensitivity to training data). The sweet spot minimizes total error on held-out data. In RL, we evaluate agents across multiple random seeds — exactly the K-fold idea applied to policy evaluation.

Illustration: Test error traces a U-shape as model complexity grows: high for very simple and very complex models, lowest at the sweet spot.
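
A minimal sketch of that U-shape, assuming a small synthetic 1-D regression problem (the dataset, noise level, and degrees below are made up for illustration):

```python
import numpy as np

# Train/test error vs. polynomial degree on a noisy sine curve.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = np.sin(3 * x) + 0.3 * rng.standard_normal(40)

# Simple alternating split: even indices train, odd indices test.
x_train, x_test = x[::2], x[1::2]
y_train, y_test = y[::2], y[1::2]

for degree in (1, 3, 6, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f'degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}')
```

Train error can only fall as the degree grows (each higher-degree model nests the lower one), while test error typically bottoms out at an intermediate degree.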

Exercise: Implement 5-fold cross-validation manually on a synthetic dataset, without using sklearn’s cross_val_score.

Try it — edit and run (Shift+Enter)

Professor’s hints

  • train_idx = all indices in range(20) that are NOT in test_idx. Use a list comprehension: [i for i in range(len(X)) if i not in test_idx].
  • You can use any classifier here. A majority-vote baseline is fine for the exercise structure. Try replacing it with a real model from sklearn if you have it available.
  • The variance of CV scores tells you how stable the estimate is. High variance across folds means you need more data or a simpler model.

Common pitfalls

  • Overlapping folds: Ensure test indices are non-overlapping across iterations. An off-by-one in the slice bounds, e.g. writing range(fold * fold_size, (fold + 1) * fold_size + 1) instead of range(fold * fold_size, (fold + 1) * fold_size), makes adjacent folds share an index and leaks test data into training.
  • Leaking test data into preprocessing: If you normalize with statistics computed on all data before splitting, information from the test fold contaminates training. Always fit scalers only on the training fold.
  • Using K=len(data) (leave-one-out) blindly: LOOCV is unbiased but has high variance and is slow on large datasets. K=5 or K=10 is usually the right default.
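
The preprocessing-leakage pitfall can be avoided by computing statistics on the training fold only. A minimal numpy sketch, where standardization stands in for any preprocessor with learned statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(5.0, 2.0, size=(20, 2))             # illustrative feature matrix

test_idx = list(range(0, 4))                        # one fold held out
train_idx = [i for i in range(len(X)) if i not in test_idx]

mu = X[train_idx].mean(axis=0)                      # training-fold statistics only
sigma = X[train_idx].std(axis=0)

X_train = (X[train_idx] - mu) / sigma               # leakage-free transform
X_test = (X[test_idx] - mu) / sigma                 # reuse the training statistics
```

Fitting mu and sigma on all 20 rows before splitting would let the test fold influence the transform applied to the training data.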
Worked solution
import numpy as np

np.random.seed(42)
X = np.random.randn(20, 2)                       # 20 samples, 2 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # binary labels

K = 5
fold_size = len(X) // K
scores = []

for fold in range(K):
    # Contiguous, non-overlapping slice of indices for the test fold.
    test_idx  = list(range(fold * fold_size, (fold + 1) * fold_size))
    train_idx = [i for i in range(len(X)) if i not in test_idx]

    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Majority-vote baseline: predict the most common training label.
    majority = int(np.mean(y_train) >= 0.5)
    preds = np.full(len(y_test), majority)
    acc = np.mean(preds == y_test)
    scores.append(acc)

print(f'Mean CV: {np.mean(scores):.3f} +/- {np.std(scores):.3f}')
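
A quick way to guard against the overlapping-folds pitfall is a reusable sanity check (check_folds is a helper name made up here, not part of any library):

```python
def check_folds(folds, n):
    """Assert the test folds are disjoint and together cover all n indices."""
    seen = sorted(i for fold in folds for i in fold)
    assert seen == list(range(n)), 'folds overlap or miss indices'

K, n = 5, 20
fold_size = n // K
folds = [list(range(f * fold_size, (f + 1) * fold_size)) for f in range(K)]
check_folds(folds, n)
print('all', K, 'folds are disjoint and cover', n, 'indices')
```

Running this after building your fold indices catches off-by-one bugs before they silently inflate your CV score.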

Extra practice

  1. Warm-up: For a dataset of 10 samples, list the exact train and test indices for each fold when K=5 (fold_size=2). Write them out as lists.
Try it — edit and run (Shift+Enter)
  2. Coding: Compare 5-fold CV scores for a degree-1 polynomial (underfitting) and a degree-5 polynomial (overfitting) on the same 20-sample dataset. Which has lower test error?
Try it — edit and run (Shift+Enter)
  3. Challenge: Implement stratified K-fold from scratch: ensure each fold has approximately the same class balance as the full dataset. Compare CV scores with and without stratification on a heavily imbalanced dataset (90% class 0, 10% class 1).
  4. Variant: Re-run the main exercise with K=3 and K=10. How do the mean CV score and its standard deviation change? Note that larger K means smaller test folds and larger training sets, so K=10 sits closer to leave-one-out than K=3 does.
  5. Debug: The code below has a bug: fold indices overlap because the range end is wrong. Find and fix it.
Try it — edit and run (Shift+Enter)
  6. Conceptual: Why is the harmonic mean (used in F1) more appropriate than the arithmetic mean when averaging precision and recall? Does the same logic apply to averaging CV fold scores?
  7. Recall: From memory, describe the K-fold algorithm in 4 steps. Then write the formula for the CV score as the average of fold test scores.