Learning objectives

  • Explain why we must evaluate on held-out data, not training data.
  • Construct a confusion matrix and compute TP, TN, FP, FN by hand.
  • Calculate accuracy, precision, recall, and F1 from a confusion matrix.

Concept and real-world motivation

Imagine studying for an exam by memorizing the answer key. You would score 100% on that exact sheet — but fail any new questions. The same trap exists in ML: if you test a model on the data it trained on, you measure memorization, not learning. The fix is a train/test split: keep a portion of data completely separate, train the model on the rest, and evaluate only on the held-out test set.
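The split described above can be sketched with scikit-learn's train_test_split (a minimal sketch; the toy X and y arrays here are made up purely for illustration):

```python
from sklearn.model_selection import train_test_split

X = list(range(10))   # toy "features": just the numbers 0..9
y = [0, 1] * 5        # toy binary labels

# Hold out 30% of the data; the model never sees X_test during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))  # 7 3
```

The random_state argument fixes the shuffle so the split is reproducible — useful when you want classmates (or future you) to get the same held-out set.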

Beyond simple accuracy, we need richer metrics. Precision measures “of all positive predictions, how many were actually positive?” Recall measures “of all actual positives, how many did we catch?” F1 is their harmonic mean — a single number that balances both. In RL, we evaluate agents on new environments or random seeds they were never trained on — exactly the same idea of honest held-out evaluation.

Illustration: Metric comparison for a sample classifier.

Exercise: Given predictions = [1,0,1,1,0,1,0,0] and true_labels = [1,0,0,1,0,1,1,0]:

  1. Compute the confusion matrix (TP, TN, FP, FN) by hand.
  2. Compute accuracy, precision, recall, and F1.
  3. Verify all four metrics with sklearn.metrics.


Professor’s hints

  • Loop over zip(predictions, true_labels) and accumulate four counters, one per confusion matrix cell.
  • Precision’s denominator is everything you predicted as positive (TP + FP). Recall’s denominator is everything that was actually positive (TP + FN).
  • F1 uses the harmonic mean: \(F_1 = \frac{2PR}{P + R}\). Harmonic means penalize extreme imbalances between P and R more than the arithmetic mean would.
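The last hint can be checked numerically with a few lines (a quick sketch, nothing here beyond the F1 formula itself):

```python
def f1(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)

# When P and R agree, harmonic and arithmetic means coincide:
print(f1(0.8, 0.8))  # 0.8

# When they are far apart, the harmonic mean is dragged toward the
# smaller value: the arithmetic mean of 0.9 and 0.1 is 0.5, but F1 is
print(f1(0.9, 0.1))  # 0.18
```

This is exactly why F1 is a stricter summary than plain averaging: a model cannot hide a terrible recall behind an excellent precision.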

Common pitfalls

  • Testing on training data: Never call score() or compute metrics on your training set to claim model quality. Always use held-out data.
  • Swapping FP and FN: FP = you said positive, it was negative. FN = you said negative, it was positive. Get the denominators of precision and recall right.
  • Using accuracy alone on imbalanced data: If 95% of samples are class 0, a model that always predicts 0 gets 95% accuracy. Precision and recall reveal the truth.
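The third pitfall is easy to demonstrate on made-up data (the 95/5 split below is hypothetical, chosen to match the example in the bullet):

```python
true_labels = [0] * 95 + [1] * 5   # 95% of samples are class 0
preds = [0] * 100                  # a "model" that always predicts 0

accuracy = sum(p == t for p, t in zip(preds, true_labels)) / len(true_labels)
tp = sum(p == 1 and t == 1 for p, t in zip(preds, true_labels))
fn = sum(p == 0 and t == 1 for p, t in zip(preds, true_labels))
recall = tp / (tp + fn)            # 0 / 5 — it catches none of the positives

print(accuracy, recall)  # 0.95 0.0
```

Note that precision is undefined here (TP + FP = 0): the model never predicts positive at all, which is itself a red flag that accuracy alone would hide.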
Worked solution
predictions = [1, 0, 1, 1, 0, 1, 0, 0]
true_labels  = [1, 0, 0, 1, 0, 1, 1, 0]

TP = sum(p == 1 and t == 1 for p, t in zip(predictions, true_labels))  # 3
TN = sum(p == 0 and t == 0 for p, t in zip(predictions, true_labels))  # 3
FP = sum(p == 1 and t == 0 for p, t in zip(predictions, true_labels))  # 1
FN = sum(p == 0 and t == 1 for p, t in zip(predictions, true_labels))  # 1

accuracy  = (TP + TN) / len(predictions)   # 0.75
precision = TP / (TP + FP)                 # 3/4 = 0.75
recall    = TP / (TP + FN)                 # 3/4 = 0.75
f1        = 2 * precision * recall / (precision + recall)  # 0.75

# sklearn verification
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print(accuracy_score(true_labels, predictions))
print(precision_score(true_labels, predictions))
print(recall_score(true_labels, predictions))
print(f1_score(true_labels, predictions))
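The hand-computed counts can also be cross-checked in one call with sklearn's confusion_matrix, which for binary labels returns the array [[TN, FP], [FN, TP]]:

```python
from sklearn.metrics import confusion_matrix

predictions = [1, 0, 1, 1, 0, 1, 0, 0]
true_labels = [1, 0, 0, 1, 0, 1, 1, 0]

# .ravel() flattens the 2x2 array in row order: TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(true_labels, predictions).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```

Watch the argument order: sklearn puts true labels first, predictions second — swapping them transposes the matrix and silently exchanges FP and FN.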

Extra practice

  1. Warm-up: For 5 predictions [1,0,1,0,1] and true labels [1,1,1,0,0], compute accuracy by hand. Count TP, TN, FP, FN, then divide.
  2. Coding: Write a function confusion_counts(y_true, y_pred) that returns a dict {'TP': ..., 'TN': ..., 'FP': ..., 'FN': ...} for any pair of binary label lists.
  3. Challenge: Generate random binary predictions (np.random.randint(0, 2, 100)) and a random true label array. Compute all four metrics. Then try a classifier that always predicts 1 — compare precision and recall.
  4. Variant: Change predictions[6] from 0 to 1. Recompute precision and recall. Which one changes? Why?
  5. Debug: The code below has a bug — FP and FN are swapped in the precision and recall formulas. Find and fix it.
  6. Conceptual: Give a real-world example where high recall matters more than high precision (e.g. cancer screening). Now give one where high precision matters more.
  7. Recall: From memory, write the formulas for precision, recall, and F1 in terms of TP, TN, FP, FN.