Learning objectives
- Use the sklearn API (`fit`/`predict`/`score`) consistently across different model classes.
- Build a `Pipeline` that chains preprocessing and a classifier.
- Compare multiple models on the same dataset using test-set accuracy.
Concept and real-world motivation
Scikit-learn provides a unified API: every model has `fit(X_train, y_train)`, `predict(X_test)`, and `score(X_test, y_test)`. This consistency lets you swap models with one line of code. Pipelines extend this: they chain preprocessing steps (like `StandardScaler`) and a final estimator into a single object that can be fitted, used for prediction, and cross-validated as a unit, preventing data leakage automatically.
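A minimal sketch of this uniformity (assuming scikit-learn is installed): three very different estimators, including a `Pipeline`, are trained and scored with the exact same three calls, so swapping models is a one-line change.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Each entry obeys the same fit/predict/score contract; the Pipeline
# bundles scaling + classification into one such estimator.
models = {
    "logreg": LogisticRegression(max_iter=200),
    "tree": DecisionTreeClassifier(random_state=42),
    "scaled_logreg": Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=200)),
    ]),
}

for name, model in models.items():
    model.fit(X_train, y_train)  # learn from training data only
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```

Note that the loop body never mentions a concrete model class: that is the point of the unified API.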
In RL, we follow an analogous consistent pattern: initialize the agent → interact with the environment → update the policy → evaluate on new episodes. Just as sklearn pipelines chain steps, RL training loops chain environment steps, replay updates, and evaluation episodes. Having a consistent API makes experimentation fast.
Illustration: The sklearn pipeline flow.
Exercise: Run the full sklearn workflow on the Iris dataset: load, split, train two models, and compare.
Professor’s hints
- The sklearn pattern is always: `model.fit(X_train, y_train)`, then `y_pred = model.predict(X_test)`, then `accuracy_score(y_test, y_pred)`. Or the shortcut: `model.score(X_test, y_test)`.
- `LogisticRegression(max_iter=200)`: increase `max_iter` if you see a ConvergenceWarning.
- `DecisionTreeClassifier(random_state=42)` sets the random seed for reproducibility. Without it, results may vary across runs.
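The two evaluation routes from the first hint can be sketched side by side (assuming scikit-learn); they produce the same number, since `score` predicts internally and computes accuracy for a classifier.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Higher max_iter avoids a ConvergenceWarning on unscaled features.
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Explicit route: predict, then compare predictions to labels.
y_pred = model.predict(X_test)
acc_long = accuracy_score(y_test, y_pred)

# Shortcut route: score predicts internally.
acc_short = model.score(X_test, y_test)

print(acc_long, acc_short)  # the two numbers are identical
```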
Common pitfalls
- Calling `fit` on test data: never `model.fit(X_test, y_test)`. The model must only see `X_train` during fitting; evaluation uses a completely separate `X_test`.
- Forgetting to scale for logistic regression: logistic regression converges faster and more reliably when features are on the same scale. Use `StandardScaler` in a `Pipeline`.
- Using `score` on training data to report model quality: `model.score(X_train, y_train)` is training accuracy, which measures memorization. Always report `model.score(X_test, y_test)`.
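The last pitfall is easy to demonstrate (a sketch, assuming scikit-learn): an unpruned decision tree can memorize the training set, so its training accuracy says nothing about generalization.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# No depth limit: the tree keeps splitting until every training
# sample is classified correctly.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

print("train accuracy:", tree.score(X_train, y_train))  # 1.0 (memorized)
print("test accuracy: ", tree.score(X_test, y_test))    # the honest number
```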
Worked solution
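The original solution code did not survive formatting; below is a minimal sketch of the workflow the exercise describes (load, split, train two models, compare), assuming the two model choices named in the hints, `LogisticRegression` and `DecisionTreeClassifier`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load
X, y = load_iris(return_X_y=True)

# 2. Split (fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# 3. Train two models through the identical API
logreg = LogisticRegression(max_iter=200).fit(X_train, y_train)
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# 4. Compare on the held-out test set
for name, model in [("logistic regression", logreg), ("decision tree", tree)]:
    y_pred = model.predict(X_test)
    print(f"{name}: test accuracy = {accuracy_score(y_test, y_pred):.3f}")
```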
Extra practice
- Warm-up: Use `Pipeline` with `StandardScaler` and `LogisticRegression` on Iris. Confirm you get the same or better accuracy than the unscaled version above.
- Coding: Run `cross_val_score` (5-fold) on both `LogisticRegression` and `DecisionTreeClassifier` using a `Pipeline` with `StandardScaler`. Report mean ± std for each.
- Challenge: Compare 4 classifiers on Iris: `LogisticRegression`, `DecisionTreeClassifier`, `KNeighborsClassifier(n_neighbors=5)`, and `RandomForestClassifier(n_estimators=100)`. Print a table of train accuracy and test accuracy for each. Which overfits the most?
- Variant: Try `DecisionTreeClassifier(max_depth=1)`, `max_depth=3`, and `max_depth=None` (unlimited). Plot test accuracy vs. `max_depth`. At what depth does overfitting start?
- Debug: The code below has a bug where `fit` is called on the test set instead of the training set. Find and fix it.
- Conceptual: Why does a `Pipeline` prevent data leakage during cross-validation, whereas fitting a `StandardScaler` separately on all data before splitting does not?
- Recall: From memory, write the three core sklearn API calls for a classification workflow, and explain what each does.
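For the conceptual question, the mechanics can be sketched as follows (assuming scikit-learn): inside `cross_val_score`, a `Pipeline` refits its `StandardScaler` on each training fold only, whereas pre-scaling the whole dataset lets every test fold influence the scaling statistics.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Safe: scaler statistics are recomputed inside each training fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=200)),
])
safe_scores = cross_val_score(pipe, X, y, cv=5)

# Leaky: the scaler has already seen every row, including the rows
# that later serve as held-out test folds.
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=200), X_leaky, y, cv=5)

print("pipeline (no leakage):", safe_scores.mean())
print("pre-scaled (leakage): ", leaky_scores.mean())
```

On a small, easy dataset like Iris the two means may be close; the point is the protocol, not the size of the gap.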