Learning objectives
- Execute a complete ML workflow from raw data to model comparison.
- Apply StandardScaler, train multiple classifiers, and evaluate with accuracy, precision, and recall.
- Interpret results and make a justified model choice.
Concept and real-world motivation
This page is a mini-project that integrates every concept from the ML Foundations section. There is no new theory — only application. Real ML work looks exactly like this: load data, explore it, preprocess, train several models, evaluate honestly on held-out data, and compare results systematically.
The same workflow applies to RL evaluation: load or generate trajectories, preprocess states, train a value function or policy, evaluate on unseen episodes, and compare agent variants. The “best model” in supervised learning is the one with the best test metrics; the “best agent” in RL is the one that maximizes expected return across new environments. This project is your bridge between the two worlds.
Illustration: Compare accuracy across three classifiers.
Exercise — Full pipeline on the Wine dataset (Steps 1–4):
Load and explore the Wine dataset, preprocess, and train three models.
Professor’s hints
- `scaler.fit_transform(X_train)` fits AND transforms in one step. Then `scaler.transform(X_test)` applies the SAME scaling; do not refit on the test set, as that would be data leakage.
- `precision_score(..., average='macro')` averages precision across all 3 classes equally. Use `'weighted'` if classes are imbalanced.
- `stratify=y` in `train_test_split` ensures all 3 wine classes appear in both train and test in the right proportions.
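The hinted calls fit together as follows; a minimal sketch assuming scikit-learn, with illustrative variable names:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# stratify=y keeps all 3 wine classes proportionally represented in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit AND transform on train only
X_test_scaled = scaler.transform(X_test)        # apply the SAME scaling; no refit
```

Fitting the scaler only on `X_train` is what keeps the test set truly unseen.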
Common pitfalls
- Data leakage via scaler: `scaler.fit_transform(X)` on all data before splitting leaks test statistics into training. Always fit the scaler only on `X_train`.
- Forgetting `stratify` on multi-class data: without it, small classes may vanish from the test set, making evaluation meaningless.
- Comparing models trained with different preprocessing: all three models above use the same scaled data, which is fair. Comparing scaled LR to unscaled DT would not be.
Worked solution — preprocessing and training
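A sketch of the preprocessing-and-training solution. The page explicitly names only `LogisticRegression` and `DecisionTreeClassifier`; `KNeighborsClassifier` as the third model is an assumption here:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)   # fit on train only
X_test_s = scaler.transform(X_test)         # same scaling, no refit

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000, random_state=42),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),  # assumed third model
}

# All three models see the same scaled data, so the comparison is fair.
for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    print(f"{name:20s} "
          f"acc={accuracy_score(y_test, pred):.3f} "
          f"prec={precision_score(y_test, pred, average='macro'):.3f} "
          f"rec={recall_score(y_test, pred, average='macro'):.3f}")
```

Exact scores depend on the split, but with `random_state=42` the run is reproducible, which is what makes the model comparison systematic.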
Extra practice
- Steps 1–2 — Exploration: Load the Wine dataset and display a bar chart of class distribution and the mean value of each feature per class.
- Coding: Add `cross_val_score` (5-fold) for each of the three models. Report mean ± std. Do the CV scores agree with the single test-set scores?
- Challenge: Add a fourth model: `RandomForestClassifier(n_estimators=100, random_state=42)`. Compare all four models with a bar chart. Does the ensemble beat the individual models?
- Variant: Re-run the pipeline without `StandardScaler`. How much does accuracy change for `LogisticRegression`? For `DecisionTreeClassifier`? Explain why trees are scale-invariant.
- Debug: The code below has a bug: `StandardScaler` is fit on the full dataset before the split, causing data leakage. Find and fix it.
- Conceptual: Which model worked best on the Wine dataset in your run? Give one reason why logistic regression might outperform a decision tree on this dataset.
- Recall: In 3 sentences, describe the full ML workflow you executed in this mini-project, from raw data to final model comparison.
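The Debug item refers to a snippet not reproduced on this page; the following sketch exhibits the described bug and its fix (variable names are illustrative):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)

# BUG: the scaler sees the full dataset, so test-set statistics (mean, std)
# leak into the training features.
X_leaky = StandardScaler().fit_transform(X)

# FIX: split first, then fit the scaler on X_train only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # statistics come from train only
X_test = scaler.transform(X_test)        # test is transformed, never fit

model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)
print(f"accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```

On a dataset this small the leak barely moves the score, but the habit matters: any statistic computed from the test set contaminates the evaluation.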