Learning objectives
- Define features (X) and labels (y) and explain the role of each in supervised learning.
- Load and inspect a dataset using pandas: .head(), .describe(), .shape, .dtypes.
- Split a DataFrame into a feature matrix X and a label vector y.
Concept and real-world motivation
Every supervised learning problem has the same shape: a table of examples where each row is one observation and each column is one feature (an input measurement). One special column is the label — the thing we want to predict. The feature matrix is called X (capital, because it is a matrix) and the label vector is called y (lower-case, because it is a vector). When you train a model, you show it (X, y) pairs so it can learn the function \(f: X \to y\).
In practice, data arrives as CSV files, databases, or API responses. The pandas library gives us a DataFrame, a table with named columns, which is the standard container for tabular ML data in Python. Before touching any model, inspect the data: How many samples? How many features? Any missing values? What are the value ranges? This step is called exploratory data analysis (EDA), and skipping it is a common source of silent bugs in ML pipelines.
RL connection: In RL, the agent observes a state at each timestep. That state is a vector of numbers: position, velocity, sensor readings, pixel values. State observations are features. When we later approximate the value function as \(V(s) \approx w^T s\), we are treating each state like one row of the feature matrix X from supervised learning. The techniques you learn here for handling features carry over directly to state representations in RL.
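As a minimal sketch of the linear approximation \(V(s) \approx w^T s\), the snippet below treats one state as a feature vector and a batch of states as a matrix with one row per state. The state components and the weight values are hypothetical, chosen only for illustration; in practice the weights are learned.

```python
import numpy as np

# One state observation as a feature vector
# (hypothetical components: position, velocity).
s = np.array([0.5, -1.2])

# Hypothetical weight vector; in a real agent these values are learned.
w = np.array([2.0, 0.5])

# Linear value estimate for a single state: V(s) ~ w^T s
v = w @ s
print(v)

# A batch of states stacks into a matrix with one row per state,
# the same layout as a feature matrix X.
S = np.array([
    [0.5, -1.2],
    [0.0,  0.3],
    [1.0,  1.0],
])
V = S @ w  # one value estimate per row (state)
print(V.shape)  # (3,)
```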
Illustration: The diagram below shows how a DataFrame maps onto the supervised learning framework.
Here is what a small health dataset looks like in pandas — run this to inspect it:
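The original listing is not reproduced here, so the following is a stand-in sketch rather than the lesson's exact dataset. It assumes 5 samples, 3 features, and a binary label named healthy, matching the worked solution later in this section. Only the weight and healthy column names come from the lesson; the age and height columns and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical health dataset: 5 samples, 3 features, 1 label.
# 'age' and 'height' and every value below are illustrative assumptions.
df = pd.DataFrame({
    "age":     [25, 42, 31, 58, 36],        # years
    "height":  [175, 162, 180, 168, 171],   # cm
    "weight":  [70, 85, 78, 92, 64],        # kg
    "healthy": [1, 0, 1, 0, 1],             # label: 1 = healthy, 0 = not
})

print(df.head())          # first rows of the table
print(df.describe())      # summary statistics for each numeric column
print(df.shape)           # (5, 4): rows, columns (label included)
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # count of missing values per column
```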
Exercise: Using the same dataset, split it into features X and label y. Then explore X with .shape, .columns, and .values. Finally, check the label distribution with y.value_counts().
Professor’s hints
- To select multiple columns from a DataFrame: df[['col1', 'col2', 'col3']] (double brackets → DataFrame).
- To select a single column as a Series: df['col'] (single brackets → Series).
- X = df.drop('healthy', axis=1) is a clean way to get all columns except the label.
- y = df['healthy'] gets the label column as a 1-D Series.
- .shape returns (n_samples, n_features); always check this first to make sure the split is correct.
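The bracket rules in the hints can be checked on a toy table. This is a small sketch with placeholder column names and made-up values; it also shows that drop returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Toy table with placeholder column names and made-up values.
df = pd.DataFrame({
    "col1": [1, 2],
    "col2": [3, 4],
    "col3": [5, 6],
    "healthy": [1, 0],
})

subset = df[["col1", "col2"]]  # double brackets -> DataFrame
single = df["col1"]            # single brackets -> Series
print(type(subset).__name__)   # DataFrame
print(type(single).__name__)   # Series

X = df.drop("healthy", axis=1)  # returns a new DataFrame without the label
y = df["healthy"]               # 1-D Series

print(X.shape)   # (2, 3)
print(y.shape)   # (2,)
print(df.shape)  # (2, 4): the original df is unchanged by drop
```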
Common pitfalls
- Including the label in X: Always double-check that X does not contain the y column. If the model sees the answer during training, it will appear perfect but fail completely on new data.
- Forgetting to reset the index: After filtering rows, pandas row indices may not start at 0. Call .reset_index(drop=True) before training to avoid index-related errors.
- Treating all columns as features: Columns like “customer ID” or “timestamp” are identifiers, not features. Including them can cause the model to memorize IDs rather than learn patterns.
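The three pitfalls above can be guarded against in a few lines. The sketch below uses a hypothetical table with a customer_id column and invented values: it filters rows, resets the index, drops the identifier and the label from X, and asserts that the label did not leak into the features.

```python
import pandas as pd

# Hypothetical table; column names and values are illustrative only.
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "age":         [25, 42, 31, 58],
    "healthy":     [1, 0, 1, 0],
})

# Filtering keeps the original row labels, so the index no longer starts at 0.
older = df[df["age"] > 30]
print(older.index.tolist())  # [1, 2, 3]

# Renumber the rows 0..n-1 and discard the old index.
older = older.reset_index(drop=True)
print(older.index.tolist())  # [0, 1, 2]

# Drop the identifier and the label so neither ends up in the features.
X = older.drop(columns=["customer_id", "healthy"])
y = older["healthy"]

# Cheap guard against label leakage.
assert "healthy" not in X.columns
```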
Worked solution
Split X and y correctly, then explore:
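The original solution listing is not available here, so this is one possible sketch built from the hints above. It assumes a stand-in health dataset with 5 samples and 3 features; only the weight and healthy column names come from the lesson, while age, height, and all values are invented for illustration.

```python
import pandas as pd

# Stand-in dataset: 'age', 'height', and all values are illustrative assumptions.
df = pd.DataFrame({
    "age":     [25, 42, 31, 58, 36],
    "height":  [175, 162, 180, 168, 171],
    "weight":  [70, 85, 78, 92, 64],
    "healthy": [1, 0, 1, 0, 1],
})

# Features: every column except the label.
X = df.drop("healthy", axis=1)
# Label: the single column we want to predict.
y = df["healthy"]

print(X.shape)           # (5, 3)
print(list(X.columns))   # feature names
print(X.values)          # underlying NumPy array, one row per sample
print(y.shape)           # (5,)
print(y.value_counts())  # how many samples fall in each class
```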
Key takeaway: X is a (5, 3) matrix — 5 samples, 3 features. y is a length-5 vector.
Extra practice
- Warm-up: A dataset has columns: [user_id, age, income, city, clicked_ad]. Which columns should go in X? Which in y? Which should you drop entirely and why?
- Coding: Create a new DataFrame with 8 samples and 4 features of your choice. Add a binary label column. Split into X and y, then print X.describe() to see feature statistics.
- Challenge: Some datasets have missing values (NaN). Add a None value to the weight column in the example dataset. Then use df.isnull().sum() to detect it, and df.fillna(df.mean()) to fill it. How does this affect .describe()?
- Variant: Use df.dtypes to see the data type of each column. What happens if the ‘healthy’ column is stored as a string 'yes'/'no' instead of 1/0? How would you convert it?
- Debug: The code below swaps X and y. Find and fix the bug.
- Conceptual: In a supervised learning problem, what is the difference between a sample and a feature? Give an example where confusing the two would cause a training error.
- Recall: State the convention for naming the feature matrix and label vector in Python ML (what letters are used and why they are capitalized the way they are).