Learning objectives
- Describe how a decision tree splits data using if/else thresholds on features.
- Compute entropy \(H(S)\) for a dataset with binary labels.
- Calculate information gain for a proposed split and choose the best splitting threshold.
Concept and real-world motivation
A decision tree answers the question “which class does this belong to?” through a sequence of if/else questions. At each internal node, the tree picks a feature and a threshold (e.g. “is feature1 > 3?”). Points that satisfy the condition go left; the rest go right. This repeats until reaching leaf nodes that assign class labels. The tree learns by choosing splits that maximize information gain — the reduction in uncertainty (entropy) after the split.
Decision trees are appealing because they are interpretable: you can trace exactly why a prediction was made. They also handle non-linear boundaries naturally. In RL, decision trees appear as interpretable policy representations: instead of a black-box neural network, an agent’s policy can be a tree — “if speed > 5 and obstacle_ahead, then brake.” This allows human inspection of learned behaviors.
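As a quick illustration of this interpretability (assuming scikit-learn is available; the tiny 1-D dataset here is made up for the demo):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy 1-D dataset: class 0 for small feature values, class 1 for large ones.
X = [[1], [2], [3], [4], [5], [6]]
y = [0, 0, 0, 1, 1, 1]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text prints the learned if/else rules, one threshold test per node,
# so you can trace exactly why any prediction was made.
print(export_text(clf, feature_names=["feature1"]))
```

The printed rules read like the "if speed > 5 and obstacle_ahead, then brake" policies mentioned above.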
Illustration: A two-level decision tree splitting on feature1, then feature2.
Exercise: For a dataset with feature x = [1,2,3,4,5,6] and labels = [0,0,0,1,1,1]:
- Compute entropy of the full dataset.
- Try a split at x = 3.5: compute the entropy of each side.
- Compute information gain for this split.
Professor’s hints
- Entropy formula: \(H(S) = -\sum_c p_c \log_2 p_c\). Use `np.bincount(y)` to count each class, then divide by `len(y)` for probabilities.
- A pure node (all one class) has entropy 0; a perfectly mixed node (50/50 binary) has entropy 1 bit. These are the extremes.
- Information gain = \(H(\text{parent}) - \frac{|L|}{N} H(L) - \frac{|R|}{N} H(R)\). Maximum gain means the split perfectly separates classes.
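The hints above can be sketched in a few lines of NumPy; the helper name `entropy` is my own choice:

```python
import numpy as np

def entropy(y):
    """Shannon entropy in bits of an integer label array."""
    counts = np.bincount(np.asarray(y))
    p = counts[counts > 0] / len(y)   # drop zero-count classes
    return -np.sum(p * np.log2(p))

h_pure = entropy([0, 0, 0, 0])   # pure node: 0 bits
h_mixed = entropy([0, 0, 1, 1])  # 50/50 mix: 1 bit
```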
Common pitfalls
- Using `log` instead of `log2`: Entropy in bits uses \(\log_2\). The natural log gives entropy in nats, which is numerically different. Both are valid measures, but be consistent.
- Zero-probability terms: If a class is absent from a split, \(p = 0\), and \(0 \times \log_2(0)\) is mathematically undefined but taken as 0 by convention. Filter out zero-count classes with `counts[counts > 0]`.
- Overfitting to full depth: A tree grown until every leaf is pure will memorize the training data. Always use `max_depth` or `min_samples_leaf` to regularize.
Worked solution
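The worked solution itself is not reproduced above; a compact NumPy walk-through of the exercise, with my own variable names, might look like:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([0, 0, 0, 1, 1, 1])

def entropy(labels):
    counts = np.bincount(labels)
    p = counts[counts > 0] / len(labels)
    return -np.sum(p * np.log2(p))

# Step 1: parent entropy. 3 zeros and 3 ones -> H = 1 bit.
h_parent = entropy(y)

# Step 2: split at x = 3.5. Both sides are pure, so each child entropy is 0.
left, right = y[x <= 3.5], y[x > 3.5]
h_left, h_right = entropy(left), entropy(right)

# Step 3: information gain = H(parent) - weighted child entropies.
gain = h_parent - len(left) / len(y) * h_left - len(right) / len(y) * h_right
print(gain)  # 1.0 (the split perfectly separates the classes)
```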
Extra practice
- Warm-up: Compute entropy by hand for a dataset with 4 positive and 2 negative examples. \(H = -\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6}\).
- Coding: Write a function `best_split(X, y)` that tries all possible thresholds (midpoints between consecutive sorted values) and returns the feature and threshold with maximum information gain.
- Challenge: Implement a full decision tree (just depth 2) from scratch: at the root, find the best split; at each child, find the best split again; leaves predict the majority class. Test it on the 6-sample dataset from the exercise.
- Variant: Change the labels to `[0,0,1,0,1,1]` (not cleanly separable at x = 3.5). Recompute information gain for the same split. How does it compare to the 1.0-bit gain in the main exercise?
- Debug: The code below has a bug: it uses `np.log` (natural log) instead of `np.log2`, making the entropy values numerically wrong. Find and fix it.
- Conceptual: Why is a decision tree prone to overfitting when grown to full depth? Name two hyperparameters you can set in sklearn's `DecisionTreeClassifier` to control tree size.
- Recall: From memory, write the entropy formula \(H(S)\) and the information gain formula in terms of parent and child entropies.
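The buggy snippet referenced by the Debug item does not appear above; a minimal version consistent with its description (my reconstruction, not the original) is:

```python
import numpy as np

def entropy_buggy(y):
    counts = np.bincount(np.asarray(y))
    p = counts[counts > 0] / len(y)
    # BUG: np.log is the natural log, so this returns nats, not bits.
    # Fix: replace np.log with np.log2.
    return -np.sum(p * np.log(p))

print(entropy_buggy([0, 0, 1, 1]))  # ~0.693 (ln 2); the correct value in bits is 1.0
```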