Learning objectives
- Describe how a decision tree splits data using if/else thresholds on features.
- Compute entropy \(H(S)\) for a dataset with binary labels.
- Calculate information gain for a proposed split and choose the best splitting threshold.
Concept and real-world motivation
A decision tree answers the question “which class does this belong to?” through a sequence of if/else questions. At each internal node, the tree picks a feature and a threshold (e.g. “is feature1 > 3?”). Points that satisfy the condition go left; the rest go right. This repeats until reaching leaf nodes that assign class labels. The tree learns by choosing splits that maximize information gain — the reduction in uncertainty (entropy) after the split.
Decision trees are appealing because they are interpretable: you can trace exactly why a prediction was made. They also handle non-linear boundaries naturally. In RL, decision trees appear as interpretable policy representations: instead of a black-box neural network, an agent’s policy can be a tree — “if speed > 5 and obstacle_ahead, then brake.” This allows human inspection of learned behaviors.
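As a quick illustration of this interpretability (assuming scikit-learn is available; the tiny 1-D dataset here is made up for the demo):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy 1-D dataset: class 0 for small feature values, class 1 for large ones.
X = [[1], [2], [3], [4], [5], [6]]
y = [0, 0, 0, 1, 1, 1]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text prints the learned if/else rules, one threshold test per node,
# so you can trace exactly why any prediction was made.
print(export_text(clf, feature_names=["feature1"]))
```

The printed rules read like the "if speed > 5 and obstacle_ahead, then brake" policies mentioned above.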
Illustration: A two-level decision tree splitting on feature1, then feature2.
Exercise: For a dataset with feature x = [1,2,3,4,5,6] and labels = [0,0,0,1,1,1]:
- Compute entropy of the full dataset.
- Try a split at x = 3.5: compute the entropy of each side.
- Compute information gain for this split.
Professor’s hints
- Entropy formula: \(H(S) = -\sum_c p_c \log_2 p_c\). Use `np.bincount(y)` to count each class, then divide by `len(y)` for probabilities.
- A pure node (all one class) has entropy 0; a perfectly mixed node (50/50 binary) has entropy 1 bit. These are the extremes.
- Information gain = \(H(\text{parent}) - \frac{|L|}{N} H(L) - \frac{|R|}{N} H(R)\). Maximum gain means the split perfectly separates classes.
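The hints above can be sketched in a few lines of NumPy; the helper name `entropy` is my own choice:

```python
import numpy as np

def entropy(y):
    """Shannon entropy in bits of an integer label array."""
    counts = np.bincount(np.asarray(y))
    p = counts[counts > 0] / len(y)   # drop zero-count classes
    return -np.sum(p * np.log2(p))

h_pure = entropy([0, 0, 0, 0])   # pure node: 0 bits
h_mixed = entropy([0, 0, 1, 1])  # 50/50 mix: 1 bit
```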
Common pitfalls
- Using `log` instead of `log2`: Entropy in bits uses \(\log_2\). The natural log gives entropy in nats, which is numerically different. Both are valid measures, but be consistent.
- Zero-probability terms: If a class is absent from a split, \(p = 0\), and \(0 \times \log_2(0)\) is mathematically undefined but taken as 0 by convention. Filter out zero-count classes with `counts[counts > 0]`.
- Overfitting to full depth: A tree grown until every leaf is pure will memorize the training data. Always use `max_depth` or `min_samples_leaf` to regularize.
Worked solution
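The worked solution itself is not reproduced above; a compact NumPy walk-through of the exercise, with my own variable names, might look like:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([0, 0, 0, 1, 1, 1])

def entropy(labels):
    counts = np.bincount(labels)
    p = counts[counts > 0] / len(labels)
    return -np.sum(p * np.log2(p))

# Step 1: parent entropy. 3 zeros and 3 ones -> H = 1 bit.
h_parent = entropy(y)

# Step 2: split at x = 3.5. Both sides are pure, so each child entropy is 0.
left, right = y[x <= 3.5], y[x > 3.5]
h_left, h_right = entropy(left), entropy(right)

# Step 3: information gain = H(parent) - weighted child entropies.
gain = h_parent - len(left) / len(y) * h_left - len(right) / len(y) * h_right
print(gain)  # 1.0 (the split perfectly separates the classes)
```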
Extra practice
- Warm-up: Compute entropy by hand for a dataset with 4 positive and 2 negative examples. \(H = -\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6}\).
- Coding: Write a function `best_split(X, y)` that tries all possible thresholds (midpoints between consecutive sorted values) and returns the feature and threshold with maximum information gain.
- Challenge: Implement a full decision tree (just depth 2) from scratch: at the root, find the best split; at each child, find the best split again; leaves predict the majority class. Test it on the 6-sample dataset from the exercise.
- Variant: Change the labels to `[0,0,1,0,1,1]` (not cleanly separable at x = 3.5). Recompute information gain for the same split. How does it compare to the 1.0-bit gain in the main exercise?
- Debug: The code below has a bug: it uses `np.log` (natural log) instead of `np.log2`, making the entropy values numerically wrong. Find and fix it.
- Conceptual: Why is a decision tree prone to overfitting when grown to full depth? Name two hyperparameters you can set in sklearn's `DecisionTreeClassifier` to control tree size.
- Recall: From memory, write the entropy formula \(H(S)\) and the information gain formula in terms of parent and child entropies.
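The buggy snippet referenced by the Debug item does not appear above; a minimal version consistent with its description (my reconstruction, not the original) is:

```python
import numpy as np

def entropy_buggy(y):
    counts = np.bincount(np.asarray(y))
    p = counts[counts > 0] / len(y)
    # BUG: np.log is the natural log, so this returns nats, not bits.
    # Fix: replace np.log with np.log2.
    return -np.sum(p * np.log(p))

print(entropy_buggy([0, 0, 1, 1]))  # ~0.693 (ln 2); the correct value in bits is 1.0
```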