Learning objectives

  • Explain why K-Means is unsupervised and when it is appropriate.
  • Execute the assignment and update steps of K-Means by hand on a small dataset.
  • Identify convergence and understand the role of initialization.

Concept and real-world motivation

All the algorithms so far required labels. Unsupervised learning finds structure in data with no labels at all. K-Means groups unlabeled points into K clusters by alternating between two steps: (1) assignment — assign each point to the nearest centroid; (2) update — move each centroid to the mean of its assigned points. Repeating until assignments stop changing guarantees convergence (though not necessarily to the global optimum).
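The two steps above can be sketched in a few lines of NumPy. This is a minimal, hypothetical example (four made-up 2D points and two initial centroids, not the exercise data) showing one assignment step and one update step:

```python
import numpy as np

# Hypothetical tiny dataset: four 2D points and two initial centroids.
points    = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
centroids = np.array([[0., 0.], [5., 5.]])

# Assignment step: each point goes to its nearest centroid.
dists       = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
assignments = np.argmin(dists, axis=1)

# Update step: each centroid moves to the mean of its assigned points.
centroids = np.array([points[assignments == k].mean(axis=0) for k in range(2)])

print(assignments)  # [0 0 1 1]
print(centroids)    # [[0.  0.5] [5.  5.5]]
```

One full K-Means run just repeats these two steps until `assignments` stops changing between iterations.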

K-Means has countless applications: customer segmentation, image compression (quantize pixel colors), document clustering, and anomaly detection. In RL, we sometimes cluster similar states together to build a smaller abstract state space — reducing the problem size before training a policy. This is a form of state abstraction (tile coding is a related but distinct technique that discretizes continuous state spaces with fixed overlapping grids rather than learned clusters).

Illustration: Final cluster sizes after K-Means converges on a 3-cluster dataset.

Exercise: Run 3 iterations of K-Means by hand on 6 points in 2D.

Try it — edit and run (Shift+Enter)

Professor’s hints

  • np.linalg.norm(centroids - p, axis=1) computes the distance from point p to each of the K centroids in one line: broadcasting subtracts p from every centroid row, and axis=1 takes the norm across the 2 feature dimensions.
  • The update step is simply the mean: points[mask].mean(axis=0). If a cluster is empty (no points assigned), keep the old centroid or re-initialize — never divide by zero.
  • Convergence: when np.array_equal(assignments, prev_assignments). After iteration 2, the assignments in this example should be stable.
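The hints above can be checked directly. This sketch uses the exercise's initial centroids and one sample point (the convergence arrays are made up for illustration):

```python
import numpy as np

centroids = np.array([[1., 1.], [8., 8.], [5., 5.]])
p = np.array([2., 1.])

# Distance from p to each of the K centroids in one call;
# axis=1 takes the norm across the two feature dimensions.
d = np.linalg.norm(centroids - p, axis=1)
print(d)             # [1.         9.21954446 5.        ]
print(np.argmin(d))  # 0 -> p is closest to the first centroid

# Convergence check between iterations (illustrative arrays):
prev = np.array([0, 0, 0, 1, 1, 1])
curr = np.array([0, 0, 0, 1, 1, 1])
print(np.array_equal(prev, curr))  # True -> stop iterating
```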

Common pitfalls

  • Dividing by zero in update: If a cluster gets no assigned points, mean([]) = NaN. Always check mask.sum() > 0.
  • Sensitivity to initialization: K-Means can converge to different local optima depending on initial centroids. Run multiple times with different seeds (sklearn’s n_init parameter) and keep the best result (lowest inertia).
  • Choosing K: K is a hyperparameter. Use the elbow method — plot inertia vs K — and look for the “elbow” where adding more clusters gives diminishing returns.
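Inertia, the quantity plotted in the elbow method, is easy to compute once a run has finished. A minimal sketch, assuming the 6-point dataset from the exercise has converged with K=2 (the assignments and centroids below are the converged values for that split):

```python
import numpy as np

points      = np.array([[1., 1.], [1., 2.], [2., 1.], [8., 8.], [8., 9.], [9., 8.]])
assignments = np.array([0, 0, 0, 1, 1, 1])
centroids   = np.array([[4/3, 4/3], [25/3, 25/3]])  # cluster means

# Inertia: sum of squared distances from each point to its assigned centroid.
inertia = np.sum((points - centroids[assignments]) ** 2)
print(inertia)  # 2.666... (= 8/3)
```

Computing this for each K in the elbow plot lets you see where adding another cluster stops paying off.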
Worked solution
import numpy as np

# Two tight groups near (1,1) and (8,8); the third centroid starts between them.
points    = np.array([[1,1],[1,2],[2,1],[8,8],[8,9],[9,8]], dtype=float)
centroids = np.array([[1,1],[8,8],[5,5]], dtype=float)

for it in range(3):
    # Assignment step: distance from every point to every centroid (6 x 3).
    dists       = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    assignments = np.argmin(dists, axis=1)
    # Update step: move each centroid to the mean of its assigned points,
    # keeping the old centroid if a cluster ends up empty.
    new_c = np.zeros_like(centroids)
    for k in range(3):
        mask = assignments == k
        new_c[k] = points[mask].mean(axis=0) if mask.sum() > 0 else centroids[k]
    print(f'it{it+1}: assign={assignments}')
    centroids = new_c

Extra practice

  1. Warm-up: Plot the 6 points from the main exercise colored by their cluster assignment after the final iteration. Use a simple ASCII-style print or matplotlib scatter.
Try it — edit and run (Shift+Enter)
  2. Coding: Wrap K-Means into a function kmeans(points, K, n_iter=10) that returns final assignments and centroids. Test it with K=2 and K=3 on the 6-point dataset.
  3. Challenge: Generate 100 2D points from 3 Gaussian clusters using np.random.randn. Run K-Means with K=2, K=3, K=4. Compute inertia (the sum of squared distances from each point to its assigned centroid) for each K. Plot inertia vs K and identify the elbow.
  4. Variant: Try 3 different initial centroid choices on the 6-point dataset: (a) the ones in the exercise, (b) three random points, (c) the first three data points. Do all initializations converge to the same clusters?
  5. Debug: The code below has a bug — the centroid update divides by the total number of points instead of the cluster size. Find and fix it.
Try it — edit and run (Shift+Enter)
  6. Conceptual: K-Means minimizes the within-cluster sum of squared distances (inertia). Is this guaranteed to find the global minimum? Why or why not?
  7. Recall: From memory, write the two steps of the K-Means algorithm (assignment and update) and state what “convergence” means.