Learning objectives

  • Implement 2D convolution from scratch in NumPy
  • Implement 2×2 max pooling
  • Explain how CNNs extract spatial features from images
  • Describe how Atari DQN uses a 3-layer CNN to process raw pixel observations

Concept and real-world motivation

A Convolutional Neural Network (CNN) learns spatial features from images using filters (small weight matrices). Each filter slides across the image and produces a feature map — a 2D response showing where that pattern appears. Early filters detect edges; deeper filters combine edges into textures, shapes, and objects.

Pooling reduces the spatial size of feature maps, making the network less sensitive to small shifts in position (translation invariance) and reducing computation.

In RL: Atari DQN (Mnih et al., 2015) uses 3 convolutional layers to process stacked 84×84 grayscale frames before the fully-connected Q-value head. The CNN is the state encoder — it transforms raw pixels into a compact vector that the Q-network can reason about. Without the CNN, the Q-network would need to process 84×84×4 = 28,224 input features directly.

Math:

2D convolution (valid padding): \((I * K)[i,j] = \sum_m \sum_n I[i+m, j+n] \cdot K[m,n]\). (Strictly, this is cross-correlation — the kernel is not flipped — but it is the convention deep-learning libraries implement and call "convolution".)

For an \(H \times W\) input and \(k \times k\) kernel, the output size is \((H-k+1) \times (W-k+1)\).

Max pooling: divide the feature map into non-overlapping regions; take the maximum value from each region.
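The two operations above can be sketched directly in NumPy. This is a naive loop implementation for clarity, not an optimized one; the function names `conv2d_valid` and `max_pool2x2` follow the exercise below:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid 2D convolution (cross-correlation convention): output is (H-k+1, W-k+1)."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            # Elementwise product of the k×k window with the kernel, summed
            out[i, j] = np.sum(image[i:i+k, j:j+k] * kernel)
    return out

def max_pool2x2(fmap):
    """Non-overlapping 2×2 max pooling; a trailing odd row/column is dropped."""
    H, W = fmap.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(H // 2):
        for j in range(W // 2):
            out[i, j] = fmap[2*i:2*i+2, 2*j:2*j+2].max()
    return out

img = np.arange(25, dtype=float).reshape(5, 5)
lap = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float)
fmap = conv2d_valid(img, lap)   # shape (3, 3); zero everywhere on this linear ramp
pooled = max_pool2x2(fmap)      # shape (1, 1)
```

On `arange(25)` the image is a linear ramp, so this edge-detection kernel responds with 0 at every position — there are no edges to find.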

Illustration — Atari DQN CNN architecture:

Input Image (84×84×4 stacked frames)
Conv Layer 1 (32 filters, 8×8, stride 4) → ReLU
Conv Layer 2 (64 filters, 4×4, stride 2) → ReLU
Conv Layer 3 (64 filters, 3×3, stride 1) → ReLU
Flatten → 3136 features
FC Layer (512 units) → ReLU
Q-values (one per action)
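The layer sizes above can be verified with the valid-convolution size formula, extended with stride: output size is (H - k) // stride + 1 per spatial dimension.

```python
def conv_out(size, k, stride):
    # Valid convolution with stride: (size - k) // stride + 1
    return (size - k) // stride + 1

s = 84
s = conv_out(s, 8, 4)   # conv1: 20×20
s = conv_out(s, 4, 2)   # conv2: 9×9
s = conv_out(s, 3, 1)   # conv3: 7×7
flat = s * s * 64       # 7 * 7 * 64 = 3136 flattened features
```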

Exercise: Implement 2D convolution and max pooling from scratch in NumPy.


Professor’s hints

  • The valid convolution output size is always smaller than the input. Use same padding (zero-pad the input) if you want the output to have the same size.
  • The Laplacian filter [[-1,-1,-1],[-1,8,-1],[-1,-1,-1]] detects regions of rapid change (edges). Positive response means the center is brighter than its surroundings.
  • In practice, CNNs learn the filter values by backpropagation — we don’t hand-design them.
  • Each filter produces one channel in the output. Stacking many filters = many channels = richer representations.
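The first hint can be sketched with `np.pad` — a minimal "same" convolution, assuming an odd square kernel (the name `conv2d_same` is illustrative):

```python
import numpy as np

def conv2d_same(image, kernel):
    """'Same' convolution via zero padding; assumes an odd k×k kernel."""
    k = kernel.shape[0]
    p = k // 2                  # pad width that preserves the spatial size
    padded = np.pad(image, p)   # zero-pad p rows/cols on every side
    H, W = image.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i+k, j:j+k] * kernel)
    return out
```

Note the border effect: near the edges the window overlaps the zero padding, so responses there are computed from fewer real pixels.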

Common pitfalls

  • Off-by-one errors in the output size: remember it’s H - k + 1 for valid conv.
  • Confusing “stride” (step size when sliding the filter) with “dilation” (spacing between filter elements).
  • Applying max pooling before the activation function. The convention is activation first. (For ReLU with max pooling the two orders happen to give identical results, since ReLU is monotonic — but the ordering matters for other activations and pooling types.)

Worked solution

Convolution by hand on the center patch of the 5×5 image:

  • Patch: rows 1-3, cols 1-3 of arange(25).reshape(5,5) = [[6,7,8],[11,12,13],[16,17,18]]
  • Laplacian response: 8×12 - (6+7+8+11+13+16+17+18) = 96 - 96 = 0 (center is average of surroundings → no edge)
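The hand computation can be checked in NumPy:

```python
import numpy as np

img = np.arange(25).reshape(5, 5)
K = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]])
patch = img[1:4, 1:4]              # [[6,7,8],[11,12,13],[16,17,18]]
response = int(np.sum(patch * K))  # 8*12 - 96 = 0
```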

For 2×2 max pooling on the 3×3 feature map, only the top-left 2×2 region forms a complete window (3 // 2 = 1 window per dimension); the last row and column are dropped.

Extra practice

  1. Warm-up: Apply the edge-detection kernel to one 3×3 patch by hand. Use the top-left 3×3 region of the 5×5 image (rows 0-2, cols 0-2).


  2. Coding: Extend conv2d_valid to support a configurable stride. For stride=2, output size is (H - k) // stride + 1. This is what Atari DQN uses in its first two conv layers.
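One possible sketch of the strided extension (a spoiler for the exercise — the name `conv2d_strided` is illustrative):

```python
import numpy as np

def conv2d_strided(image, kernel, stride=1):
    """Valid convolution with a configurable stride."""
    H, W = image.shape
    k = kernel.shape[0]
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride   # top-left corner of this window
            out[i, j] = np.sum(image[r:r+k, c:c+k] * kernel)
    return out
```

With an 84×84 input and an 8×8 kernel at stride 4 (DQN's first layer), the output is 20×20.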

  3. Challenge: Implement a multi-channel convolution: input has 3 channels (like RGB), filter has shape (3, k, k), and the output is the sum of convolutions across input channels. This is how real CNNs process color images.
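A sketch of the multi-channel case, assuming a channels-first (C, H, W) layout:

```python
import numpy as np

def conv2d_multichannel(image, filt):
    """image: (C, H, W); filt: (C, k, k). Output: per-channel valid convs, summed."""
    C, H, W = image.shape
    k = filt.shape[-1]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            # Sum over all channels and the k×k window in one step
            out[i, j] = np.sum(image[:, i:i+k, j:j+k] * filt)
    return out
```

A real conv layer holds many such filters; each one yields one output channel.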

  4. Variant: Implement average pooling and compare to max pooling on the edge-detection feature map above. When might average pooling be preferable?

  5. Debug: Fix a convolution whose loop bounds are off by one — e.g. looping for i in range(H - k) instead of range(H - k + 1) drops the last row and column of the output. Check that the fixed version returns shape (H - k + 1, W - k + 1).

  6. Notebook: For a full PyTorch CNN implementation, use the local notebook: CNN in PyTorch (run locally).

  7. Recall: In your own words: (a) What does a convolutional filter detect? (b) Why does DQN need a CNN instead of a plain MLP for Atari? (c) What is the effect of increasing the filter size (e.g. 3×3 → 5×5)?