NumPy
Used in Preliminary: NumPy and throughout the curriculum for state/observation arrays, reward vectors, and batch operations. RL environments return observations as arrays and neural networks consume batches of arrays; NumPy is the standard bridge between them.

Why NumPy matters for RL

- Arrays: States and observations are vectors or matrices; rewards over time are 1D arrays. np.zeros(), np.array(), and np.arange() are used constantly.
- Indexing and slicing: Extract rows/columns, mask by condition, gather batches. Fancy indexing appears in replay buffers and minibatches.
- Broadcasting: Apply operations across shapes without writing loops (e.g. subtract the mean from a batch).
- Random: np.random for \(\epsilon\)-greedy, environment stochasticity, and reproducible seeds.
- Math: np.sum, np.mean, dot products, element-wise ops. No need for Python loops over elements.

Core concepts with examples

Creating arrays

    import numpy as np

    # Preallocate for states (e.g. the 4-element state vector for CartPole)
    state = np.zeros(4)
    state = np.array([0.1, -0.2, 0.05, 0.0])

    # Grid of values (e.g. for a value function over a 2D grid)
    grid = np.zeros((3, 3))
    grid[0] = [1, 2, 3]

    # Ranges and linspace
    steps = np.arange(0, 1000, 1)   # 0, 1, ..., 999
    x = np.linspace(0, 1, 11)       # 11 points from 0 to 1 (both endpoints included)

Shape, reshape, and batch dimension

    arr = np.array([[1, 2], [3, 4], [5, 6]])  # shape (3, 2)
    batch = arr.reshape(1, 3, 2)              # (1, 3, 2) for "1 sample"
    flat = arr.flatten()                      # (6,)

Indexing and slicing

    # Slicing: first two rows, all columns
    arr[:2, :]        # [[1, 2], [3, 4]]

    # Last row
    arr[-1, :]        # [5, 6]

    # Boolean mask: rows where first column > 2
    mask = arr[:, 0] > 2
    arr[mask]         # [[3, 4], [5, 6]]

    # Integer indexing: rows 0 and 2
    arr[[0, 2], :]    # [[1, 2], [5, 6]]

Broadcasting and element-wise ops

    # Subtract mean from each column
    X = np.random.randn(32, 4)        # 32 samples, 4 features
    X_centered = X - X.mean(axis=0)   # (32, 4) - (4,) broadcasts over rows

    # Element-wise product (e.g. importance weights)
    a = np.array([1.0, 2.0, 0.5])
    b = np.array([1.0, 1.0, 2.0])
    a * b  # array([1., 2., 1.])

Random and seeding

    np.random.seed(42)

    # Unit Gaussian (for bandit rewards, noise)
    samples = np.random.randn(10)

    # Uniform [0, 1)
    u = np.random.rand(5)

    # Random integers in [low, high)
    action = np.random.randint(0, 4)  # one of 0, 1, 2, 3

Useful reductions

    arr = np.array([[1, 2], [3, 4], [5, 6]])
    arr.sum()            # 21
    arr.sum(axis=0)      # [9, 12]
    arr.mean(axis=1)     # [1.5, 3.5, 5.5]
    np.max(arr, axis=0)  # [5, 6]

Worked examples

Example 1: Discounted return (Exercise 7). Given rewards = np.array([0.0, 0.0, 1.0]) and gamma = 0.9, compute \(G_0 = r_0 + \gamma r_1 + \gamma^2 r_2\) using NumPy. ...
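As a minimal sketch of how the return in Example 1 could be computed without a Python loop, the snippet below builds the vector of discount factors with np.arange and takes a dot product with the rewards. The helper name discounted_return and the variable discounts are illustrative choices, not names from the exercise, and this is one possible approach rather than the exercise's reference solution.

```python
import numpy as np


def discounted_return(rewards, gamma):
    """Return G_0 = sum_t gamma^t * r_t for a 1D reward array (hypothetical helper)."""
    rewards = np.asarray(rewards, dtype=float)
    # Discount factors gamma^0, gamma^1, ..., one per time step.
    discounts = gamma ** np.arange(len(rewards))
    # G_0 is the dot product of the discount factors and the rewards.
    return float(np.dot(discounts, rewards))


# Values taken from Example 1.
rewards = np.array([0.0, 0.0, 1.0])
gamma = 0.9

G0 = discounted_return(rewards, gamma)
print(G0)  # approximately 0.81, since only r_2 = 1 contributes: 0.9**2 * 1.0
```

Writing the sum as a dot product keeps the computation vectorized, which matches the "no Python loops over elements" point made earlier; with gamma = 1.0 the same helper reduces to a plain rewards.sum().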