NumPy

Used in Preliminary: NumPy and throughout the curriculum for state/observation arrays, reward vectors, and batch operations. RL environments return observations as arrays, and neural networks consume batches of arrays, so NumPy is the standard bridge.

Why NumPy matters for RL

- Arrays: states and observations are vectors or matrices; rewards over time are 1D arrays. np.zeros(), np.array(), and np.arange() are used constantly.
- Indexing and slicing: extract rows/columns, mask by condition, gather batches. Fancy indexing appears in replay buffers and minibatches.
- Broadcasting: apply operations across shapes without writing loops (e.g. subtract the mean from a batch).
- Random: np.random for \(\epsilon\)-greedy exploration, environment stochasticity, and reproducible seeds.
- Math: np.sum, np.mean, dot products, element-wise ops. No need for Python loops over elements.

Core concepts with examples

Creating arrays

```python
import numpy as np

# Preallocate for states (e.g. 4D state for CartPole)
state = np.zeros(4)
state = np.array([0.1, -0.2, 0.05, 0.0])

# Grid of values (e.g. for a value function over a 2D grid)
grid = np.zeros((3, 3))
grid[0] = [1, 2, 3]

# Ranges and linspace
steps = np.arange(0, 1000, 1)  # 0, 1, ..., 999
x = np.linspace(0, 1, 11)      # 11 points from 0 to 1
```

Shape, reshape, and the batch dimension

```python
arr = np.array([[1, 2], [3, 4], [5, 6]])  # shape (3, 2)
batch = arr.reshape(1, 3, 2)              # (1, 3, 2) for "1 sample"
flat = arr.flatten()                      # (6,)
```

Indexing and slicing

```python
# Slicing: first two rows, all columns
arr[:2, :]

# Last row
arr[-1, :]

# Boolean mask: rows where the first column > 2
mask = arr[:, 0] > 2
arr[mask]

# Integer indexing: rows 0 and 2
arr[[0, 2], :]
```

Broadcasting and element-wise ops

```python
# Subtract the mean from each column
X = np.random.randn(32, 4)       # 32 samples, 4 features
X_centered = X - X.mean(axis=0)

# Element-wise product (e.g. importance weights)
a = np.array([1.0, 2.0, 0.5])
b = np.array([1.0, 1.0, 2.0])
a * b  # array([1., 2., 1.])
```

Random and seeding

```python
np.random.seed(42)

# Unit Gaussian (for bandit rewards, noise)
samples = np.random.randn(10)

# Uniform [0, 1)
u = np.random.rand(5)

# Random integers in [low, high)
action = np.random.randint(0, 4)  # one of 0, 1, 2, 3
```

Useful reductions

```python
arr = np.array([[1, 2], [3, 4], [5, 6]])
arr.sum()            # 21
arr.sum(axis=0)      # [9, 12]
arr.mean(axis=1)     # [1.5, 3.5, 5.5]
np.max(arr, axis=0)  # [5, 6]
```

Worked examples

Example 1: Discounted return (Exercise 7). Given rewards = np.array([0.0, 0.0, 1.0]) and gamma = 0.9, compute \(G_0 = r_0 + \gamma r_1 + \gamma^2 r_2\) using NumPy. ...
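The discounted return in the worked example can be sketched as follows (a minimal sketch; the vectorized form with np.arange generalizes to any horizon):

```python
import numpy as np

rewards = np.array([0.0, 0.0, 1.0])
gamma = 0.9

# Discount factors gamma^0, gamma^1, gamma^2 via broadcasting
discounts = gamma ** np.arange(len(rewards))
G0 = np.sum(discounts * rewards)
print(G0)  # 0.81, i.e. 0.9**2 * 1.0
```

The dot product np.dot(discounts, rewards) computes the same quantity in one call.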

March 10, 2026 · 6 min · 1184 words · codefrydev

PyTorch

Used in Preliminary: PyTorch basics and in the curriculum for DQN, policy gradients, actor-critic, PPO, and SAC. PyTorch's define-by-run style and clear autograd make it a natural fit for custom RL loss functions.

Why PyTorch matters for RL

- Tensors: states, actions, and batches are tensors. torch.tensor(), requires_grad=True, and .to(device) are daily use.
- Autograd: policy gradient and value losses need gradients; backward() and .grad are central.
- nn.Module: Q-networks, policy networks, and critics are nn.Module subclasses; their parameters are collected for optimizers.
- Optimizers: torch.optim.Adam, zero_grad(), loss.backward(), optimizer.step().
- Device: move the model and data to the GPU with .to(device) for faster training.

Core concepts with examples

Tensors and gradients

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x**2
y.backward()
print(x.grad)  # 4.0
```

Batches and shapes

```python
# Batch of 32 states, 4 features (e.g. CartPole)
states = torch.randn(32, 4)

# Linear layer: 4 -> 64
W = torch.randn(4, 64, requires_grad=True)
out = states @ W  # (32, 64)
```

Simple MLP with nn.Module

```python
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

q = QNetwork()
s = torch.randn(8, 4)  # batch of 8
q_vals = q(s)          # (8, 2)
```

Training step (e.g. MSE loss)

```python
optimizer = torch.optim.Adam(q.parameters(), lr=1e-3)
targets = torch.randn(8, 2)

loss = nn.functional.mse_loss(q_vals, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Device and CPU/GPU

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
q = q.to(device)
states = states.to(device)
```

Worked examples

Example 1: Autograd (Exercise 1). Create \(x = 3.0\) with requires_grad=True, compute \(y = x^3 + 2x\), call y.backward(), and verify x.grad. ...
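A minimal sketch of that autograd exercise; the expected gradient follows from \(dy/dx = 3x^2 + 2\), which is 29 at \(x = 3\):

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x**3 + 2 * x
y.backward()

# Analytic check: dy/dx = 3*x^2 + 2 = 29 at x = 3
print(x.grad)  # tensor(29.)
```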

March 10, 2026 · 5 min · 1052 words · codefrydev

TensorFlow

Alternative to PyTorch for implementing DQN, policy gradients, and other deep RL algorithms. The Keras API provides layers and optimizers; GradientTape gives full control over custom loss functions (e.g. policy gradient, CQL).

Why TensorFlow matters for RL

- Keras API: tf.keras.Sequential, tf.keras.Model, and layers (Dense, Conv2D) allow quick prototyping of Q-networks and policies.
- Gradient tape: tf.GradientTape() records operations so you can compute gradients of any scalar with respect to trainable variables. Essential for policy gradient and custom losses.
- Optimizers: tf.keras.optimizers.Adam, apply_gradients.
- Device placement: GPU via tf.config when available.

Core concepts with examples

Dense layers and a Sequential model

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2),  # Q-values for 2 actions
])
model.build(input_shape=(None, 4))
```

Forward pass and MSE loss

```python
states = tf.random.normal((32, 4))
q_values = model(states)
targets = tf.random.normal((32, 2))
loss = tf.reduce_mean((q_values - targets) ** 2)
```

Training step with GradientTape

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

@tf.function
def train_step(states, targets):
    with tf.GradientTape() as tape:
        q_values = model(states)
        loss = tf.reduce_mean((q_values - targets) ** 2)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

loss_val = train_step(states, targets)
```

Subclassing for custom models

```python
class QNetwork(tf.keras.Model):
    def __init__(self, n_actions=2):
        super().__init__()
        self.d1 = tf.keras.layers.Dense(64, activation="relu")
        self.d2 = tf.keras.layers.Dense(64, activation="relu")
        self.out = tf.keras.layers.Dense(n_actions)

    def call(self, x):
        x = self.d1(x)
        x = self.d2(x)
        return self.out(x)
```

Exercises

Exercise 1. Create a Sequential model with one hidden layer (64 units, ReLU) and output dimension 2. Build it with input_shape=(4,). Call model(tf.random.normal((10, 4))) and print the output shape. Then use model.summary() to inspect parameters. ...
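A minimal sketch of a solution, assuming the architecture stated in the exercise:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),  # hidden layer
    tf.keras.layers.Dense(2),                      # output dimension 2
])
model.build(input_shape=(None, 4))

out = model(tf.random.normal((10, 4)))
print(out.shape)  # (10, 2)
model.summary()   # 4*64+64 = 320 plus 64*2+2 = 130 parameters
```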

March 10, 2026 · 4 min · 782 words · codefrydev

Chapter 94: RL in Recommender Systems

Learning objectives

- Build a toy recommender: 100 items and a user model with changing preferences (e.g. a latent state that drifts or has context-dependent taste).
- Define the state (e.g. user history, current context), the action (which item to show), and the reward (e.g. click, watch time, or an engagement score).
- Train an agent with a policy gradient method (e.g. REINFORCE or PPO) to maximize long-term engagement (e.g. cumulative clicks or cumulative reward over a session).
- Compare with a baseline (e.g. random, or greedy with respect to the current preference) and report engagement over episodes.
- Relate the formulation to the recommendation anchor (state = user context, action = item, return = long-term satisfaction).

Concept and real-world RL ...
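One way the toy setup above might be sketched in NumPy; the logistic click model, the drift scheme, and all names here are illustrative assumptions, not the chapter's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

N_ITEMS, ITEM_DIM = 100, 8
item_features = rng.normal(size=(N_ITEMS, ITEM_DIM))  # fixed item embeddings

def step(user_pref, action, drift=0.01):
    """Show item `action`; return (reward, next_user_pref)."""
    score = item_features[action] @ user_pref
    p_click = 1.0 / (1.0 + np.exp(-score))      # logistic click probability
    reward = float(rng.random() < p_click)      # 1 if clicked, else 0
    next_pref = user_pref + drift * rng.normal(size=ITEM_DIM)  # taste drifts
    return reward, next_pref

user_pref = rng.normal(size=ITEM_DIM)  # latent state, hidden from the agent
reward, user_pref = step(user_pref, action=3)
```

A policy gradient agent would observe only a context (e.g. recent clicks), pick an action over the 100 items, and be trained on the cumulative click reward per session.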

March 10, 2026 · 4 min · 698 words · codefrydev

Chapter 99: Debugging RL Code

Learning objectives

- Take a broken RL implementation (e.g. a SAC agent that does not learn, or that converges to poor return) and diagnose the issue systematically.
- Write unit tests for the environment (step returns the correct shapes, reset works, the reward is bounded), for the replay buffer (sample returns the correct batch shape; storage and sampling are consistent), and for gradient shapes (the critic-loss backward pass produces gradients of the right shape).
- Add logging for Q-values (min, max, mean), rewards (per step and per episode), and entropy (or log_prob) so you can spot numerical issues, collapse, or scale problems.
- Identify the root cause (e.g. a wrong sign, a wrong target, the learning rate, or the reward scale) and fix it.
- Relate debugging practice to robot navigation and healthcare, where bugs can be costly.

Concept and real-world RL ...
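The replay-buffer testing idea above can be sketched against a minimal buffer; the ReplayBuffer here is an illustrative stand-in, not the chapter's code:

```python
import numpy as np

class ReplayBuffer:
    """Minimal ring buffer, included only so the tests below run."""
    def __init__(self, capacity, state_dim):
        self.states = np.zeros((capacity, state_dim))
        self.rewards = np.zeros(capacity)
        self.capacity, self.size, self.idx = capacity, 0, 0

    def add(self, state, reward):
        self.states[self.idx] = state
        self.rewards[self.idx] = reward
        self.idx = (self.idx + 1) % self.capacity   # overwrite oldest when full
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, rng=np.random.default_rng(0)):
        i = rng.integers(0, self.size, size=batch_size)
        return self.states[i], self.rewards[i]

# Unit tests in the spirit of the objectives above
buf = ReplayBuffer(capacity=100, state_dim=4)
for t in range(10):
    buf.add(np.full(4, t), reward=float(t))

s, r = buf.sample(8)
assert s.shape == (8, 4)   # sample returns the correct batch shape
assert r.max() <= 9.0      # sampled rewards come only from stored data
```

The same pattern (construct a tiny instance, assert on shapes and value ranges) applies to environment step/reset and to gradient shapes after a backward pass.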

March 10, 2026 · 4 min · 728 words · codefrydev