Used in Preliminary: PyTorch basics and in the curriculum for DQN, policy gradients, actor-critic, PPO, and SAC. PyTorch’s define-by-run style and clear autograd make it a natural fit for custom RL loss functions.
Why PyTorch matters for RL
- Tensors: States, actions, and batches are tensors. torch.tensor(), requires_grad=True, and .to(device) are in daily use.
- Autograd: Policy gradient and value losses need gradients; backward() and .grad are central.
- nn.Module: Q-networks, policy networks, and critics are nn.Module subclasses; parameters are collected for optimizers.
- Optimizers: torch.optim.Adam, zero_grad(), loss.backward(), optimizer.step().
- Device: Move model and data to GPU with .to(device) for faster training.
Core concepts with examples
Tensors and gradients
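A minimal sketch of tensor creation and gradient tracking, assuming only the standard torch package. The names state, w, and loss are illustrative and not taken from the curriculum.

```python
import torch

# A state as a float tensor; plain data does not need gradient tracking.
state = torch.tensor([0.1, -0.2, 0.3, 0.4])

# A parameter-like tensor that autograd should track.
w = torch.tensor([1.0, 2.0, 3.0, 4.0], requires_grad=True)

# Scalar "loss": squared dot product of w and state.
loss = (w * state).sum() ** 2

loss.backward()              # fills w.grad with d(loss)/dw
print(w.grad.shape)          # same shape as w
print(state.requires_grad)   # False: no gradient is stored for the data
```

Only tensors created with requires_grad=True (or the parameters of an nn.Module) receive a .grad after backward().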
Batches and shapes
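A short sketch of the shapes that typically appear in an RL batch, with sizes chosen purely for illustration (32 transitions, state dimension 4, 2 actions). The q_values tensor here is random and only stands in for a network output.

```python
import torch

batch_size, state_dim, n_actions = 32, 4, 2  # illustrative sizes

states = torch.randn(batch_size, state_dim)             # (32, 4)
actions = torch.randint(0, n_actions, (batch_size,))    # (32,) integer action ids
q_values = torch.randn(batch_size, n_actions)           # (32, 2) stand-in for network output

# Select Q(s, a) for the action taken in each row.
q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)  # (32,)

# A single state needs a leading batch dimension before a forward pass.
single = torch.randn(state_dim).unsqueeze(0)             # (1, 4)

print(states.shape, actions.shape, q_taken.shape, single.shape)
```

Keeping the batch dimension first and checking shapes after each gather or squeeze catches most silent broadcasting mistakes.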
Simple MLP with nn.Module
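One plausible way to write a small Q-network as an nn.Module subclass. The class name QNetwork matches the name used in the exercises, but the layer sizes (4 inputs, 64 hidden units, 2 outputs) are an assumption borrowed from Exercise 2.

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Maps a batch of states to one Q-value per action."""

    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)


q = QNetwork()
out = q(torch.randn(10, 4))      # batch of 10 states
print(out.shape)                 # torch.Size([10, 2])

# Parameters registered by the submodules are what an optimizer receives.
n_params = sum(p.numel() for p in q.parameters())
print(n_params)
```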
Training step (e.g. MSE loss)
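A sketch of a single optimizer step with an MSE loss. The model, batch, and targets are random placeholders, and the learning rate of 1e-3 is an arbitrary choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

states = torch.randn(32, 4)     # placeholder batch
targets = torch.randn(32, 2)    # placeholder regression targets

before = model[0].weight.detach().clone()

optimizer.zero_grad()                # clear stale gradients
pred = model(states)                 # forward pass
loss = F.mse_loss(pred, targets)     # scalar loss
loss.backward()                      # compute gradients
optimizer.step()                     # update parameters

changed = not torch.equal(before, model[0].weight.detach())
print(loss.item(), changed)
```

Comparing a parameter before and after the step is a cheap sanity check that the optimizer is actually attached to the model's parameters.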
Device and CPU/GPU
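A small sketch of device handling that falls back to CPU when no GPU is present. The layer and batch sizes are placeholders.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2).to(device)       # moves the module's parameters
batch = torch.randn(8, 4).to(device)     # returns a tensor on the device, so reassign

out = model(batch)
print(out.device)

# Bring results back to CPU (and out of the graph) before NumPy or logging.
out_cpu = out.detach().cpu()
```

For tensors, .to(device) returns a tensor on the target device rather than modifying the original, so the result must be assigned.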
Worked examples
Example 1 — Autograd (Exercise 1). Create \(x = 3.0\) with requires_grad=True, compute \(y = x^3 + 2x\), call y.backward(), and verify x.grad.
Solution
Step 1: x = torch.tensor(3.0, requires_grad=True). Step 2: y = x**3 + 2*x ⇒ y = 27 + 6 = 33. Step 3: y.backward(). Step 4: By hand, \(dy/dx = 3x^2 + 2\); at x=3 that is 27+2 = 29. So x.grad should be tensor(29.). PyTorch’s autograd applies the chain rule; we use the same mechanism for policy gradient and value loss in RL.
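The four steps above, written out as a runnable check:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x**3 + 2 * x      # 27 + 6
y.backward()          # dy/dx = 3x^2 + 2

print(y.item())       # 33.0
print(x.grad)         # tensor(29.)
```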
Example 2 — Training step. Given a network, a batch of inputs, and targets, perform one optimizer step (zero_grad, forward, loss, backward, step).
Solution
Step 1: optimizer.zero_grad() to clear old gradients. Step 2: pred = model(batch) then loss = F.mse_loss(pred, targets). Step 3: loss.backward() to compute gradients. Step 4: optimizer.step() to update parameters. Order matters: zero_grad → forward → loss → backward → step. In RL we do this for the critic (MSE to TD target) and for the policy (gradient ascent on return).
Exercises
Exercise 1. Create a scalar tensor \(x = 3.0\) with requires_grad=True. Compute \(y = x^3 + 2x\) and call y.backward(). Verify that x.grad equals \(3x^2 + 2\) evaluated at \(x=3\) (i.e. 29).
Exercise 2. Build a 2-layer MLP: nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2)). Forward pass a batch of 10 states of dimension 4. Print the output shape. Then compute the mean squared error between the output and a random target tensor of shape (10, 2), call backward() on the loss, and confirm that the first layer’s weight has non-zero gradients.
Exercise 3. Implement a function epsilon_greedy(q_values, epsilon) that takes a 1D tensor q_values of length \(n\) and returns an integer action: with probability \(\epsilon\) sample uniformly from \(0..n-1\), otherwise return argmax. Use torch.rand(1).item() for the random draw and q_values.argmax().item() for the greedy action. No gradients needed.
Exercise 4. Create a policy network that outputs logits for 2 actions: nn.Linear(4, 2). Given a state batch of shape (8, 4), compute action probabilities with F.softmax(logits, dim=-1) and sample 8 actions using the probabilities (e.g. torch.multinomial(probs, 1).squeeze(-1)). Then compute the log-probability of those actions: F.log_softmax(logits, dim=-1) and gather the chosen action log-probs. Return both actions and log_probs.
Exercise 5. Implement a training loop: (1) create the QNetwork above and an Adam optimizer; (2) for 100 steps, sample random states (32, 4) and random target Q-values (32, 2); (3) compute MSE loss, backward, step; (4) every 20 steps print the loss. Confirm the loss decreases.
Exercise 6. Create a tensor x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True) and compute y = x.sum() then y.backward(). What is x.grad? In RL: Summing over a batch of losses is common; gradients flow back to each element.
Exercise 7. Build a small network that maps state dim 4 to 2 action logits. Given a batch of states (8, 4), compute log-probs with F.log_softmax(logits, dim=-1). Use torch.gather to select the log-prob of a given action index (e.g. actions = [0, 1, 0, 1, 0, 1, 0, 1]). Return a tensor of shape (8,). In RL: This is the log-probability term in the policy gradient.
Exercise 8. (Challenge) Implement a target network: create two identical QNetwork instances, q and q_target. Every 10 training steps, copy q parameters into q_target with q_target.load_state_dict(q.state_dict()). Train q with MSE to q_target (next-state values). In RL: Target networks stabilize DQN.
Professor’s hints
- Always call optimizer.zero_grad() before loss.backward(); otherwise gradients accumulate across steps and your update is wrong.
- Use loss.backward() then optimizer.step() in that order. Do not call backward() twice on the same graph without re-running the forward pass (or passing retain_graph=True to the first call).
- In RL: Policy gradient maximizes return, so you often use loss = -log_prob * advantage and minimize loss; the minus sign turns gradient descent on the loss into ascent on return.
- For reproducibility, set torch.manual_seed(42) and np.random.seed(42) at the start of training.
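The sign convention in the policy-gradient hint above can be sketched as follows. The linear policy, the fixed action list, and the random advantages are placeholders, and averaging over the batch with .mean() is one common choice rather than the only one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)

policy = nn.Linear(4, 2)                         # logits for 2 actions
states = torch.randn(8, 4)
actions = torch.tensor([0, 1, 0, 1, 0, 1, 0, 1])
advantages = torch.randn(8)                      # placeholder; treated as constants

log_probs_all = F.log_softmax(policy(states), dim=-1)                  # (8, 2)
log_probs = log_probs_all.gather(1, actions.unsqueeze(1)).squeeze(1)   # (8,)

# Minimizing this loss is gradient ascent on mean(log_prob * advantage).
loss = -(log_probs * advantages).mean()
loss.backward()

print(loss.item(), policy.weight.grad.shape)
```

In a real agent the advantages would come from returns or a critic and would usually be detached so that no gradient flows through them into the critic.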
Common pitfalls
- Forgetting optimizer.zero_grad(): Gradients add by default. Without zeroing, the second step uses gradients from step 1 + step 2, which is rarely what you want.
- Using in-place operations on tensors that require grad: e.g. x.add_(1) can raise a RuntimeError or overwrite values that autograd saved for the backward pass. Prefer x = x + 1 or other out-of-place ops when x.requires_grad is True.
- Mixing CPU and GPU tensors: Ensure model and batch are on the same device. Use .to(device) consistently. Calling model(batch) when one is on CPU and the other on GPU raises an error.
- Taking .item() or indexing before backward: .item() returns a plain Python number with no graph attached, so anything computed from it receives no gradient. Use .item() (on a single-element tensor) or .detach() only for values you log or inspect, and keep every quantity that feeds the loss as a tensor until backward() has run.
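The first pitfall above (gradient accumulation) is easy to see on a single scalar. This sketch clears the gradient by hand with w.grad.zero_(), which stands in for what optimizer.zero_grad() does for every parameter it manages.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(w * 3).backward()
first = w.grad.item()          # 3.0

(w * 3).backward()             # no zeroing in between
accumulated = w.grad.item()    # 6.0: the two gradients were summed

w.grad.zero_()                 # manual reset of the stored gradient
(w * 3).backward()
after_reset = w.grad.item()    # 3.0

print(first, accumulated, after_reset)
```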
Docs: pytorch.org/docs. Used heavily in Volumes 3–5 (value approximation, policy gradients, PPO, SAC).