Learning objectives
- Understand PyTorch’s `nn.Module` structure and how it differs from NumPy implementations
- Build a QNetwork and PolicyNetwork using `nn.Linear` and `nn.Sequential`
- Understand the training step: `optimizer.zero_grad()`, `loss.backward()`, `optimizer.step()`
- Map the NumPy MLP you built to its PyTorch equivalent
Concept and real-world motivation
PyTorch provides automatic differentiation (autograd): you write the forward pass, and PyTorch computes all gradients automatically via `loss.backward()`. This replaces the hand-coded backprop you implemented in NumPy. The `nn.Module` class is the building block for all networks: it tracks parameters, enables gradient flow, and handles training vs. evaluation modes.
This page shows PyTorch syntax. Since PyTorch doesn’t run in the browser, use the linked notebook for hands-on practice.
In RL: All major RL frameworks use PyTorch’s `nn.Module` for policies and value functions. Stable-Baselines3, CleanRL, RLlib, and most research code define their networks as `nn.Module` subclasses. Understanding this pattern lets you read any modern RL codebase.
From NumPy to PyTorch: side by side
NumPy forward pass (what you’ve been doing):
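The original listing is not reproduced here; a minimal sketch of such a NumPy forward pass, with illustrative dimensions (4 inputs, 64 hidden units, 2 outputs):

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Two-layer MLP forward pass: Linear -> ReLU -> Linear."""
    z1 = x @ W1.T + b1      # first linear layer
    h1 = np.maximum(0, z1)  # ReLU activation
    return h1 @ W2.T + b2   # output layer (e.g. Q-values)

# Illustrative shapes: 4 inputs, 64 hidden units, 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((64, 4)) * 0.1, np.zeros(64)
W2, b2 = rng.standard_normal((2, 64)) * 0.1, np.zeros(2)

q = forward(rng.standard_normal((1, 4)), W1, b1, W2, b2)
print(q.shape)  # (1, 2)
```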
PyTorch equivalent:
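A sketch of the same 4 → 64 → 2 network in PyTorch (dimensions are illustrative, matching the NumPy version above):

```python
import torch
import torch.nn as nn

# Same 4 -> 64 -> 2 network, but PyTorch tracks gradients for us
net = nn.Sequential(
    nn.Linear(4, 64),  # replaces W1, b1 and x @ W1.T + b1
    nn.ReLU(),         # replaces np.maximum(0, z)
    nn.Linear(64, 2),  # replaces W2, b2
)

x = torch.randn(1, 4)
q = net(x)             # forward pass; autograd records the graph
print(q.shape)         # torch.Size([1, 2])
```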
The key difference: PyTorch tracks gradients automatically. No need to write backprop by hand.
QNetwork for DQN
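A minimal sketch of a DQN-style QNetwork as an `nn.Module` subclass (layer sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, n_actions)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.out(x)  # raw Q-values, no output activation

q_net = QNetwork(state_dim=4, n_actions=2)
print(q_net(torch.randn(8, 4)).shape)  # torch.Size([8, 2])
```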
PolicyNetwork with softmax output
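A sketch of a policy network that ends in a softmax so the outputs form a probability distribution over actions (sizes illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNetwork(nn.Module):
    """Maps a state to a probability distribution over actions."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden)
        self.fc2 = nn.Linear(hidden, n_actions)

    def forward(self, x):
        logits = self.fc2(F.relu(self.fc1(x)))
        return F.softmax(logits, dim=-1)  # probabilities sum to 1

pi = PolicyNetwork(state_dim=4, n_actions=2)
probs = pi(torch.randn(1, 4))
print(round(probs.sum().item(), 4))  # 1.0
```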
Training step
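A minimal sketch of the training step, using a plain `nn.Linear` and MSE loss as stand-ins for any network and objective:

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 2)  # stand-in for any nn.Module
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x, target = torch.randn(8, 4), torch.randn(8, 2)

optimizer.zero_grad()           # 1. clear gradients from the previous step
loss = loss_fn(net(x), target)  # forward pass builds the computation graph
loss.backward()                 # 2. autograd fills every param.grad
optimizer.step()                # 3. update parameters using those gradients
```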
The three-line pattern `zero_grad → backward → step` replaces all the hand-coded gradient math.
Mapping NumPy to PyTorch
| NumPy | PyTorch |
|---|---|
| `W1 = np.random.randn(...)` | `nn.Linear(in, out)` |
| `z = x @ W.T + b` | `self.fc(x)` |
| `np.maximum(0, z)` | `F.relu(z)` |
| Manual gradient update | `optimizer.step()` |
| Your backprop code | `loss.backward()` |
| `W -= lr * dW` | `optim.SGD(...)` or `optim.Adam(...)` |
Exercise: NumPy equivalent — implement the forward pass of a 2-layer network matching the PyTorch QNetwork above.
Professor’s hints
- Always call `optimizer.zero_grad()` before `loss.backward()`. Forgetting this accumulates gradients across steps and produces wrong updates.
- Use `model.eval()` and `torch.no_grad()` during evaluation to disable dropout and skip gradient tracking (saves memory and compute).
- `nn.Sequential` is great for simple feed-forward networks. Use a custom `nn.Module` subclass for anything with skip connections or multiple outputs.
- Gradient clipping (`torch.nn.utils.clip_grad_norm_`) is commonly used in RL to prevent exploding gradients.
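The gradient-clipping hint can be sketched as follows (the `max_norm=1.0` value is an illustrative choice, not a recommendation from this page):

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 2)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

loss = net(torch.randn(8, 4)).pow(2).mean()
opt.zero_grad()
loss.backward()
# Rescale gradients in place so their global norm is at most 1.0
torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=1.0)
opt.step()
```

Clipping goes between `backward()` and `step()`: the gradients must already exist, but must be rescaled before the optimizer consumes them.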
Common pitfalls
- Forgetting `optimizer.zero_grad()`: gradients accumulate by default in PyTorch.
- Calling `model.train()` vs `model.eval()` at the wrong times (dropout/batchnorm behave differently in each mode).
- Logging `loss` instead of `loss.item()`: the raw tensor keeps the whole computation graph alive, which wastes memory. Use `loss.item()` to get the plain Python scalar.
Worked solution — DQN-style QNetwork
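A sketch of one possible solution: a two-layer QNetwork plus greedy action selection under `torch.no_grad()` (dimensions illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Two-layer network mapping a state to one Q-value per action."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden)
        self.fc2 = nn.Linear(hidden, n_actions)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

q_net = QNetwork(state_dim=4, n_actions=2)

# Greedy action selection: no gradients needed at decision time
state = torch.randn(1, 4)
with torch.no_grad():
    action = q_net(state).argmax(dim=-1).item()
print(action)  # 0 or 1
```

The NumPy equivalent from the exercise is the same computation written with `@`, `np.maximum`, and explicit weight matrices.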
Extra practice
- Notebook practice: Complete the PyTorch exercises in the local notebook:
- Coding: In the notebook, implement `train_step` for a `PolicyNetwork` using cross-entropy loss. Use `torch.distributions.Categorical` to compute the log-probability of actions.
- Challenge: In the notebook, implement a target network: create a second `QNetwork` with the same architecture, and periodically copy weights using `target_net.load_state_dict(online_net.state_dict())`.
- Variant: Modify `QNetwork` to output both Q-values and a value estimate (for advantage computation). This is the dueling DQN architecture: `Q(s,a) = V(s) + A(s,a)`.
- Debug: The training step below is missing `optimizer.zero_grad()`. Describe what happens to the gradients and fix it.
- Conceptual: Why does PyTorch’s autograd replace the need to write backprop by hand? What does the computation graph track?
- Recall: Name the three steps in every PyTorch training step. What does each one do?
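For the Debug exercise, a minimal buggy training loop of the kind it describes (all names illustrative):

```python
import torch
import torch.nn as nn

net = nn.Linear(2, 1)
opt = torch.optim.SGD(net.parameters(), lr=0.1)
x, y = torch.randn(4, 2), torch.randn(4, 1)

# Bug: without zero_grad(), .grad holds the SUM of all past gradients,
# so each step applies an ever-growing, wrong update.
for _ in range(3):
    # opt.zero_grad()   # <-- the missing line; uncomment to fix
    loss = ((net(x) - y) ** 2).mean()
    loss.backward()     # adds to existing .grad instead of replacing it
    opt.step()
```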