What this section covers
Deep learning is the technology that transformed reinforcement learning from a research curiosity into a practical tool for solving hard problems. Before AlphaGo, DQN, and PPO, RL was limited to tiny, hand-crafted state spaces. Deep neural networks changed everything by serving as powerful function approximators: able to map raw pixels to values, states to action probabilities, and observations to policies.
This section builds deep learning from the ground up, starting with the biological inspiration for artificial neurons and progressing through multi-layer networks, forward propagation, loss functions, and backpropagation. Every concept is introduced with explicit connections to RL algorithms so you always know why you are learning it.
Topics covered:
- From biological neurons to artificial neurons: inputs, weights, bias, activation
- The perceptron: the simplest learning rule, AND gate, XOR limitations
- Activation functions: ReLU, sigmoid, tanh, softmax: when and why
- Multi-layer perceptrons: architecture, parameter counting, solving XOR
- Forward propagation: layer-by-layer computation, intermediate activations
- Loss functions: MSE for regression, cross-entropy for classification
- Backpropagation: chain rule, computing gradients, updating weights
- Gradient descent for neural networks: learning rate, momentum, Adam
- Training a neural network: mini-batches, epochs, training loop
- Regularization: dropout, weight decay, early stopping
- Convolutional neural networks: filters, pooling, feature maps
- Batch normalization and residual connections
- The complete DQN network: putting it all together
Why deep learning matters for RL
DQN is just Q-learning where the Q-function is a neural network.
That single sentence captures everything. In tabular Q-learning, we store a table Q[s, a] with one entry per (state, action) pair. This works for toy problems with a handful of states. For Atari games with 210×160 pixels, the state space is astronomically large; a table is impossible. The solution: replace the table with a neural network that takes the state as input and outputs Q-values for all actions.
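To make the table-to-network swap concrete, here is a minimal NumPy sketch (the sizes and random initialization are illustrative assumptions, not the Atari architecture): a two-layer network maps a state vector to one Q-value per action, so the table lookup Q[s, a] becomes a forward pass followed by an argmax.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a 4-dimensional state, 2 actions.
state_dim, hidden_dim, n_actions = 4, 16, 2

# Instead of a table Q[s, a], the parameters are weight matrices and biases.
W1 = rng.normal(0, 0.1, (hidden_dim, state_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(0, 0.1, (n_actions, hidden_dim))
b2 = np.zeros(n_actions)

def q_network(state):
    """Map a state vector to one Q-value per action."""
    h = np.maximum(0.0, W1 @ state + b1)  # hidden layer with ReLU
    return W2 @ h + b2                    # one output per action

state = rng.normal(size=state_dim)
q_values = q_network(state)
greedy_action = int(np.argmax(q_values))  # argmax replaces the table lookup
```

The same network handles states it has never seen, which is exactly what the table cannot do.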
| DL concept | Where it reappears in RL |
|---|---|
| Artificial neuron | Building block of all value and policy networks |
| Forward propagation | Computing Q(s,a) or π(a|s) during inference |
| Loss function (MSE) | DQN loss: \((r + \gamma \max_{a'} Q(s', a') - Q(s,a))^2\) |
| Loss function (cross-entropy) | Policy gradient loss |
| Backpropagation | How Q-networks and policy networks are trained |
| ReLU activations | Standard hidden-layer activation in DQN, A3C, PPO |
| Softmax | Action probability distribution in policy networks |
| Batch normalization | Stabilizing training in deep RL |
| Convolutional layers | Processing raw pixel observations in Atari DQN |
| Gradient descent / Adam | Optimizing all modern RL networks |
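The MSE row of the table can be computed by hand. A sketch with made-up numbers for a single transition (s, a, r, s'), assuming a discount factor of 0.99:

```python
import numpy as np

gamma = 0.99                     # discount factor (illustrative)
r = 1.0                          # reward observed for this transition
a = 0                            # action that was taken
q_s = np.array([0.5, 0.2])       # current network's Q(s, .)
q_s_next = np.array([0.4, 0.7])  # Q(s', .), in DQN taken from a target network

# TD target: r + gamma * max over a' of Q(s', a')
td_target = r + gamma * np.max(q_s_next)  # 1.0 + 0.99 * 0.7 = 1.693

# Squared TD error: the per-transition DQN loss from the table above
loss = (td_target - q_s[a]) ** 2
```

In practice this loss is averaged over a mini-batch of transitions and minimized with gradient descent, exactly as covered in pages 6 through 9.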
Policy gradient methods go further: instead of approximating a value function, they parameterize the policy itself as a neural network π(a|s; θ) and optimize the expected return directly using gradient ascent. Actor–critic methods combine both: a policy network (actor) and a value network (critic), both trained with backpropagation.
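As a sketch of what "parameterize the policy" means, here is a linear softmax policy in NumPy (the parameter shapes are illustrative; a deeper network works the same way, ending in a softmax over actions):

```python
import numpy as np

rng = np.random.default_rng(1)
state_dim, n_actions = 4, 3

# Illustrative policy parameters theta: one row of weights per action.
theta = rng.normal(0, 0.1, (n_actions, state_dim))

def policy(state):
    """pi(a|s; theta): softmax over per-action preference scores."""
    logits = theta @ state
    logits = logits - logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

state = rng.normal(size=state_dim)
probs = policy(state)
action = int(rng.choice(n_actions, p=probs))  # the actor samples an action
```

Gradient ascent then nudges theta so that actions leading to high returns become more probable.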
Pedagogical approach: NumPy first
We implement everything in NumPy first. PyTorch is introduced via linked notebooks.
This is intentional. Implementing a neural network forward pass in NumPy β manually computing matrix multiplications, writing the ReLU function, computing the softmax β gives you a deep understanding of what the framework does for you. When you later call torch.nn.Linear or loss.backward(), you will know exactly what is happening inside.
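For example, the three operations named above are only a few lines each in NumPy. The sketch below mirrors what the framework primitives compute (the `linear` helper follows the `x @ W.T + b` convention that `torch.nn.Linear` uses; the example inputs are made up):

```python
import numpy as np

def linear(x, W, b):
    """Affine layer: x @ W.T + b, matching torch.nn.Linear's convention."""
    return x @ W.T + b

def relu(x):
    """Elementwise max(0, x)."""
    return np.maximum(0.0, x)

def softmax(x):
    """Exponentiate and normalize; subtracting the max avoids overflow."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

# A tiny forward pass with illustrative values.
x = np.array([1.0, -2.0])
W = np.array([[0.5, 0.5],
              [1.0, 0.0]])
b = np.zeros(2)
h = relu(linear(x, W, b))  # hidden activation
p = softmax(h)             # probabilities summing to 1
```

Writing these by hand once makes the framework calls transparent rather than magical.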
The in-browser pyrepl exercises use NumPy exclusively because the browser environment (Pyodide) does not support PyTorch. Every concept is fully implementable in NumPy, and the implementations here are pedagogically superior to framework code for learning purposes.
The linked JupyterLite notebooks (see each page) extend the exercises and transition to PyTorch once the concepts are solid.
Table of contents
| # | Page | Topic |
|---|---|---|
| 1 | Biological Inspiration | Brain neurons → artificial neurons |
| 2 | The Perceptron | Perceptron learning rule, AND, XOR limits |
| 3 | Activation Functions | ReLU, sigmoid, tanh, softmax |
| 4 | Multi-Layer Perceptrons | Architecture, parameter counting, XOR solved |
| 5 | Forward Propagation | Layer-by-layer computation, batch forward pass |
| 6 | Loss Functions | MSE, cross-entropy, loss landscape |
| 7 | Backpropagation | Chain rule, gradients, numerical verification |
| 8 | Gradient Descent for NNs | Learning rate, momentum, Adam |
| 9 | Training Loop | Mini-batches, epochs, monitoring |
| 10 | Regularization | Dropout, weight decay, early stopping |
| 11 | Convolutional Neural Networks | Filters, pooling, feature maps |
| 12 | Batch Norm and Residuals | Normalization, skip connections |
| 13 | The DQN Network | Putting it all together for Atari |
Quick-start guide
- Complete pages in order. Each page builds on the previous one. The concepts are cumulative.
- Do every pyrepl exercise. They run in your browser, no setup needed. The struggle of implementing in NumPy is where the understanding happens.
- Check worked solutions only after a genuine attempt.
- Use the extra practice items. Debug exercises (item 5) are especially valuable: recognizing broken code trains the same skill as writing correct code.
- Open the JupyterLite notebooks for extended practice and PyTorch equivalents.
Estimated time: 2–4 hours per page. The full section takes approximately 30–50 hours.
Assessment checkpoints
- After page 3 – Checkpoint A: Neurons and Activations – Can you implement a neuron and all four activations from scratch in NumPy?
- After page 6 – Checkpoint B: Forward Pass and Loss – Can you implement a full forward pass and compute MSE and cross-entropy?
- After page 9 – Checkpoint C: Backprop and Training – Can you implement backpropagation and a training loop from scratch?
- After page 13 – Checkpoint D: DQN Architecture – Can you describe the DQN network architecture and explain why each component is needed?