What this section covers

Deep learning is the technology that transformed reinforcement learning from a research curiosity into a practical tool for solving hard problems. Before AlphaGo, DQN, and PPO, RL was limited to tiny, hand-crafted state spaces. Deep neural networks changed everything by serving as powerful function approximators: able to map raw pixels to values, states to action probabilities, and observations to policies.

This section builds deep learning from the ground up, starting with the biological inspiration for artificial neurons and progressing through multi-layer networks, forward propagation, loss functions, and backpropagation. Every concept is introduced with explicit connections to RL algorithms so you always know why you are learning it.

Topics covered:

  • From biological neurons to artificial neurons: inputs, weights, bias, activation
  • The perceptron: the simplest learning rule, AND gate, XOR limitations
  • Activation functions: ReLU, sigmoid, tanh, softmax — when and why
  • Multi-layer perceptrons: architecture, parameter counting, solving XOR
  • Forward propagation: layer-by-layer computation, intermediate activations
  • Loss functions: MSE for regression, cross-entropy for classification
  • Backpropagation: chain rule, computing gradients, updating weights
  • Gradient descent for neural networks: learning rate, momentum, Adam
  • Training a neural network: mini-batches, epochs, training loop
  • Regularization: dropout, weight decay, early stopping
  • Convolutional neural networks: filters, pooling, feature maps
  • Batch normalization and residual connections
  • The complete DQN network: putting it all together

Why deep learning matters for RL

DQN is just Q-learning where the Q-function is a neural network.

That single sentence captures everything. In tabular Q-learning, we store a table Q[s, a] with one entry per (state, action) pair. This works for toy problems with a handful of states. For Atari games with 210×160 pixels, the state space is astronomically large; a table is impossible. The solution: replace the table with a neural network that takes the state as input and outputs Q-values for all actions.
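As a rough sketch of that idea, here is a tiny Q-network in NumPy. The sizes and the random weights are illustrative only (a 4-dimensional state and 2 actions, roughly CartPole-shaped), not the network used later in the section:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 4-dimensional state and 2 discrete actions.
state_dim, hidden_dim, n_actions = 4, 16, 2

# Randomly initialized weights stand in for a trained network.
W1 = rng.normal(scale=0.1, size=(state_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.1, size=(hidden_dim, n_actions))
b2 = np.zeros(n_actions)

def q_values(state):
    """Replaces the table lookup Q[s, a] with a forward pass."""
    h = np.maximum(0.0, state @ W1 + b1)  # hidden layer with ReLU
    return h @ W2 + b2                    # one Q-value per action

state = rng.normal(size=state_dim)
q = q_values(state)                 # shape (n_actions,)
greedy_action = int(np.argmax(q))   # act greedily, as tabular Q-learning would
```

The table's role is played entirely by `W1, b1, W2, b2`: instead of updating one cell per visited state, learning updates these shared weights, so the network generalizes across states it has never seen.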

| DL concept | Where it reappears in RL |
| --- | --- |
| Artificial neuron | Building block of all value and policy networks |
| Forward propagation | Computing Q(s,a) or π(a\|s) during inference |
| Loss function (MSE) | DQN loss: \((r + \gamma \max_{a'} Q(s', a') - Q(s,a))^2\) |
| Loss function (cross-entropy) | Policy gradient loss |
| Backpropagation | How Q-networks and policy networks are trained |
| ReLU activations | Standard hidden-layer activation in DQN, A3C, PPO |
| Softmax | Action probability distribution in policy networks |
| Batch normalization | Stabilizing training in deep RL |
| Convolutional layers | Processing raw pixel observations in Atari DQN |
| Gradient descent / Adam | Optimizing all modern RL networks |

Policy gradient methods go further: instead of approximating a value function, they parameterize the policy itself as a neural network π(a|s; θ) and optimize the expected return directly using gradient ascent. Actor–critic methods combine both: a policy network (actor) and a value network (critic), both trained with backpropagation.
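A minimal sketch of such a parameterized policy, assuming the simplest possible form (a single linear layer followed by a softmax; the sizes and weights are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical policy pi(a|s; theta): one linear layer + softmax.
state_dim, n_actions = 4, 3
theta = rng.normal(scale=0.1, size=(state_dim, n_actions))

def policy(state, theta):
    """Return a probability distribution over actions for this state."""
    logits = state @ theta
    z = logits - logits.max()   # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

state = rng.normal(size=state_dim)
probs = policy(state, theta)             # sums to 1 across actions
action = rng.choice(n_actions, p=probs)  # sample an action from pi(a|s)
```

Gradient ascent then nudges `theta` so that actions leading to high return become more probable; an actor–critic adds a second network that estimates the value of `state` to reduce the variance of that update.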

Pedagogical approach: NumPy first

We implement everything in NumPy first. PyTorch is introduced via linked notebooks.

This is intentional. Implementing a neural network forward pass in NumPy (manually computing matrix multiplications, writing the ReLU function, computing the softmax) gives you a deep understanding of what the framework does for you. When you later call torch.nn.Linear or loss.backward(), you will know exactly what is happening inside.
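To make "what the framework does for you" concrete: torch.nn.Linear computes `x @ W.T + b`, and a two-layer classifier is just those primitives composed. A sketch with made-up shapes:

```python
import numpy as np

def linear(x, W, b):
    """What torch.nn.Linear computes under the hood: x @ W.T + b."""
    return x @ W.T + b

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    z = x - x.max()   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Two-layer forward pass with illustrative shapes: 8 inputs -> 5 hidden -> 3 outputs.
rng = np.random.default_rng(42)
x = rng.normal(size=8)
W1, b1 = rng.normal(size=(5, 8)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)

probs = softmax(linear(relu(linear(x, W1, b1)), W2, b2))  # shape (3,), sums to 1
```

Everything after this (loss functions, backpropagation, Adam) is built on top of these few lines.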

The in-browser pyrepl exercises use NumPy exclusively because the browser environment (Pyodide) does not support PyTorch. Every concept here is fully implementable in NumPy, and writing the implementations yourself teaches more than calling framework code.

The linked JupyterLite notebooks (see each page) extend the exercises and transition to PyTorch once the concepts are solid.

Table of contents

| # | Page | Topic |
| --- | --- | --- |
| 1 | Biological Inspiration | Brain neurons → artificial neurons |
| 2 | The Perceptron | Perceptron learning rule, AND, XOR limits |
| 3 | Activation Functions | ReLU, sigmoid, tanh, softmax |
| 4 | Multi-Layer Perceptrons | Architecture, parameter counting, XOR solved |
| 5 | Forward Propagation | Layer-by-layer computation, batch forward pass |
| 6 | Loss Functions | MSE, cross-entropy, loss landscape |
| 7 | Backpropagation | Chain rule, gradients, numerical verification |
| 8 | Gradient Descent for NNs | Learning rate, momentum, Adam |
| 9 | Training Loop | Mini-batches, epochs, monitoring |
| 10 | Regularization | Dropout, weight decay, early stopping |
| 11 | Convolutional Neural Networks | Filters, pooling, feature maps |
| 12 | Batch Norm and Residuals | Normalization, skip connections |
| 13 | The DQN Network | Putting it all together for Atari |

Quick-start guide

  1. Complete pages in order. Each page builds on the previous one. The concepts are cumulative.
  2. Do every pyrepl exercise. They run in your browser — no setup needed. The struggle of implementing in NumPy is where the understanding happens.
  3. Check worked solutions only after a genuine attempt.
  4. Use the extra practice items. Debug exercises (item 5) are especially valuable β€” recognizing broken code trains the same skill as writing correct code.
  5. Open the JupyterLite notebooks for extended practice and PyTorch equivalents.

Estimated time: 2–4 hours per page. The full section takes approximately 30–50 hours.

Assessment checkpoints