You have completed DL Foundations. This page reviews the key ideas and shows exactly why RL needs neural networks — bridging to DQN and policy gradients.


DL Foundations Recap Quiz

Five questions to confirm your understanding. Try to answer each one before revealing the collapsed solution.


Q1. What is the role of activation functions in neural networks?

Answer
Activation functions introduce non-linearity. Without them, any number of stacked linear layers collapses to a single linear transformation \(W'x + b'\). Activation functions like ReLU, sigmoid, and tanh allow the network to represent complex, non-linear mappings. In practice, ReLU (\(\max(0, z)\)) is the default for hidden layers because it does not saturate for positive inputs (which mitigates vanishing gradients) and is computationally cheap.
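The collapse of stacked linear layers can be verified directly in NumPy. A minimal sketch (the shapes and random weights below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))            # batch of 5 inputs, 3 features each

# Two stacked linear layers, no activation in between
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)
two_linear = (x @ W1 + b1) @ W2 + b2

# The same mapping as ONE linear layer: W' = W1 W2, b' = b1 W2 + b2
W_prime = W1 @ W2
b_prime = b1 @ W2 + b2
one_linear = x @ W_prime + b_prime
assert np.allclose(two_linear, one_linear)   # stacking added nothing

# Inserting a ReLU between the layers breaks the collapse
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
assert not np.allclose(with_relu, one_linear)
```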

Q2. What does backpropagation compute?

Answer
Backpropagation computes the gradient of the loss with respect to every weight and bias in the network. It applies the chain rule from the output layer back to the first layer: \(\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}\). The result is a set of gradient tensors (one per parameter) used by the optimizer to update the weights.
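The chain rule above can be sketched on a one-weight "network" and checked against a finite difference (the scalar values are arbitrary):

```python
import numpy as np

# One weight, one input: L = (tanh(w*x) - y)^2
x, y, w = 1.5, 0.3, 0.8

z = w * x                   # linear step
a = np.tanh(z)              # activation
L = (a - y) ** 2            # loss

# Chain rule: dL/dw = dL/da * da/dz * dz/dw
dL_da = 2 * (a - y)
da_dz = 1 - np.tanh(z) ** 2
dz_dw = x
grad = dL_da * da_dz * dz_dw

# Numerical check with a forward finite difference
eps = 1e-6
L_plus = (np.tanh((w + eps) * x) - y) ** 2
numeric = (L_plus - L) / eps
assert abs(grad - numeric) < 1e-4
```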

Q3. What is the difference between MSE loss and cross-entropy loss?

Answer
  • MSE \(= \frac{1}{n}\sum(y_i - \hat{y}_i)^2\) measures squared distance. Use for regression (predicting continuous values, like Q-values in DQN).
  • Cross-entropy \(= -\sum y_i \log(\hat{y}_i)\) measures prediction quality for probability distributions. Use for classification and for policy outputs (probability over actions).

In RL: DQN’s TD loss uses MSE; REINFORCE uses log-probability (related to cross-entropy) for the policy gradient.
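Both losses are one-liners in NumPy; the targets and predictions below are made-up illustrations:

```python
import numpy as np

# MSE: regression (e.g. Q-value targets vs. predictions)
y_true = np.array([2.0, -1.0, 0.5])
y_pred = np.array([1.8, -0.7, 0.4])
mse = np.mean((y_true - y_pred) ** 2)

# Cross-entropy: one-hot label vs. predicted probabilities
p_true = np.array([0.0, 1.0, 0.0])       # the "correct" class/action
p_pred = np.array([0.2, 0.7, 0.1])       # network's probability output
cross_entropy = -np.sum(p_true * np.log(p_pred))
# Only the log-probability of the taken class survives the sum --
# the same log pi(a|s) term that appears in the REINFORCE objective.
```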


Q4. Why does adding more layers help with non-linear patterns?

Answer
Each layer learns to combine features from the previous layer into higher-level representations. A network with no hidden layer is purely linear; a single hidden non-linear layer can already approximate any continuous function (universal approximation theorem), but may need impractically many units. Deeper networks learn hierarchical features with far fewer units: edges → textures → shapes → objects (for images); raw positions → velocities → dynamics → policy (for RL states).
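One concrete instance of a hidden layer buying non-linearity: a hand-constructed two-unit ReLU layer that computes \(|x|\), which no purely linear network can represent. The weights below are chosen by hand, not learned:

```python
import numpy as np

# |x| = relu(x) + relu(-x)
W1 = np.array([[1.0, -1.0]])    # hidden layer: two units computing x and -x
W2 = np.array([[1.0], [1.0]])   # output layer: sum the two rectified parts

def net(x):
    h = np.maximum(0, x @ W1)   # ReLU hidden activations
    return h @ W2

xs = np.array([[-2.0], [0.5], [3.0]])
assert np.allclose(net(xs), np.abs(xs))
```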

Q5. In 3 sentences, explain forward propagation.

Answer
Forward propagation computes the network’s output given an input. Starting from the input layer, each layer applies a linear transformation \(z = Wx + b\) followed by a non-linear activation \(a = f(z)\), and passes the result to the next layer. The final layer’s output is the network’s prediction — for classification, probabilities; for regression, a scalar or vector value.
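Those three sentences translate directly into NumPy for a 2-layer network; the layer sizes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

def forward(x, W1, b1, W2, b2):
    z1 = x @ W1 + b1            # layer 1: linear transformation
    a1 = np.maximum(0, z1)      # layer 1: ReLU activation
    z2 = a1 @ W2 + b2           # layer 2: linear transformation (output)
    return z2                   # regression output, no final activation

W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)
x = rng.normal(size=(3, 4))     # batch of 3 inputs, 4 features each
y = forward(x, W1, b1, W2, b2)
assert y.shape == (3, 2)        # one 2-dim prediction per input
```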

What RL Adds to Deep Learning

How deep RL differs from supervised deep learning:

  • Data source: a fixed labeled dataset vs. the agent's own experience, collected during training
  • Labels: human-provided vs. rewards from the environment (often sparse)
  • Loss function: MSE or cross-entropy vs. TD error (DQN) or a policy-gradient objective (REINFORCE/PPO)
  • Training stability: generally stable vs. often unstable (correlated data, moving targets)
  • Exploration: not needed vs. critical; the agent must balance exploration and exploitation
  • Dataset size: fixed upfront vs. growing as the agent collects more experience

Why instability? In supervised learning, targets are fixed. In RL, the target \(r + \gamma \max_{a'} Q(s', a')\) changes as the Q-network improves. This is like chasing moving goalposts. DQN addresses this with:

  • Target network: A frozen copy of the Q-network used to compute targets
  • Replay buffer: Stores past transitions and samples random mini-batches (breaks correlations)

Both are direct consequences of the instability that arises when the “dataset” and the “labels” both depend on the current network.
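The moving-target effect and the target-network fix can both be seen in a toy setting. The sketch below is not DQN: it uses a single scalar parameter as a one-state, one-action "Q-network", with made-up constants, just to show targets staying frozen between syncs:

```python
# Toy "Q-network": Q(s) = w, a single scalar parameter.
w = 0.0            # online parameter (updated every step)
w_target = 0.0     # frozen copy, used only to compute targets
r, gamma, lr, C = 1.0, 0.9, 0.1, 10

for step in range(100):
    target = r + gamma * w_target   # target from the FROZEN copy
    td_error = target - w
    w += lr * td_error              # gradient step on (target - Q)^2
    if step % C == 0:
        w_target = w                # sync the frozen copy every C steps

# Between syncs the target is constant, so each phase is ordinary
# supervised regression; w climbs toward the fixed point r/(1-gamma) = 10.
```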


Bridge Exercise

You know how to train a neural network on fixed data. Now imagine the data changes as you train — that is exactly what happens in RL.

Worked solution and key insight

The bridge exercise shows the fundamental challenge of deep RL: the targets \(r + \gamma \max_{a'} Q(s', a')\) depend on the same network you are training. As the network updates, the targets shift — making training unstable.

DQN’s fix: Maintain a second “target network” with parameters \(\theta^-\), updated only every \(C\) steps. Use it to compute targets: \(y_i = r + \gamma \max_{a'} Q(s', a'; \theta^-)\). Now targets stay fixed for \(C\) steps at a time.

# Target network (NumPy version): keep a frozen copy of the Q-network
W1_target, b1_target = W1.copy(), b1.copy()
W2_target, b2_target = W2.copy(), b2.copy()
# ...use the *_target parameters whenever you compute TD targets...

# Every C steps, sync the frozen copy with the online network:
# W1_target, b1_target = W1.copy(), b1.copy()
# W2_target, b2_target = W2.copy(), b2.copy()

Ready for RL?

Check each box before continuing:

  • I can implement forward propagation for a 2-layer network in NumPy
  • I understand what backpropagation computes (gradient of loss w.r.t. all weights)
  • I implemented a training loop with loss tracking (forward → loss → backprop → update)
  • I understand why non-linear activations are necessary
  • I know when to use MSE vs cross-entropy loss
  • I understand the difference between supervised learning and deep RL (moving targets, exploration)

Next steps:

  • If all boxes are checked, continue to RL.
  • If any box is unchecked, return to the specific DL Foundations page covering that topic.