You have completed DL Foundations. This page reviews the key ideas and shows exactly why RL needs neural networks — bridging to DQN and policy gradients.
DL Foundations Recap Quiz
Five questions to confirm your understanding. Try to answer each question before expanding the collapsed solution.
Q1. What is the role of activation functions in neural networks?
Answer
Activation functions introduce non-linearity between layers; without them, any stack of layers collapses into a single linear map. ReLU \(\max(0, z)\) is the default for hidden layers because it is computationally cheap and less prone to vanishing gradients than sigmoid or tanh.
Q2. What does backpropagation compute?
Answer
Backpropagation computes the gradient of the loss with respect to every weight and bias in the network, applying the chain rule backward from the output layer to the input. These gradients tell the optimizer how to adjust each parameter to reduce the loss.
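As a sanity check on what backpropagation computes, here is a minimal sketch (illustrative names, not a specific library's API) comparing the chain-rule gradient of a tiny one-layer network against a finite-difference gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3)        # input
y = 1.0                           # target
W = rng.standard_normal((1, 3))   # weights of a 1-layer network

def loss(W):
    z = W @ x                     # forward: linear layer
    y_hat = 1 / (1 + np.exp(-z))  # sigmoid activation
    return float(((y_hat - y) ** 2).mean())  # MSE loss

# Analytic gradient via the chain rule:
# dL/dW = dL/dy_hat * dy_hat/dz * dz/dW
z = W @ x
y_hat = 1 / (1 + np.exp(-z))
dL_dW = (2 * (y_hat - y) * y_hat * (1 - y_hat))[:, None] * x[None, :]

# Numerical gradient (finite differences) for comparison
eps = 1e-6
num_grad = np.zeros_like(W)
for i in range(W.shape[1]):
    Wp, Wm = W.copy(), W.copy()
    Wp[0, i] += eps
    Wm[0, i] -= eps
    num_grad[0, i] = (loss(Wp) - loss(Wm)) / (2 * eps)

assert np.allclose(dL_dW, num_grad, atol=1e-5)
```

The two gradients agree to within the finite-difference tolerance, which is exactly what "backprop computes dL/dW via the chain rule" means in practice.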
Q3. What is the difference between MSE loss and cross-entropy loss?
Answer
- MSE \(= \frac{1}{n}\sum(y_i - \hat{y}_i)^2\) measures squared distance. Use for regression (predicting continuous values, like Q-values in DQN).
- Cross-entropy \(= -\sum y_i \log(\hat{y}_i)\) measures prediction quality for probability distributions. Use for classification and for policy outputs (probability over actions).
In RL: DQN’s TD loss uses MSE; REINFORCE uses log-probability (related to cross-entropy) for the policy gradient.
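The two losses above can be computed directly on toy data; this sketch uses illustrative values, not any particular library's loss API:

```python
import numpy as np

# MSE on regression-style targets (e.g. Q-values in DQN)
y_true = np.array([2.0, 0.5, 1.5])
y_pred = np.array([1.8, 0.7, 1.4])
mse = np.mean((y_true - y_pred) ** 2)       # mean squared distance

# Cross-entropy on a probability distribution (e.g. a policy over actions)
p_true = np.array([0.0, 1.0, 0.0])          # one-hot label
p_pred = np.array([0.1, 0.8, 0.1])          # predicted probabilities
cross_entropy = -np.sum(p_true * np.log(p_pred))  # -sum y_i log(y_hat_i)

print(round(float(mse), 4))            # 0.03
print(round(float(cross_entropy), 4))  # 0.2231
```

Note that with a one-hot label, cross-entropy reduces to the negative log-probability of the correct class, which is why REINFORCE's log-probability objective is closely related.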
Q4. Why does adding more layers help with non-linear patterns?
Answer
Each layer applies a non-linear transformation to the previous layer's output, so stacked layers compose simple non-linear functions into increasingly complex ones, building hierarchical features. Without non-linear activations, extra layers add nothing: a composition of linear maps is still a single linear map.
Q5. In 3 sentences, explain forward propagation.
Answer
Forward propagation passes an input through the network layer by layer. At each layer, the input is multiplied by a weight matrix, a bias is added, and a non-linear activation is applied, producing that layer's output. The final layer's output is the network's prediction, which the loss function compares against the target.
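Those three sentences map directly onto code. Here is a minimal 2-layer forward pass in NumPy (shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)   # layer 1: 3 -> 4
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)   # layer 2: 4 -> 2

def relu(z):
    return np.maximum(0.0, z)   # max(0, z), elementwise

def forward(x):
    h = relu(W1 @ x + b1)       # hidden layer: linear map + non-linearity
    return W2 @ h + b2          # output layer (e.g. Q-values: no activation)

out = forward(rng.standard_normal(3))
print(out.shape)  # (2,)
```

The output layer here has no activation, which is the right choice when the outputs are unbounded values such as Q-values; a softmax would be applied instead for a policy over actions.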
What RL Adds to Deep Learning
| | Supervised Deep Learning | Deep RL |
|---|---|---|
| Data source | Fixed labeled dataset | Agent’s own experience (collected during training) |
| Labels | Human-provided | Rewards from the environment (often sparse) |
| Loss function | MSE or cross-entropy | TD error (DQN), policy gradient (REINFORCE/PPO) |
| Training stability | Generally stable | Often unstable (correlated data, moving targets) |
| Exploration | Not needed | Critical — must balance exploration and exploitation |
| Dataset size | Fixed upfront | Grows as agent collects more experience |
Why instability? In supervised learning, targets are fixed. In RL, the target \(r + \gamma \max_{a'} Q(s', a')\) changes as the Q-network improves, like chasing moving goalposts. DQN addresses this with:
- Target network: A frozen copy of the Q-network used to compute targets
- Replay buffer: Stores past transitions and samples random mini-batches (breaks correlations)
Both are direct consequences of the instability that arises when the “dataset” and the “labels” both depend on the current network.
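Both stabilizers fit in a few lines. This is a sketch under illustrative names (not a specific DQN implementation): a replay buffer that samples random mini-batches, and target-network parameters refreshed only every C steps.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off

    def push(self, transition):               # (s, a, r, s_next, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(self.buffer, batch_size)

# Target-network bookkeeping: copy weights only every C steps,
# so the targets stay frozen in between.
C = 1000
def maybe_sync(step, q_params, target_params):
    if step % C == 0:
        target_params.update(q_params)        # frozen copy for targets
    return target_params

# Usage sketch
buf = ReplayBuffer()
for i in range(64):
    buf.push((i, 0, 1.0, i + 1, False))
batch = buf.sample(32)
```

In a full implementation the parameter copy would be a deep copy of network weights (e.g. a state-dict copy in PyTorch); the dictionary update above just illustrates the schedule.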
Bridge Exercise
You know how to train a neural network on fixed data. Now imagine the data changes as you train — that is exactly what happens in RL.
Worked solution and key insight
The bridge exercise shows the fundamental challenge of deep RL: the targets \(r + \gamma \max_{a'} Q(s', a')\) depend on the same network you are training. As the network updates, the targets shift, making training unstable.
DQN’s fix: Maintain a second “target network” with parameters \(\theta^-\) (updated only every \(C\) steps). Use it for targets: \(y_i = r + \gamma \max_{a'} Q(s', a'; \theta^-)\). Now targets are stable for \(C\) steps at a time.
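The target formula above can be sketched numerically. Here a toy linear function stands in for both networks (illustrative only, not an actual DQN):

```python
import numpy as np

gamma = 0.99
rng = np.random.default_rng(0)
W_online = rng.standard_normal((2, 3))   # current Q-network params (theta)
W_target = W_online.copy()               # frozen copy (theta^-)

def q_values(W, s):
    return W @ s                         # toy linear Q over 2 actions

s, a, r, s_next = rng.standard_normal(3), 0, 1.0, rng.standard_normal(3)

# y = r + gamma * max_a' Q(s', a'; theta^-): the target uses the FROZEN
# params, so it stays fixed while W_online is updated for the next C steps.
y = r + gamma * np.max(q_values(W_target, s_next))
td_error = y - q_values(W_online, s)[a]
loss = td_error ** 2                     # MSE on the TD error
```

Because `y` is computed from `W_target`, gradient updates to `W_online` do not move the target until the next sync, which is exactly the stability the worked solution describes.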
Ready for RL?
Check each box before continuing:
- I can implement forward propagation for a 2-layer network in NumPy
- I understand what backpropagation computes (gradient of loss w.r.t. all weights)
- I implemented a training loop with loss tracking (forward → loss → backprop → update)
- I understand why non-linear activations are necessary
- I know when to use MSE vs cross-entropy loss
- I understand the difference between supervised learning and deep RL (moving targets, exploration)
If all boxes are checked: continue to RL.
Next steps:
- Prerequisites: PyTorch for RL — practical PyTorch for RL implementations
- Curriculum Volume 1: Mathematical Foundations — MDPs, Bellman equations, value functions
If any box is unchecked, return to the specific DL Foundations page covering that topic.