Learning objectives
- Implement SGD, Momentum, and Adam from scratch in NumPy
- Understand how learning rate affects convergence speed and stability
- Explain why adaptive optimizers like Adam often outperform plain SGD in practice
- Recognize that these same optimizers drive training in deep RL methods
Concept and real-world motivation
An optimizer controls how weights are updated after each gradient computation. The simplest optimizer, Stochastic Gradient Descent (SGD), moves weights in the direction opposite to the gradient by a fixed step size (learning rate \(\alpha\)). However, SGD can oscillate or converge slowly in valleys with steep walls and shallow floors.
Momentum adds a “velocity” term that accumulates past gradients, allowing the optimizer to build up speed in consistent directions and damp oscillations. Think of a ball rolling downhill — it picks up momentum instead of stopping at every bump. Adam goes further by keeping per-parameter adaptive learning rates: parameters that receive large gradients get smaller updates, and rarely updated parameters get larger updates. In practice, Adam is the default for most deep learning and RL work.
In RL: DQN uses Adam. Policy gradient methods use Adam or RMSprop. The gradient of the TD loss or of the policy objective is what these optimizers receive. A learning rate that is too large causes instability; one that is too small causes painfully slow convergence.
Math:
SGD: \(w \leftarrow w - \alpha \nabla L\)
Momentum: \(v \leftarrow \beta v - \alpha \nabla L\), \(w \leftarrow w + v\)
Adam:
- \(m \leftarrow \beta_1 m + (1-\beta_1)\nabla L\) (first moment — mean)
- \(v \leftarrow \beta_2 v + (1-\beta_2)(\nabla L)^2\) (second moment — uncentered variance)
- Bias-corrected: \(\hat{m} = m/(1-\beta_1^t)\), \(\hat{v} = v/(1-\beta_2^t)\)
- \(w \leftarrow w - \alpha \frac{\hat{m}}{\sqrt{\hat{v}}+\epsilon}\)
Exercise: Implement SGD and Momentum from scratch and compare them minimizing \(f(w) = (w-3)^2\).
Professor’s hints
- The learning rate is the single most important hyperparameter. Try 10x larger and 10x smaller to see the effect.
- Momentum \(\beta=0.9\) means “remember 90% of previous velocity.” Higher \(\beta\) (e.g. 0.99) gives more momentum but can overshoot.
- Adam’s bias correction is important in early steps — without it, the first update would be much too small.
- In RL, Adam with lr=3e-4 is a common default that works across many problems.
Common pitfalls
- Applying \(\beta\) to the gradient instead of the velocity vector (see debug exercise below).
- Forgetting to reset momentum state when you restart training or change the model architecture.
- Using the same learning rate for all problems — always tune it.
Worked solution
SGD converges but more slowly. Momentum overshoots slightly then converges faster because it accelerates in the consistent downhill direction.
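This comparison can be reproduced with a minimal sketch; the starting point \(w=0\), \(\alpha=0.01\), \(\beta=0.9\), and the 100-step budget are illustrative choices, not prescribed by the exercise:

```python
def grad(w):
    """Gradient of f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

alpha, beta, steps = 0.01, 0.9, 100

# Plain SGD: w <- w - alpha * grad
w_sgd = 0.0
for _ in range(steps):
    w_sgd -= alpha * grad(w_sgd)

# Momentum: v <- beta * v - alpha * grad;  w <- w + v
w_mom, v = 0.0, 0.0
for _ in range(steps):
    v = beta * v - alpha * grad(w_mom)
    w_mom += v

print(f"SGD:      w = {w_sgd:.4f}")
print(f"Momentum: w = {w_mom:.4f}")
```

With this small learning rate, momentum ends up much closer to the minimum at \(w=3\) after the same number of steps, while SGD creeps along; momentum's trajectory oscillates slightly around the minimum, matching the "overshoots then converges" description above.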
For Adam from scratch:
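A minimal NumPy sketch of Adam minimizing \(f(w)=(w-3)^2\), using the standard defaults \(\beta_1=0.9\), \(\beta_2=0.999\), \(\epsilon=10^{-8}\); the learning rate and step count here are illustrative:

```python
import numpy as np

def grad(w):
    return 2.0 * (w - 3.0)  # gradient of f(w) = (w-3)^2

alpha, beta1, beta2, eps = 0.2, 0.9, 0.999, 1e-8
w, m, v = 0.0, 0.0, 0.0

for t in range(1, 201):                       # t starts at 1 for bias correction
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g           # first moment (mean)
    v = beta2 * v + (1 - beta2) * g * g       # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    w -= alpha * m_hat / (np.sqrt(v_hat) + eps)

print(f"Adam: w = {w:.4f}")
```

Note that early on, \(\hat{m}/\sqrt{\hat{v}} \approx \pm 1\), so Adam takes steps of roughly \(\alpha\) regardless of the raw gradient magnitude; the step size then shrinks as the gradient history near the minimum averages out.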
Extra practice
Warm-up: One SGD step by hand: \(f(w) = w^2\), \(w=2\), \(lr=0.1\). Compute \(\nabla f\), then \(w_{new}\).
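You can check your hand computation with a two-line sketch:

```python
w, lr = 2.0, 0.1
g = 2 * w            # f(w) = w^2, so df/dw = 2w
w_new = w - lr * g   # one SGD step
print(g, w_new)      # 4.0 1.6
```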
Coding: Implement Adam from scratch (all five update equations). Test it on \(f(w)=(w-3)^2\) and verify that it converges in fewer steps than SGD.
Challenge: Extend to a 2D function \(f(w_1, w_2) = w_1^2 + 10 w_2^2\) (an elongated bowl). Compare SGD and Adam — why does Adam handle the different scales better?
Variant: Implement RMSprop: \(v \leftarrow \rho v + (1-\rho)(\nabla L)^2\), \(w \leftarrow w - \frac{\alpha}{\sqrt{v}+\epsilon}\nabla L\). RMSprop is Adam without momentum — compare them.
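One possible shape of the RMSprop variant, as a sketch with illustrative values \(\rho=0.9\), \(\alpha=0.1\) (try your own before reading it):

```python
import numpy as np

def grad(w):
    return 2.0 * (w - 3.0)  # gradient of f(w) = (w-3)^2

alpha, rho, eps = 0.1, 0.9, 1e-8
w, v = 0.0, 0.0
for _ in range(200):
    g = grad(w)
    v = rho * v + (1 - rho) * g * g       # EMA of squared gradients
    w -= alpha * g / (np.sqrt(v) + eps)   # gradient scaled by its recent RMS
print(f"RMSprop: w = {w:.4f}")
```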
Debug: Fix the momentum bug below where \(\beta\) is applied to the gradient instead of the velocity:
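The buggy snippet is reconstructed below from the description (values \(\alpha=0.1\), \(\beta=0.9\) are illustrative): \(\beta\) multiplies the gradient and \(\alpha\) multiplies the velocity, the reverse of the correct rule \(v \leftarrow \beta v - \alpha \nabla L\). Run it and watch the iterate diverge:

```python
def grad(w):
    return 2.0 * (w - 3.0)  # gradient of f(w) = (w-3)^2

alpha, beta = 0.1, 0.9
w, v = 0.0, 0.0
for _ in range(50):
    v = beta * grad(w) - alpha * v   # BUG: beta and alpha are swapped
    w = w + v
print(f"Buggy momentum: w = {w:.3e}")  # ends far from the minimum at w = 3
```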
Conceptual: Why does Adam use bias correction (\(\hat{m}\) and \(\hat{v}\))? What value would \(m\) take at step \(t=1\) if bias correction were omitted and \(\beta_1=0.9\), gradient \(=1\)?
Recall: In one sentence each: (a) What is the role of \(\epsilon\) in Adam? (b) Why do we need momentum? (c) What does a learning rate scheduler do?