Learning objectives

  • Implement SGD, Momentum, and Adam from scratch in NumPy
  • Understand how learning rate affects convergence speed and stability
  • Explain why adaptive optimizers like Adam often outperform plain SGD in practice
  • Recognize that deep RL training loops apply their gradients through these same optimizers

Concept and real-world motivation

An optimizer controls how weights are updated after each gradient computation. The simplest optimizer, Stochastic Gradient Descent (SGD), moves weights in the direction opposite to the gradient by a fixed step size (learning rate \(\alpha\)). However, SGD can oscillate or converge slowly in valleys with steep walls and shallow floors.

Momentum adds a “velocity” term that accumulates past gradients, allowing the optimizer to build up speed in consistent directions and damp oscillations. Think of a ball rolling downhill — it picks up momentum instead of stopping at every bump. Adam goes further by keeping per-parameter adaptive learning rates: parameters that receive large gradients get smaller updates, and rarely updated parameters get larger updates. In practice, Adam is the default for most deep learning and RL work.

In RL: DQN uses Adam. Policy gradient methods use Adam or RMSprop. The TD loss or policy gradient is the “gradient” these optimizers receive. Choosing a learning rate too large causes instability; too small causes painfully slow convergence.

Math:

SGD: \(w \leftarrow w - \alpha \nabla L\)

Momentum: \(v \leftarrow \beta v - \alpha \nabla L\), \(w \leftarrow w + v\)

Adam:

  • \(m \leftarrow \beta_1 m + (1-\beta_1)\nabla L\) (first moment — mean)
  • \(v \leftarrow \beta_2 v + (1-\beta_2)(\nabla L)^2\) (second moment — uncentered variance)
  • Bias-corrected: \(\hat{m} = m/(1-\beta_1^t)\), \(\hat{v} = v/(1-\beta_2^t)\)
  • \(w \leftarrow w - \alpha \frac{\hat{m}}{\sqrt{\hat{v}}+\epsilon}\)
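As a quick check of the bias correction, consider the first step (\(t=1\)) with \(m\) and \(v\) initialized to zero and the default \(\beta_1 = 0.9\), \(\beta_2 = 0.999\): the raw moments are \(m = (1-\beta_1)\nabla L = 0.1\,\nabla L\) and \(v = (1-\beta_2)(\nabla L)^2 = 0.001\,(\nabla L)^2\). Dividing by \(1-\beta_1^1 = 0.1\) and \(1-\beta_2^1 = 0.001\) restores \(\hat{m} = \nabla L\) and \(\hat{v} = (\nabla L)^2\), so the first update has size \(\approx \alpha\); without the correction it would be \(\alpha \cdot 0.1/\sqrt{0.001} \approx 3.16\,\alpha\), a bias caused by the mismatch between the two decay rates.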

Illustration — Adam loss curve (50 steps):

Exercise: Implement SGD and Momentum from scratch and compare them minimizing \(f(w) = (w-3)^2\).

Try it — edit and run (Shift+Enter)

Professor’s hints

  • The learning rate is the single most important hyperparameter. Try 10x larger and 10x smaller to see the effect.
  • Momentum \(\beta=0.9\) means “remember 90% of previous velocity.” Higher \(\beta\) (e.g. 0.99) gives more momentum but can overshoot.
  • Adam’s bias correction is important in early steps — both moment estimates start biased toward zero, and because \(v\) is more strongly biased than \(m\), omitting the correction would make the first updates much too large.
  • In RL, Adam with lr=3e-4 is a common default that works across many problems.
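The first hint can be made concrete with a quick sweep (a minimal sketch on \(f(w) = (w-3)^2\); the exact stability threshold depends on the curvature of the loss, so the specific values here are illustrative):

```python
def sgd(lr, w=0.0, steps=50):
    # plain SGD on f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
    for _ in range(steps):
        w = w - lr * 2 * (w - 3)
    return w

for lr in (0.01, 0.1, 1.1):
    w = sgd(lr)
    print(f"lr={lr:<5} final w = {w:.3f}  (error {abs(w - 3):.3f})")
```

With lr=0.01 the error shrinks slowly, lr=0.1 converges quickly, and lr=1.1 diverges: on this quadratic each step multiplies the error by \(1 - 2\,lr\), so any \(lr \ge 1\) oscillates or blows up.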

Common pitfalls

  • Applying \(\beta\) to the gradient instead of the velocity vector (see debug exercise below).
  • Forgetting to reset momentum state when you restart training or change the model architecture.
  • Using the same learning rate for all problems — always tune it.

Worked solution

SGD converges steadily but slowly. Momentum overshoots slightly, then converges faster because it builds up speed along the consistent downhill direction.
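That comparison can be sketched as follows (a minimal version; the hyperparameters, lr=0.01 for 100 steps, are illustrative and chosen so the effect of momentum is easy to see):

```python
def grad(w):
    # gradient of f(w) = (w - 3)^2
    return 2 * (w - 3)

def sgd(lr=0.01, w=0.0, steps=100):
    history = [w]
    for _ in range(steps):
        w = w - lr * grad(w)          # step against the gradient
        history.append(w)
    return w, history

def momentum(lr=0.01, beta=0.9, w=0.0, steps=100):
    v, history = 0.0, [w]
    for _ in range(steps):
        v = beta * v - lr * grad(w)   # beta decays the velocity, not the gradient
        w = w + v                     # move by the accumulated velocity
        history.append(w)
    return w, history

w_sgd, _ = sgd()
w_mom, _ = momentum()
print(f"SGD:      w = {w_sgd:.4f}")  # creeps toward 3
print(f"Momentum: w = {w_mom:.4f}")  # overshoots, then settles near 3
```

Note where \(\beta\) is applied: it scales the previous velocity, while the fresh gradient always enters at full strength through the \(-\alpha \nabla L\) term.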

For Adam from scratch:

import numpy as np

def adam(grad_fn, w_init=0.0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=30):
    w = w_init
    m, v = 0.0, 0.0                               # first and second moment estimates
    history = [w]
    for t in range(1, steps + 1):                 # t starts at 1 for bias correction
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g           # update biased first moment
        v = beta2 * v + (1 - beta2) * g ** 2      # update biased second moment
        m_hat = m / (1 - beta1 ** t)              # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        history.append(w)
    return w, history

Extra practice

  1. Warm-up: One SGD step by hand: \(f(w) = w^2\), \(w=2\), \(lr=0.1\). Compute \(\nabla f\), then \(w_{new}\).

    Try it — edit and run (Shift+Enter)

  2. Coding: Implement Adam from scratch (all 5 update equations). Test on \(f(w)=(w-3)^2\) and verify it converges faster than SGD in fewer steps.

  3. Challenge: Extend to a 2D function \(f(w_1, w_2) = w_1^2 + 10 w_2^2\) (an elongated bowl). Compare SGD and Adam — why does Adam handle the different scales better?

  4. Variant: Implement RMSprop: \(v \leftarrow \rho v + (1-\rho)(\nabla L)^2\), \(w \leftarrow w - \frac{\alpha}{\sqrt{v}+\epsilon}\nabla L\). RMSprop is essentially Adam without the first-moment (momentum) term and without bias correction — compare them.

  5. Debug: Fix the momentum bug below where \(\beta\) is applied to the gradient instead of the velocity:

    Try it — edit and run (Shift+Enter)

  6. Conceptual: Why does Adam use bias correction (\(\hat{m}\) and \(\hat{v}\))? What value would \(m\) take at step \(t=1\) if bias correction were omitted and \(\beta_1=0.9\), gradient=1?

  7. Recall: In one sentence each: (a) What is the role of \(\epsilon\) in Adam? (b) Why do we need momentum? (c) What does a learning rate scheduler do?