Learning objectives
- Implement SGD, Momentum, and Adam from scratch in NumPy
- Understand how learning rate affects convergence speed and stability
- Explain why adaptive optimizers like Adam often outperform plain SGD in practice
- Recognize that these same optimizers drive training in deep RL methods
Concept and real-world motivation
An optimizer controls how weights are updated after each gradient computation. The simplest optimizer, Stochastic Gradient Descent (SGD), moves weights in the direction opposite to the gradient by a fixed step size (learning rate \(\alpha\)). However, SGD can oscillate or converge slowly in valleys with steep walls and shallow floors.
Momentum adds a “velocity” term that accumulates past gradients, allowing the optimizer to build up speed in consistent directions and damp oscillations. Think of a ball rolling downhill — it picks up momentum instead of stopping at every bump. Adam goes further by keeping per-parameter adaptive learning rates: parameters that receive large gradients get smaller updates, and rarely updated parameters get larger updates. In practice, Adam is the default for most deep learning and RL work.
In RL: DQN uses Adam. Policy gradient methods use Adam or RMSprop. The gradient of the TD loss or of the policy objective is what these optimizers receive. A learning rate that is too large causes instability; one that is too small causes painfully slow convergence.
Math:
SGD: \(w \leftarrow w - \alpha \nabla L\)
Momentum: \(v \leftarrow \beta v - \alpha \nabla L\), \(w \leftarrow w + v\)
Adam:
- \(m \leftarrow \beta_1 m + (1-\beta_1)\nabla L\) (first moment — mean)
- \(v \leftarrow \beta_2 v + (1-\beta_2)(\nabla L)^2\) (second moment — uncentered variance)
- Bias-corrected: \(\hat{m} = m/(1-\beta_1^t)\), \(\hat{v} = v/(1-\beta_2^t)\)
- \(w \leftarrow w - \alpha \frac{\hat{m}}{\sqrt{\hat{v}}+\epsilon}\)
Exercise: Implement SGD and Momentum from scratch and compare them minimizing \(f(w) = (w-3)^2\).
Professor’s hints
- The learning rate is the single most important hyperparameter. Try 10x larger and 10x smaller to see the effect.
- Momentum \(\beta=0.9\) means “remember 90% of previous velocity.” Higher \(\beta\) (e.g. 0.99) gives more momentum but can overshoot.
- Adam’s bias correction is important in early steps — without it, the first update would be much too small.
- In RL, Adam with lr=3e-4 is a common default that works across many problems.
Common pitfalls
- Applying \(\beta\) to the gradient instead of the velocity vector (see debug exercise below).
- Forgetting to reset momentum state when you restart training or change the model architecture.
- Using the same learning rate for all problems — always tune it.
Worked solution
SGD converges but more slowly. Momentum overshoots slightly then converges faster because it accelerates in the consistent downhill direction.
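This comparison can be reproduced with a minimal sketch; the starting point \(w=0\), \(\alpha=0.01\), \(\beta=0.9\), and the 100-step budget are illustrative choices, not prescribed by the exercise:

```python
def grad(w):
    """Gradient of f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

alpha, beta, steps = 0.01, 0.9, 100

# Plain SGD: w <- w - alpha * grad
w_sgd = 0.0
for _ in range(steps):
    w_sgd -= alpha * grad(w_sgd)

# Momentum: v <- beta * v - alpha * grad;  w <- w + v
w_mom, v = 0.0, 0.0
for _ in range(steps):
    v = beta * v - alpha * grad(w_mom)
    w_mom += v

print(f"SGD:      w = {w_sgd:.4f}")
print(f"Momentum: w = {w_mom:.4f}")
```

With this small learning rate, momentum ends up much closer to the minimum at \(w=3\) after the same number of steps, while SGD creeps along; momentum's trajectory oscillates slightly around the minimum, matching the "overshoots then converges" description above.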
For Adam from scratch:
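A minimal NumPy sketch of Adam minimizing \(f(w)=(w-3)^2\), using the standard defaults \(\beta_1=0.9\), \(\beta_2=0.999\), \(\epsilon=10^{-8}\); the learning rate and step count here are illustrative:

```python
import numpy as np

def grad(w):
    return 2.0 * (w - 3.0)  # gradient of f(w) = (w-3)^2

alpha, beta1, beta2, eps = 0.2, 0.9, 0.999, 1e-8
w, m, v = 0.0, 0.0, 0.0

for t in range(1, 201):                       # t starts at 1 for bias correction
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g           # first moment (mean)
    v = beta2 * v + (1 - beta2) * g * g       # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    w -= alpha * m_hat / (np.sqrt(v_hat) + eps)

print(f"Adam: w = {w:.4f}")
```

Note that early on, \(\hat{m}/\sqrt{\hat{v}} \approx \pm 1\), so Adam takes steps of roughly \(\alpha\) regardless of the raw gradient magnitude; the step size then shrinks as the gradient history near the minimum averages out.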
Extra practice
Warm-up: One SGD step by hand: \(f(w) = w^2\), \(w=2\), \(lr=0.1\). Compute \(\nabla f\), then \(w_{new}\).
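You can check your hand computation with a two-line sketch:

```python
w, lr = 2.0, 0.1
g = 2 * w            # f(w) = w^2, so df/dw = 2w
w_new = w - lr * g   # one SGD step
print(g, w_new)      # 4.0 1.6
```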
Coding: Implement Adam from scratch (all five update equations). Test it on \(f(w)=(w-3)^2\) and verify that it converges in fewer steps than SGD.
Challenge: Extend to a 2D function \(f(w_1, w_2) = w_1^2 + 10 w_2^2\) (an elongated bowl). Compare SGD and Adam — why does Adam handle the different scales better?
Variant: Implement RMSprop: \(v \leftarrow \rho v + (1-\rho)(\nabla L)^2\), \(w \leftarrow w - \frac{\alpha}{\sqrt{v}+\epsilon}\nabla L\). RMSprop is Adam without momentum — compare them.
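One possible shape of the RMSprop variant, as a sketch with illustrative values \(\rho=0.9\), \(\alpha=0.1\) (try your own before reading it):

```python
import numpy as np

def grad(w):
    return 2.0 * (w - 3.0)  # gradient of f(w) = (w-3)^2

alpha, rho, eps = 0.1, 0.9, 1e-8
w, v = 0.0, 0.0
for _ in range(200):
    g = grad(w)
    v = rho * v + (1 - rho) * g * g       # EMA of squared gradients
    w -= alpha * g / (np.sqrt(v) + eps)   # gradient scaled by its recent RMS
print(f"RMSprop: w = {w:.4f}")
```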
Debug: Fix the momentum bug below where \(\beta\) is applied to the gradient instead of the velocity:
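The buggy snippet is reconstructed below from the description (values \(\alpha=0.1\), \(\beta=0.9\) are illustrative): \(\beta\) multiplies the gradient and \(\alpha\) multiplies the velocity, the reverse of the correct rule \(v \leftarrow \beta v - \alpha \nabla L\). Run it and watch the iterate diverge:

```python
def grad(w):
    return 2.0 * (w - 3.0)  # gradient of f(w) = (w-3)^2

alpha, beta = 0.1, 0.9
w, v = 0.0, 0.0
for _ in range(50):
    v = beta * grad(w) - alpha * v   # BUG: beta and alpha are swapped
    w = w + v
print(f"Buggy momentum: w = {w:.3e}")  # ends far from the minimum at w = 3
```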
Conceptual: Why does Adam use bias correction (\(\hat{m}\) and \(\hat{v}\))? What value would \(m\) take at step \(t=1\) if bias correction were omitted and \(\beta_1=0.9\), gradient \(=1\)?
Recall: In one sentence each: (a) What is the role of \(\epsilon\) in Adam? (b) Why do we need momentum? (c) What does a learning rate scheduler do?