Learning objectives

  • Recognize overfitting from learning curves and understand why it happens
  • Implement L2 regularization: add the penalty to the loss and adjust the gradient
  • Implement dropout: randomly zero out neurons during training
  • Understand when each technique is appropriate and how they connect to RL

Concept and real-world motivation

Overfitting happens when a model memorizes the training data instead of learning generalizable patterns. The training loss keeps decreasing, but the validation loss starts increasing — the model has “overfit.” This is especially easy to trigger with large networks on small datasets.

The main fixes: L2 regularization penalizes large weights, encouraging the model to spread its representation across many small weights rather than a few large ones. Dropout randomly disables neurons during training, preventing co-adaptation — neurons can’t rely on each other and must learn independently useful features. Early stopping halts training when validation loss stops improving.

In RL: Overfitting in RL is called overspecialization — the agent memorizes specific environment states or transitions instead of generalizing. DQN uses target networks and replay buffers partly to reduce this. Policy networks in PPO often use entropy bonuses to avoid overconfident (overfit) policies.

Math:

L2 regularization: \(L_{reg} = L + \frac{\lambda}{2}\|w\|^2\)

Gradient with L2: \(\nabla L_{reg} = \nabla L + \lambda w\)
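The identity \(\nabla L_{reg} = \nabla L + \lambda w\) can be verified numerically. A minimal sketch with an assumed toy quadratic data loss \(L(w) = \frac{1}{2}\|w - t\|^2\) (the target \(t\) and values are illustrative, not from the lesson), comparing the analytic gradient against central finite differences:

```python
import numpy as np

def loss(w, t, lam):
    # toy quadratic data loss plus the L2 penalty (lam/2) * ||w||^2
    return 0.5 * np.sum((w - t) ** 2) + 0.5 * lam * np.sum(w ** 2)

def grad(w, t, lam):
    # analytic gradient: grad of data loss plus lam * w
    return (w - t) + lam * w

w = np.array([0.5, -1.2, 2.0])
t = np.array([1.0, 0.0, 1.0])
lam = 1e-2
eps = 1e-6

# central finite differences along each coordinate direction
num = np.array([(loss(w + eps * e, t, lam) - loss(w - eps * e, t, lam)) / (2 * eps)
                for e in np.eye(3)])
print(np.allclose(num, grad(w, t, lam)))  # True
```

The same check works for any differentiable loss: if the finite-difference estimate disagrees with your analytic gradient, the L2 term was likely added to the loss but not the gradient (or vice versa).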

Dropout during training: randomly zero out each neuron with probability \(p\), then scale the surviving activations by \(\frac{1}{1-p}\) (inverted dropout) so the expected activation matches the test-time forward pass, where no masking or scaling is applied.
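A quick way to see why the \(\frac{1}{1-p}\) scaling matters: apply an inverted-dropout mask to a large constant vector and check that the mean survives. A minimal sketch (the seed and array size are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.full(200_000, 2.0)  # constant activations, so the average is easy to read
p = 0.5                    # drop probability

mask = (rng.random(a.shape) > p).astype(float)  # each unit kept with prob 1 - p
dropped = a * mask / (1 - p)                    # inverted-dropout scaling

print(dropped.mean())  # close to 2.0, the original expected activation
```

Without the division by \(1-p\), the mean would collapse to roughly \(2.0 \times (1-p) = 1.0\), and test-time activations (no dropout) would be systematically larger than anything the network saw during training.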

Illustration — Overfitting: train loss vs validation loss:

The validation loss follows the train loss early, then diverges — this is the overfitting signal. Stop training at the validation loss minimum.

Exercise: Add L2 regularization to a 1-layer training loop and compare final weights.

Try it — edit and run (Shift+Enter)

Professor’s hints

  • The regularization strength \(\lambda\) is a hyperparameter: too large shrinks everything to zero; too small has no effect. Typical values: 1e-4 to 1e-2.
  • Dropout rate \(p=0.5\) is common for fully-connected layers; \(p=0.1\) to \(0.3\) for convolutional layers.
  • Always turn off dropout at test/evaluation time — only apply it during training.
  • L2 regularization is equivalent to placing a Gaussian prior over weights (MAP estimation).

Common pitfalls

  • Applying dropout during evaluation — this is a common bug that degrades test performance unpredictably.
  • Forgetting to scale activations after dropout (inverted dropout ensures the same expected activation magnitude).
  • Using L2 on bias terms — convention is to regularize weights but not biases.
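The last pitfall is easy to avoid by keeping weight and bias updates separate. A minimal sketch of one gradient step (the shapes, learning rate, and placeholder gradients are illustrative assumptions, not from the lesson):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # weight matrix: regularized
b = np.zeros(3)               # bias vector: NOT regularized
lam, lr = 1e-3, 0.1

grad_W = np.ones_like(W)      # placeholder gradients standing in for backprop
grad_b = np.ones_like(b)

W -= lr * (grad_W + lam * W)  # L2 / weight decay applied to weights only
b -= lr * grad_b              # plain gradient step for biases
```

Biases shift activations rather than scale inputs, so shrinking them toward zero adds no useful capacity control and can hurt fitting.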

Worked solution

L2 adds lambda * w to the gradient. This “weight decay” pulls weights toward zero during every update, preventing any weight from growing very large.

For dropout with inverted scaling:

import numpy as np

def dropout(a, p=0.5, training=True):
    if not training:
        return a  # no dropout at test time
    mask = (np.random.rand(*a.shape) > p).astype(float)  # keep each unit with prob 1 - p
    return (a * mask) / (1 - p)  # inverted scaling preserves the expected activation

Extra practice

  1. Warm-up: Implement dropout mask in NumPy — randomly set neurons to 0 with p=0.5, then apply inverted scaling.

    Try it — edit and run (Shift+Enter)

  2. Coding: Add validation set tracking to the regularization training loop. Plot (conceptually) train loss and val loss. At which epoch does the validation loss stop decreasing?

  3. Challenge: Implement early stopping: monitor validation loss and stop training if it hasn’t improved for 10 consecutive epochs. Return the weights from the best epoch.

  4. Variant: Compare L1 regularization (\(\lambda |w|\)) vs L2 (\(\lambda w^2/2\)) on sparse data. L1 encourages exact zeros; L2 shrinks weights smoothly. Implement both and compare weight histograms.

  5. Debug: Fix the dropout below that is applied during evaluation:

    Try it — edit and run (Shift+Enter)

  6. Conceptual: In RL, the policy network is evaluated online during environment interaction. Why is it critical that dropout is disabled (inference mode) during rollout? What would happen if it weren’t?

  7. Recall: Name three regularization techniques and describe in one sentence how each prevents overfitting.