This page covers the calculus you need for the preliminary assessment: derivatives of common functions, the chain rule, and how they appear in logistic regression and policy gradients. Back to Preliminary.
Why this matters for RL
Policy gradients and loss-based updates use derivatives and the chain rule. You don’t need to derive everything by hand in practice (autograd does it), but you need to understand what a gradient is and how it’s used. The sigmoid and chain rule appear in logistic policies and in backpropagation.
Learning objectives
Compute derivatives of \(\ln(1+e^x)\) and simple composites; state and apply the chain rule; connect these to sigmoid and to RL updates.
Core concepts
- Derivative rules: \(\frac{d}{dx}e^x = e^x\), \(\frac{d}{dx}\ln x = \frac{1}{x}\). So \(\frac{d}{dx}\ln(1+e^x) = \frac{e^x}{1+e^x} = \sigma(x)\) (sigmoid).
- Chain rule: If \(y = f(u)\) and \(u = g(x)\), then \(\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}\). For compositions, multiply derivatives along the path.
- In RL: Gradient descent minimizes loss: \(\theta \leftarrow \theta - \alpha \nabla_\theta L\). Policy gradient maximizes return: \(\theta \leftarrow \theta + \alpha \nabla_\theta J\). Backprop is the chain rule applied layer by layer.
Worked problems (with explanations)
1. Derivative of \(\ln(1+e^x)\) (Q5)
Q: Compute the derivative of \(f(x) = \ln(1+e^x)\) with respect to \(x\). This function appears in logistic regression and softmax.
\(f’(x) = \frac{e^x}{1+e^x} = \sigma(x)\) (the sigmoid function). Let \(u = 1 + e^x\). Then \(f = \ln u\), so \(\frac{df}{du} = \frac{1}{u}\) and \(\frac{du}{dx} = e^x\). By the chain rule, \(\frac{df}{dx} = \frac{1}{u} \cdot e^x = \frac{e^x}{1+e^x}\). The sigmoid \(\sigma(x) = \frac{e^x}{1+e^x}\) is the derivative of the “softplus” \(\ln(1+e^x)\). In logistic regression, the log-likelihood involves \(\ln(1+e^x)\); its gradient with respect to the linear predictor involves the sigmoid. In RL, sigmoid is used for binary action probabilities and appears in policy gradient formulas.Answer and explanation
Derivation
Explanation
2. Chain rule statement and example (Q6)
Q: What is the chain rule for derivatives? Give an example: if \(y = f(u)\) and \(u = g(x)\), express \(\frac{dy}{dx}\). Apply to \(y = \sin(x^2)\).
\(\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}\). \(y = \sin(x^2)\). Let \(u = x^2\), so \(y = \sin u\). Then \(\frac{dy}{du} = \cos u\) and \(\frac{du}{dx} = 2x\). So \(\frac{dy}{dx} = \cos(x^2) \cdot 2x\). The chain rule says: the rate of change of \(y\) with respect to \(x\) is the rate of change of \(y\) with respect to \(u\) times the rate of change of \(u\) with respect to \(x\). In neural networks, the output is a long composition of functions; backprop multiplies derivatives along the path from output to each parameter, which is exactly the chain rule.Answer and explanation
Example
Explanation
3. Another derivative by chain rule
Q: Compute \(\frac{d}{dx}\bigl(e^{x^2}\bigr)\).
Let \(u = x^2\), so \(y = e^u\). Then \(\frac{dy}{du} = e^u\) and \(\frac{du}{dx} = 2x\). So \(\frac{dy}{dx} = e^{x^2} \cdot 2x\). The inner function is \(x^2\), so we must multiply by \(2x\). Forgetting the inner derivative is a common mistake. In RL, when we differentiate a loss with respect to an early layer’s weights, we get a product of many such factors (one per layer) via the chain rule.Answer and explanation
Explanation
Math example: chain rule tree
For \(y = f(g(h(x)))\):
- \(\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx}\).
Each factor is the derivative of one “layer” with respect to its input. In a neural network, \(h\) might be the first layer, \(g\) the second, \(f\) the loss. Backprop computes these derivatives from output to input and multiplies them. So the chain rule is the reason we can train deep networks: we break the gradient into local pieces and multiply.
Code example (with explanation)
| |
Explanation
PyTorch builds a computation graph. When we call y.backward(), it applies the chain rule: the gradient of \(\sqrt{u}\) is \(1/(2\sqrt{u})\), and \(du/dx = 2x\), so \(dy/dx = x/\sqrt{1+x^2}\). At \(x=2\), that is \(2/\sqrt{5}\). Autograd does this automatically for any composition—that’s how we train policies and value networks in RL.
Professor’s hints
- Policy gradient maximizes return, so the update is \(\theta \leftarrow \theta + \alpha \nabla_\theta J\) (plus sign). Loss minimization uses minus.
- Always include the inner derivative in the chain rule: for \(f(g(x))\), multiply by \(g’(x)\).
- In RL, we differentiate the log-probability or value prediction with respect to parameters; the reward is usually not differentiated.
Common pitfalls
- Maximize vs minimize: Policy gradient uses a plus in the update; loss minimization uses minus. Flipping the sign flips the direction of learning.
- Forgetting the inner derivative: \(\frac{d}{dx}e^{x^2} = e^{x^2} \cdot 2x\), not just \(e^{x^2}\).
- Treating constants as variables: When differentiating w.r.t. \(\theta\), quantities that don’t depend on \(\theta\) (e.g. observed rewards) have derivative zero.