Learning objectives

  • Write the policy gradient theorem for a simple one-step MDP: the gradient of expected reward with respect to policy parameters.
  • Show that \(\nabla_\theta \mathbb{E}[R] = \mathbb{E}[ \nabla_\theta \log \pi(a|s;\theta) \, Q^\pi(s,a) ]\) (or the equivalent for one step).
  • Recognize why this form is useful: we can estimate the expectation from samples (trajectories) without knowing the transition model.

Concept and real-world RL

In policy gradient methods we maximize the expected return \(J(\theta) = \mathbb{E}_\pi[G]\) by gradient ascent on \(\theta\). The policy gradient theorem says that \(\nabla_\theta J\) can be written as an expectation over states and actions under \(\pi\), involving \(\nabla_\theta \log \pi(a|s;\theta)\) and the return (or Q). For a one-step MDP (one state, one action, one reward), the derivation is simple: \(J = \sum_a \pi(a|s) \, r(s,a)\), so \(\nabla_\theta J = \sum_a \nabla_\theta \pi(a|s) \, r(s,a)\). Using the log-derivative trick \(\nabla \pi = \pi \nabla \log \pi\), we get \(\mathbb{E}[ \nabla \log \pi(a|s) \, Q(s,a) ]\). In robot control or game AI, we rarely have the full model; this identity lets us estimate the gradient from sampled actions and rewards only.
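The sampling claim can be checked directly. The sketch below (hypothetical rewards and parameters, a softmax policy over two actions) computes the exact gradient \(\sum_a \nabla_\theta \pi(a|s) \, r(s,a)\) and compares it with the Monte Carlo estimate \(\mathbb{E}[\nabla_\theta \log \pi(a|s) \, r(s,a)]\) from sampled actions alone:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-step MDP: single state, two actions with fixed rewards (hypothetical values).
rewards = np.array([1.0, 0.2])
theta = np.array([0.3, -0.1])   # softmax logits, one per action

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

p = pi(theta)

# Exact gradient: dJ/dtheta_k = sum_a dpi(a)/dtheta_k * r(a),
# where for softmax dpi_a/dtheta_k = pi_a * (1[a==k] - pi_k).
exact = np.array([
    sum(p[a] * ((a == k) - p[k]) * rewards[a] for a in range(2))
    for k in range(2)
])

# Sample estimate via the log-derivative trick:
# grad_theta log pi(a) = onehot(a) - pi for a softmax policy.
N = 200_000
actions = rng.choice(2, size=N, p=p)
grad_log = np.eye(2)[actions] - p                      # shape (N, 2)
estimate = (grad_log * rewards[actions][:, None]).mean(axis=0)

print("exact:", exact)
print("Monte Carlo:", estimate)
```

With enough samples the two agree to a few decimal places, illustrating that the gradient is estimable without any transition model.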

Where you see this in practice: The policy gradient theorem is the foundation for REINFORCE, actor-critic, and PPO. It appears in robotics (policy search), game playing, and dialogue systems.

Illustration (gradient magnitude): Policy gradient updates scale with the return; higher return trajectories get larger updates. The chart below shows the magnitude of \(\nabla \log \pi(a|s)\) weighted by return over a few steps (conceptual).

Exercise: Derive the policy gradient theorem for a simple one-step MDP. Show that the gradient of the expected reward is \(\mathbb{E}[\nabla \log \pi(a|s) Q^\pi(s,a)]\).

Professor’s hints

  • One-step MDP: single state \(s\), agent samples \(a \sim \pi(\cdot|s)\), gets reward \(r(s,a)\). So \(J(\theta) = \mathbb{E}_{a \sim \pi}[r(s,a)] = \sum_a \pi(a|s) r(s,a)\).
  • Log-derivative trick: \(\nabla_\theta \pi(a|s) = \pi(a|s) \, \nabla_\theta \log \pi(a|s)\). So \(\nabla_\theta J = \sum_a \pi(a|s) \, \nabla_\theta \log \pi(a|s) \, r(s,a) = \mathbb{E}_\pi[ \nabla_\theta \log \pi(a|s) \, r(s,a) ]\). For one step, \(r(s,a) = Q^\pi(s,a)\).
  • In the multi-step case, \(Q^\pi(s,a)\) is replaced by the return from that step (e.g. \(G_t\)) in the full theorem.

Common pitfalls

  • Wrong sign: We maximize \(J\), so the update is \(\theta \leftarrow \theta + \alpha \nabla_\theta J\), not minus. Loss-based frameworks often minimize \(-J\), in which case gradient descent on \(-J\) is equivalent.
  • Forgetting the expectation: The theorem gives an expectation; in practice we use a sample (one trajectory or one action) to get an unbiased estimate of the gradient.
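The sign convention in the first pitfall can be made concrete. A tiny sketch (hypothetical sampled action, reward, and step size): one gradient-ascent step on the sample estimate of \(J\) equals one gradient-descent step on the surrogate loss \(-\log \pi(a|s) \, r\).

```python
import numpy as np

theta = np.array([0.0, 0.0])                 # hypothetical softmax logits
p = np.exp(theta) / np.exp(theta).sum()
a, r = 0, 1.5                                # sampled action and observed reward (hypothetical)
alpha = 0.1

grad_log_pi = np.eye(2)[a] - p               # grad_theta log pi(a) for softmax

ascent_step = +alpha * grad_log_pi * r       # theta += alpha * (sample estimate of grad J)
descent_step = -alpha * (-grad_log_pi * r)   # theta -= alpha * grad of loss = -log pi(a) * r

assert np.allclose(ascent_step, descent_step)
```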

Worked solution (warm-up: \(J(p)\) and policy gradient form)

Warm-up: \(J(p) = p r_1 + (1-p) r_2\). So \(dJ/dp = r_1 - r_2\). In policy gradient form: \(\nabla_\theta J = \mathbb{E}[ \nabla_\theta \log \pi(a|s) \cdot r ]\); for this one-step MDP, \(\nabla \log \pi(a_1|s) = 1/p\) and \(\nabla \log \pi(a_2|s) = -1/(1-p)\), so \(\mathbb{E}[ \nabla \log \pi \cdot r ] = p \cdot (1/p) \cdot r_1 + (1-p) \cdot (-1/(1-p)) \cdot r_2 = r_1 - r_2 = dJ/dp\). This is the policy gradient theorem in the simplest case.
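The worked solution's identity can be verified in a couple of lines (the values \(p = 0.7\), \(r_1 = 1.0\), \(r_2 = -0.5\) are hypothetical):

```python
import math

p, r1, r2 = 0.7, 1.0, -0.5          # hypothetical action probability and rewards

dJ_dp = r1 - r2                     # analytic derivative of J(p) = p*r1 + (1-p)*r2

# Policy-gradient form: E[ d/dp log pi(a) * r ],
# with d/dp log pi(a1) = 1/p and d/dp log pi(a2) = -1/(1-p).
pg = p * (1 / p) * r1 + (1 - p) * (-1 / (1 - p)) * r2

assert math.isclose(pg, dJ_dp)
```

The probabilities cancel exactly, so the two expressions agree for any \(p \in (0, 1)\).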

Extra practice

  1. Warm-up: For a one-step MDP with two actions and \(\pi(a_1|s)=p\), \(\pi(a_2|s)=1-p\), and rewards \(r_1, r_2\), write \(J(p)\) and compute \(dJ/dp\) by hand. Then write it in the form \(\mathbb{E}[ \nabla \log \pi \, r ]\).
  2. Coding: In Python, for a discrete policy \(\pi = \mathrm{softmax}(\theta)\) with two actions, compute \(\nabla_\theta \log \pi(a|s)\) numerically (finite differences) and symbolically (derivative of log-softmax) and check they match.
  3. Challenge: State the policy gradient theorem for the multi-step case (infinite horizon or episodic). What replaces \(Q^\pi(s,a)\) when we use Monte Carlo returns?