Chapter 32: The Policy Objective Function

Learning objectives

- Write the policy gradient theorem for a simple one-step MDP: the gradient of expected reward with respect to the policy parameters.
- Show that \(\nabla_\theta \mathbb{E}[R] = \mathbb{E}\big[ \nabla_\theta \log \pi(a|s;\theta) \, Q^\pi(s,a) \big]\) (or the equivalent for one step).
- Recognize why this form is useful: the expectation can be estimated from samples (trajectories) without knowing the transition model.

Concept and real-world RL

In policy gradient methods we maximize the expected return \(J(\theta) = \mathbb{E}_\pi[G]\) by gradient ascent on \(\theta\). The policy gradient theorem says that \(\nabla_\theta J\) can be written as an expectation over states and actions under \(\pi\), involving \(\nabla_\theta \log \pi(a|s;\theta)\) and the return (or Q). For a one-step MDP (one state, one action, one reward), the derivation is simple: \(J = \sum_a \pi(a|s) \, r(s,a)\), so \(\nabla_\theta J = \sum_a \nabla_\theta \pi(a|s) \, r(s,a)\). Using the log-derivative trick \(\nabla \pi = \pi \, \nabla \log \pi\), we get \(\nabla_\theta J = \mathbb{E}\big[ \nabla_\theta \log \pi(a|s) \, Q(s,a) \big]\). In robot control or game AI, we rarely have the full model; this identity lets us estimate the gradient from sampled actions and rewards only. ...
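The one-step derivation above can be checked numerically. The sketch below (not from the chapter; the three-action rewards and the softmax parameterization are assumptions for illustration) compares the analytic gradient of \(J(\theta) = \sum_a \pi(a)\,r(a)\) against the score-function estimate \(\mathbb{E}[\nabla_\theta \log \pi(a)\, r(a)]\) computed from sampled actions, using only sampled rewards and no model:

```python
# Sketch: score-function (log-derivative) gradient estimate for a one-step MDP.
# Assumed setup: one state, 3 actions, softmax policy over logits theta.
import numpy as np

rng = np.random.default_rng(0)
r = np.array([1.0, 2.0, 0.5])   # hypothetical per-action rewards r(s, a)
theta = np.zeros(3)             # policy parameters (softmax logits)

def pi(theta):
    """Softmax policy pi(a|s; theta)."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

p = pi(theta)

# Analytic gradient: for a softmax policy,
# dJ/dtheta_k = pi_k * (r_k - sum_a pi_a r_a).
analytic = p * (r - p @ r)

# Sample-based estimate of E[ grad_theta log pi(a) * r(a) ],
# using grad_theta log pi(a) = onehot(a) - pi for softmax logits.
N = 200_000
actions = rng.choice(3, size=N, p=p)
onehot = np.eye(3)[actions]
estimate = ((onehot - p) * r[actions, None]).mean(axis=0)

print(analytic)
print(estimate)  # agrees with the analytic gradient up to sampling noise
```

The key point the example illustrates: the estimator touches only sampled actions and their rewards, never the reward table as a whole, which is exactly why the identity matters when no model is available.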

March 10, 2026 · 3 min · 585 words · codefrydev