Learning objectives
- Sketch the architecture of a two-network actor-critic: actor (policy \(\pi(a|s)\)) and critic (value \(V(s)\) or \(Q(s,a)\)).
- Write pseudocode for the update steps using the TD error \(\delta = r + \gamma V(s') - V(s)\) as the advantage for the policy.
- Explain why the critic reduces variance compared to using Monte Carlo returns \(G_t\).
Concept and real-world RL
Actor-critic methods maintain two networks: the actor selects actions from \(\pi(a|s;\theta)\), and the critic estimates the value function \(V(s;w)\) (or \(Q(s,a;w)\)). The TD error \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\) is a one-step estimate of the advantage; it is biased (because \(V\) is approximate) but has much lower variance than \(G_t\). The actor is updated along \(\nabla_\theta \log \pi(a_t|s_t)\,\delta_t\); the critic is updated to minimize \((r_t + \gamma V(s_{t+1}) - V(s_t))^2\). In robot control and game AI, actor-critic allows online, step-by-step updates instead of waiting for the episode to end, which speeds up learning.
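The per-step update described above can be sketched in PyTorch (a minimal sketch: the network sizes are made up, and a dummy transition stands in for an actual environment step):

```python
import torch

torch.manual_seed(0)
n_states, n_actions, gamma = 4, 2, 0.99

# Two separate networks: actor outputs action logits, critic outputs a scalar V(s).
actor = torch.nn.Sequential(torch.nn.Linear(n_states, 16), torch.nn.Tanh(),
                            torch.nn.Linear(16, n_actions))
critic = torch.nn.Sequential(torch.nn.Linear(n_states, 16), torch.nn.Tanh(),
                             torch.nn.Linear(16, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

# One dummy transition (s, a, r, s', done); in practice this comes from the env.
s, s_next = torch.randn(n_states), torch.randn(n_states)
r, done = 1.0, False

dist = torch.distributions.Categorical(logits=actor(s))
a = dist.sample()

# TD error: the bootstrap target r + gamma * V(s') is held fixed (no grad).
with torch.no_grad():
    target = r + gamma * (1.0 - float(done)) * critic(s_next).squeeze(-1)
delta = target - critic(s).squeeze(-1)

actor_loss = -dist.log_prob(a) * delta.detach()  # policy gradient, delta as advantage
critic_loss = delta.pow(2)                       # squared TD error

opt.zero_grad()
(actor_loss + critic_loss).backward()
opt.step()
```

In practice this loop runs once per environment step, which is exactly the online, step-by-step flavor mentioned above.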
Where you see this in practice: A2C, A3C, and many continuous control algorithms (DDPG, TD3, SAC) are actor-critic style. The pattern (policy + value/advantage) is central to PPO as well.
Illustration (actor and critic outputs): the critic estimates \(V(s)\); the actor outputs action probabilities. (Figure: example \(V(s)\) for a few states, and policy entropy over training.)
Exercise: Sketch the architecture of a two-network actor-critic: the actor outputs a distribution over actions, the critic outputs a value. Write pseudocode for the update steps using TD error.
Professor’s hints
- Sketch: state \(s\) → shared or separate layers → actor head (logits → softmax → \(\pi(a|s)\)) and critic head (scalar \(V(s)\)). Or two separate networks: one for \(\pi\), one for \(V\).
- Pseudocode: (1) sample \(a \sim \pi(\cdot|s)\), step env → \(s', r, \mathrm{done}\); (2) \(\delta = r + \gamma (1-\mathrm{done}) V(s') - V(s)\); (3) actor loss = \(-\log \pi(a|s)\,\delta\); critic loss = \(\delta^2\); (4) backward on both, update. Detach \(V(s')\) (e.g. `V(s_next).detach()` in PyTorch) when forming \(\delta\) for the actor so gradients do not flow through the target.
- TD error replaces \(G_t\): one number per step instead of a sum over the whole episode, so variance is typically lower.
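The shared-trunk variant from the first hint might look like this (a sketch; the hidden size and the `ActorCritic` class name are illustrative, not from the source):

```python
import torch

class ActorCritic(torch.nn.Module):
    """Shared trunk with an actor head (logits) and a critic head (scalar V)."""
    def __init__(self, n_states=4, n_actions=2, hidden=32):
        super().__init__()
        self.trunk = torch.nn.Sequential(torch.nn.Linear(n_states, hidden),
                                         torch.nn.Tanh())
        self.actor_head = torch.nn.Linear(hidden, n_actions)  # logits -> softmax -> pi(a|s)
        self.critic_head = torch.nn.Linear(hidden, 1)         # scalar V(s)

    def forward(self, s):
        h = self.trunk(s)
        return self.actor_head(h), self.critic_head(h).squeeze(-1)

net = ActorCritic()
logits, v = net(torch.randn(4))
probs = torch.softmax(logits, dim=-1)  # pi(a|s), a proper distribution over actions
```

A shared trunk amortizes feature learning across both heads; two fully separate networks avoid the two losses interfering with each other through shared weights.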
Common pitfalls
- Gradient through target: when computing \(\delta = r + \gamma V(s') - V(s)\), do not backprop through \(V(s')\) when updating the actor (use `.detach()`); otherwise the actor would try to change the critic's target.
- Critic and actor learning rates: if the critic learns too fast, \(V\) can be noisy and \(\delta\) becomes unreliable. Often the critic uses a separate optimizer or a smaller learning rate.
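The detach pitfall can be made concrete: \(\delta\) has the same numerical value with or without `.detach()` on \(V(s')\), but the gradients differ, because the undetached version also drags the target around. A small sketch with a linear stand-in critic:

```python
import torch

torch.manual_seed(0)
gamma = 0.99
V = torch.nn.Linear(3, 1)  # toy stand-in for the critic network
s, s_next, r = torch.randn(3), torch.randn(3), 1.0

# Wrong: gradient of the squared TD error also flows through V(s'),
# so the update moves the bootstrap target itself.
delta_wrong = r + gamma * V(s_next) - V(s)
delta_wrong.pow(2).sum().backward()
grad_wrong = V.weight.grad.clone()

V.weight.grad = None
V.bias.grad = None

# Right: r + gamma * V(s') is treated as a fixed target.
delta_right = r + gamma * V(s_next).detach() - V(s)
delta_right.pow(2).sum().backward()
grad_right = V.weight.grad.clone()
```

Both deltas are numerically identical, yet `grad_wrong` contains an extra \(\gamma\)-scaled term from \(s'\) that `grad_right` does not.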
Worked solution (warm-up: TD error and bias)
Warm-up: \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\). It is a biased estimate of \(A(s_t,a_t)\) because \(V(s_{t+1})\) is an approximation; the true advantage uses the true value function. So \(\delta_t\) has lower variance than \(G_t\) (one step of randomness) but is biased. Actor-critic trades off: we use \(\delta_t\) (or n-step) as the advantage estimate to reduce variance while accepting some bias.
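The variance claim can be checked numerically: in a toy chain with unit-mean noisy rewards and the true \(V\), the one-step TD error carries only one step of reward noise, while \(G_0\) accumulates noise from every step (a sketch; the chain length and noise level are made up):

```python
import random
import statistics

random.seed(0)
gamma, T, sigma = 0.9, 10, 1.0

# True values for a T-step chain whose mean reward is 1 at every step.
V = [sum(gamma**k for k in range(T - t)) for t in range(T + 1)]

G0s, deltas = [], []
for _ in range(5000):
    rewards = [1.0 + random.gauss(0, sigma) for _ in range(T)]
    G0s.append(sum(gamma**t * r for t, r in enumerate(rewards)))   # Monte Carlo return
    deltas.append(rewards[0] + gamma * V[1] - V[0])                # one-step TD error

var_G0 = statistics.variance(G0s)        # noise from all T steps
var_delta = statistics.variance(deltas)  # noise from one step only
```

With the true \(V\), \(\delta_0 = r_0 - 1\), so its variance is just \(\sigma^2\); the variance of \(G_0\) is \(\sigma^2 \sum_t \gamma^{2t}\), several times larger here. With an approximate \(V\), \(\delta_t\) additionally picks up bias, which is the trade-off the warm-up describes.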
Extra practice
- Warm-up: Write the TD error \(\delta_t\) in terms of \(r_t, \gamma, V(s_t), V(s_{t+1})\). Why is \(\delta_t\) a biased estimate of the advantage \(A(s_t,a_t)\)?
- Coding: Implement a minimal actor-critic: one-step update (single transition). Given \(s, a, r, s'\) and networks \(\pi, V\), compute \(\delta\), the actor loss, the critic loss, and take one gradient step. Test with dummy tensors.
- Challenge: Extend your pseudocode to n-step TD: use \(\delta_t = G_{t:t+n} - V(s_t)\) where \(G_{t:t+n} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})\). How does n affect bias and variance?
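A small helper for the n-step target \(G_{t:t+n}\) from the challenge (a sketch; it folds the rewards back-to-front and bootstraps with \(\gamma^n V(s_{t+n})\)):

```python
def n_step_return(rewards, v_bootstrap, gamma):
    """G_{t:t+n} = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * V(s_{t+n}),
    where n = len(rewards) and v_bootstrap = V(s_{t+n})."""
    g = v_bootstrap
    for r in reversed(rewards):  # each pass multiplies the tail by gamma and adds r
        g = r + gamma * g
    return g

# n = 1 reduces to the one-step TD target: 1 + 0.9 * 0.5 = 1.45
one_step = n_step_return([1.0], 0.5, 0.9)
# n = 3 discounts the bootstrap by gamma^3, so V errors matter less as n grows
three_step = n_step_return([1.0, 2.0, 3.0], 0.5, 0.9)
```

Larger n puts more weight on sampled rewards (less bias from the approximate \(V\), more variance); smaller n does the opposite, with n = 1 recovering the TD error above.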