Chapter 35: Actor-Critic Architectures

Learning objectives

Sketch the architecture of a two-network actor-critic: an actor (policy \(\pi(a|s)\)) and a critic (value \(V(s)\) or \(Q(s,a)\)). Write pseudocode for the update steps, using the TD error \(\delta = r + \gamma V(s') - V(s)\) as the advantage in the policy update. Explain why the critic reduces variance compared to using Monte Carlo returns \(G_t\).

Concept and real-world RL

Actor-critic methods maintain two networks: the actor selects actions from \(\pi(a|s;\theta)\), and the critic estimates the value function \(V(s;w)\) (or \(Q(s,a;w)\)). The TD error \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\) is a one-step estimate of the advantage; it is biased (because \(V\) is only an approximation) but has much lower variance than \(G_t\). The actor is updated along \(\nabla_\theta \log \pi(a_t|s_t;\theta)\,\delta_t\); the critic is updated to minimize the squared TD error \((r_t + \gamma V(s_{t+1}) - V(s_t))^2\). In robot control and game AI, actor-critic allows online, step-by-step updates instead of waiting for the episode to end, so learning can start before an episode finishes. ...
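The update rules above can be sketched in a few lines. This is a minimal illustration, not the chapter's implementation: it assumes a tabular softmax actor and a tabular critic instead of neural networks, and the names `actor_critic_step`, `theta`, `v`, `alpha_actor`, and `alpha_critic` are hypothetical.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, v, s, a, r, s_next, done,
                      gamma=0.99, alpha_actor=0.1, alpha_critic=0.1):
    """One online update from a single transition (s, a, r, s_next).

    theta[s][a] holds the actor's action preferences (assumed tabular),
    v[s] holds the critic's value estimates (assumed tabular).
    Returns the TD error delta.
    """
    # TD error: delta = r + gamma * V(s') - V(s), with V(s') = 0 at a terminal state.
    target = r + (0.0 if done else gamma * v[s_next])
    delta = target - v[s]

    # Action probabilities under the current policy, computed before any update.
    probs = softmax(theta[s])

    # Critic: semi-gradient step on the squared TD error (target held constant).
    v[s] += alpha_critic * delta

    # Actor: for a softmax policy, d/d theta[s][b] of log pi(a|s) = 1[b == a] - pi(b|s).
    for b, p in enumerate(probs):
        grad_log_pi = (1.0 if b == a else 0.0) - p
        theta[s][b] += alpha_actor * delta * grad_log_pi

    return delta


# Usage: two states, two actions, one transition with reward 1.
theta = [[0.0, 0.0], [0.0, 0.0]]
v = [0.0, 0.0]
delta = actor_critic_step(theta, v, s=0, a=1, r=1.0, s_next=1, done=False)
```

A positive \(\delta_t\) raises the preference for the action that was taken and lowers the others; a negative \(\delta_t\) does the opposite. Because each call needs only one transition, the update can run after every environment step.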

March 10, 2026 · 3 min · 577 words · codefrydev

Chapter 36: Advantage Actor-Critic (A2C)

Learning objectives

Implement A2C (Advantage Actor-Critic): the actor is updated with the TD error as the advantage, and the critic is updated to minimize the squared TD error. Use the TD error \(r + \gamma V(s') - V(s)\) as the advantage (optionally calling .detach() on \(V(s')\) so that no gradient flows through the bootstrap target). Run multiple environments synchronously to collect a batch of transitions and update on the batch, which reduces variance further.

Concept and real-world RL

A2C is the synchronous version of A3C: the agent runs \(N\) environments in parallel, collects a batch of transitions, and performs one update from the batch. The advantage is the TD error (or the n-step return minus \(V(s)\)). Synchronous batching tends to make the updates more stable than in fully asynchronous A3C. In game AI and robot control, A2C is a simple and effective baseline; it is often used with a shared feature extractor (one backbone with separate actor and critic heads), which saves parameters and can improve learning. ...
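The synchronous batch update described above can be sketched as follows. This is an illustration under stated assumptions, not the chapter's code: it uses tabular parameters rather than networks with a shared backbone, takes one transition per parallel environment, and the names `a2c_batch_update`, `theta`, `v`, `lr_actor`, and `lr_critic` are hypothetical. Holding the TD target constant here plays the role that detaching \(V(s')\) plays in an autograd framework.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def a2c_batch_update(theta, v, batch, gamma=0.99, lr_actor=0.1, lr_critic=0.1):
    """One synchronous update from a batch of transitions.

    batch is a list of (s, a, r, s_next, done) tuples, e.g. one per parallel
    environment. theta[s][a] are actor preferences and v[s] are critic values
    (both assumed tabular). Returns the list of advantages (TD errors).
    """
    n = len(batch)
    d_theta = [[0.0] * len(row) for row in theta]
    d_v = [0.0] * len(v)
    advantages = []

    # Accumulate averaged gradients using the parameters as they were
    # before the update, so every transition sees the same policy and critic.
    for s, a, r, s_next, done in batch:
        # TD target is treated as a constant (no gradient through V(s')).
        target = r + (0.0 if done else gamma * v[s_next])
        adv = target - v[s]
        advantages.append(adv)

        # Critic: descent direction for half the mean squared TD error.
        d_v[s] += adv / n

        # Actor: advantage-weighted gradient of log pi(a|s) for a softmax policy.
        probs = softmax(theta[s])
        for b, p in enumerate(probs):
            d_theta[s][b] += adv * ((1.0 if b == a else 0.0) - p) / n

    # Apply a single update for the whole batch.
    for s in range(len(v)):
        v[s] += lr_critic * d_v[s]
        for b in range(len(theta[s])):
            theta[s][b] += lr_actor * d_theta[s][b]

    return advantages


# Usage: two environments each contribute one transition.
theta = [[0.0, 0.0], [0.0, 0.0]]
v = [0.0, 0.0]
batch = [(0, 1, 1.0, 1, False), (1, 0, 0.0, 0, True)]
advantages = a2c_batch_update(theta, v, batch)
```

Averaging over the batch is what distinguishes this from a per-step update: each parameter change reflects several transitions at once, so a single noisy TD error has less influence.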

March 10, 2026 · 3 min · 566 words · codefrydev