Chapter 36: Advantage Actor-Critic (A2C)

Learning objectives

Implement A2C (Advantage Actor-Critic): actor updated with TD error as advantage, critic updated to minimize TD error.
Use the TD error \(r + \gamma V(s’) - V(s)\) as the advantage (optionally with \(V(s’).detach()\)).
Run multiple environments synchronously to collect a batch of transitions and update on the batch (reduces variance further).

Concept and real-world RL

A2C is the synchronous version of A3C: the agent runs \(N\) environments in parallel, collects a batch of transitions, and performs one update from the batch. The advantage is the TD error (or n-step return minus V(s)). Synchronous batching makes the updates more stable than fully asynchronous A3C. In game AI and robot control, A2C is a simple and effective baseline; it is often used with a shared feature extractor (one backbone, actor and critic heads) to save parameters and improve learning.

Where you see this in practice: A2C is used in OpenAI Baselines and many tutorials. The same pattern (multi-env, shared backbone, TD advantage) appears in PPO implementations.

Illustration (A2C learning curve): With 4 parallel envs, mean return over the batch typically improves over updates. The chart below shows mean episode return every 10 updates.

Exercise: Implement A2C for CartPole. Use the TD error \(r + \gamma V(s’) - V(s)\) as the advantage. Use a shared feature extractor or separate networks. Synchronously run multiple environments to collect data.

Professor’s hints

Multi-env: use gym.vector.VectorEnv or a list of envs; step all at once and stack states (batch dimension). Policy and V take batched input; sample one action per env.
Shared backbone: one MLP from state to hidden features; actor head (hidden → logits) and critic head (hidden → scalar). Forward once per state batch.
Update: collect a batch of (s, a, r, s’, done). Compute \(V(s), V(s’)\) for all; \(\delta = r + \gamma (1-\mathrm{done}) V(s’) - V(s)\) (use V(s').detach()). Actor loss = mean over batch of \(-\log \pi(a|s) , \delta\); critic loss = mean of \(\delta^2\). Backward and step.
Hyperparameters: try 4–8 envs, batch of 32–128 steps, learning rate 1e-3 to 3e-4.

Common pitfalls

Not detaching V(s’) in δ for actor: If you backprop \(\delta\) through \(V(s’)\), the actor gradient will try to change the critic’s prediction for \(s’\), which is not the intended objective.
Batch dimension mismatch: Ensure state batch has shape (batch_size, state_dim); action and \(\delta\) have shape (batch_size,) or (batch_size, 1) for gathering log-probs.

Worked solution (warm-up: A2C vs REINFORCE)

Warm-up: REINFORCE uses the full return \(G_t\) to weight the policy gradient; A2C uses the TD error \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\) (or n-step) as the advantage. \(G_t\) has higher variance (sum of many random rewards) but is unbiased; \(\delta_t\) has lower variance but is biased because \(V\) is approximate. A2C typically learns faster and more stably because of the lower variance.

Extra practice

Warm-up: What is the difference between A2C and REINFORCE in terms of what signal is used to update the policy (return \(G_t\) vs TD error \(\delta\))? Which has higher variance?
Coding: Implement A2C with 4 parallel CartPole envs. Plot mean episode return (over the 4 envs) every 10 updates. Compare training time (wall clock) to single-env REINFORCE for the same number of steps.
Challenge: Use n-step returns (e.g. n=5): collect 5 steps per env, compute \(G_{t:t+5} = \sum_{i=0}^{4} \gamma^i r_{t+i} + \gamma^5 V(s_{t+5})\), and use \(G_{t:t+5} - V(s_t)\) as advantage. Compare stability and sample efficiency to 1-step TD.
Variant: Try 1, 4, and 8 parallel environments in A2C. Does the number of envs affect the final performance or just wall-clock speed? Plot learning curves for all three settings.
Debug: The code below forgets to detach \(V(s’)\) when computing the TD advantage for the actor loss, causing gradients to flow into the critic through the actor update. Fix it.

Try it — edit and run (Shift+Enter)

Conceptual: Why does synchronous batching (A2C) typically lead to more stable training than asynchronous updates (A3C)? What is the main downside of synchronous updates?
Recall: Describe the A2C update loop in pseudocode: collect batch, compute TD advantage, update actor and critic in one sentence each.