Chapter 36: Advantage Actor-Critic (A2C)
Learning objectives
- Implement A2C (Advantage Actor-Critic): the actor is updated with the TD error as the advantage, and the critic is updated to minimize the TD error.
- Use the TD error \(r + \gamma V(s') - V(s)\) as the advantage (detaching \(V(s')\), e.g. with \(V(s').detach()\), so the bootstrap target receives no gradient).
- Run multiple environments synchronously to collect a batch of transitions and update on the whole batch, which further reduces variance.

Concept and real-world RL
A2C is the synchronous version of A3C: the agent runs \(N\) environments in parallel, collects a batch of transitions, and performs one update from the batch. The advantage is the TD error (or an n-step return minus \(V(s)\)). Synchronous batching makes the updates more stable than fully asynchronous A3C. In game AI and robot control, A2C is a simple and effective baseline; it is often used with a shared feature extractor (one backbone with separate actor and critic heads) to save parameters and improve learning. ...
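The loop above can be sketched end to end. The following is a minimal, dependency-light illustration (NumPy only, tabular actor and critic rather than a neural network) on a hypothetical 3-state chain MDP invented for this example: \(N\) environments step synchronously, one batch of transitions is collected, and the TD error serves both as the critic's regression error and as the actor's advantage.

```python
import numpy as np

# Hypothetical toy MDP (an assumption for illustration): states 0..2,
# action 1 moves right, action 0 moves left; reaching state 2 yields
# reward 1 and the episode resets to state 0.
N_STATES, N_ACTIONS, GAMMA = 3, 2, 0.9

def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    done = s2 == N_STATES - 1
    return (0 if done else s2), r, done

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = np.zeros((N_STATES, N_ACTIONS))   # tabular actor: pi(a|s) = softmax(logits[s])
V = np.zeros(N_STATES)                     # tabular critic: V(s)
alpha_pi, alpha_v, n_envs = 0.1, 0.1, 4
states = np.zeros(n_envs, dtype=int)       # N environments run synchronously

for _ in range(2000):
    # Collect one transition from each environment: a synchronous batch.
    batch = []
    for i in range(n_envs):
        s = states[i]
        a = rng.choice(N_ACTIONS, p=softmax(logits[s]))
        s2, r, done = step(s, a)
        batch.append((s, a, r, s2, done))
        states[i] = s2
    # One update from the batch; the TD error is the advantage.
    for s, a, r, s2, done in batch:
        td = r + (0.0 if done else GAMMA * V[s2]) - V[s]
        V[s] += alpha_v * td                          # critic: reduce the TD error
        grad = -softmax(logits[s])                    # gradient of log pi(a|s)
        grad[a] += 1.0                                # w.r.t. the logits of state s
        logits[s] += alpha_pi * td * grad             # actor: TD error as advantage

print(softmax(logits)[:, 1])   # probability of "right" per state
```

In a neural-network A2C the two tables become actor and critic heads on a shared backbone, and the per-transition updates become one gradient step on the batch; the structure of the loop is unchanged.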