Chapter 37: Asynchronous Advantage Actor-Critic (A3C)
Learning objectives

- Implement A3C: multiple worker processes, each running its own environment and asynchronously updating a global shared network.
- Understand the trade-off: A3C can be faster on multi-core CPUs (workers never wait to synchronize), but it is often less stable than A2C because asynchronous gradient updates can be stale.
- Compare the training speed (wall-clock time and/or sample efficiency) of A3C vs. A2C on CartPole.

Concept and real-world RL

A3C (Asynchronous Advantage Actor-Critic) runs multiple workers in parallel, each collecting experience and pushing gradient updates to a global network. Workers do not wait for each other, so gradients are asynchronous and may be computed from stale parameters. In game AI and early deep RL, A3C was popular for exploiting many CPU cores; in practice, A2C (synchronous) or PPO often gives more stable and reproducible results. The idea of parallel environments sharing parameters remains central; the main difference is synchronous (A2C) versus asynchronous (A3C) updates. ...