Chapter 36: Advantage Actor-Critic (A2C)

Learning objectives
- Implement A2C (Advantage Actor-Critic): the actor is updated with the TD error as the advantage; the critic is updated to minimize the TD error.
- Use the TD error \(r + \gamma V(s') - V(s)\) as the advantage (optionally detaching \(V(s')\)).
- Run multiple environments synchronously to collect a batch of transitions and update on the batch, which further reduces variance.

Concept and real-world RL A2C is the synchronous version of A3C: the agent runs \(N\) environments in parallel, collects a batch of transitions, and performs one update from the batch. The advantage is the TD error (or an n-step return minus \(V(s)\)). Synchronous batching makes the updates more stable than fully asynchronous A3C. In game AI and robot control, A2C is a simple and effective baseline; it is often used with a shared feature extractor (one backbone with actor and critic heads) to save parameters and improve learning. ...
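The TD-error advantage above can be sketched in a few lines. This is a minimal NumPy illustration, not the chapter's code: the batch values, discount factor, and log-probabilities are made up, and the losses are shown as scalars rather than wired into an optimizer.

```python
import numpy as np

def td_advantages(rewards, values, next_values, dones, gamma=0.99):
    # TD error: delta = r + gamma * V(s') * (1 - done) - V(s)
    # (the (1 - done) factor zeroes the bootstrap term at episode ends)
    return rewards + gamma * next_values * (1.0 - dones) - values

# one batch of transitions from N = 4 synchronous environments (illustrative numbers)
rewards     = np.array([1.0, 0.0, 0.0, 1.0])
values      = np.array([0.5, 0.2, 0.4, 0.9])   # V(s) from the critic
next_values = np.array([0.6, 0.3, 0.0, 0.0])   # V(s'), treated as detached
dones       = np.array([0.0, 0.0, 1.0, 1.0])   # episodes 3 and 4 terminated

adv = td_advantages(rewards, values, next_values, dones)

# actor loss: -log pi(a|s) weighted by the advantage (no gradient flows through adv);
# critic loss: squared TD error, i.e. regress V(s) toward r + gamma * V(s')
log_probs   = np.log(np.array([0.6, 0.5, 0.7, 0.8]))  # pi(a|s) for the taken actions
actor_loss  = -(log_probs * adv).mean()
critic_loss = (adv ** 2).mean()
```

In a real implementation the two losses share one backward pass (often with an entropy bonus), and `adv` is detached exactly as the learning objective suggests so the actor update does not push gradients into the critic.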

March 10, 2026 · 3 min · 566 words · codefrydev

Chapter 63: Curiosity-Driven Exploration (ICM)

Learning objectives
- Implement the Intrinsic Curiosity Module (ICM): a forward model that predicts next-state features from the current state and action.
- Use the prediction error (between predicted and actual next-state features) as an intrinsic reward, and combine it with A2C.
- Explain why prediction error encourages exploration of novel or stochastic parts of the state space.
- Compare exploration behavior (e.g. coverage, time to goal) with and without ICM on a sparse-reward maze.
- Relate curiosity-driven exploration to robot navigation and game AI, where rewards are sparse.

Concept and real-world RL ...
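The intrinsic-reward idea can be sketched without the full module. Below is a hedged NumPy illustration, not the chapter's implementation: the forward model is a single random linear map `W`, the feature vectors are random stand-ins for an encoder's output, and `eta` is an assumed reward scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def intrinsic_reward(phi_s, action_onehot, phi_s_next, W, eta=1.0):
    # forward model: predict phi(s') from [phi(s); one-hot action]
    pred = np.concatenate([phi_s, action_onehot]) @ W
    # intrinsic reward = scaled squared prediction error;
    # large in states the model has not learned to predict yet
    return 0.5 * eta * np.sum((pred - phi_s_next) ** 2)

feat_dim, n_actions = 8, 4
W = 0.1 * rng.normal(size=(feat_dim + n_actions, feat_dim))  # toy forward model
phi_s      = rng.normal(size=feat_dim)   # encoder features of s (stand-in)
phi_s_next = rng.normal(size=feat_dim)   # encoder features of s'
action     = np.eye(n_actions)[2]        # one-hot action

r_int = intrinsic_reward(phi_s, action, phi_s_next, W)
# the agent then optimizes r_total = r_ext + beta * r_int with A2C,
# while the forward model is trained to shrink the same prediction error
```

As the forward model improves on familiar transitions, their prediction error (and hence intrinsic reward) decays, so the agent is pushed toward transitions it cannot yet predict.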

March 10, 2026 · 3 min · 624 words · codefrydev