Chapter 35: Actor-Critic Architectures

Learning objectives: Sketch the architecture of a two-network actor-critic: an actor (policy \(\pi(a|s)\)) and a critic (value \(V(s)\) or \(Q(s,a)\)). Write pseudocode for the update steps using the TD error \(\delta = r + \gamma V(s') - V(s)\) as the advantage for the policy. Explain why the critic reduces variance compared to using Monte Carlo returns \(G_t\). Concept and real-world RL: Actor-critic methods maintain two networks: the actor selects actions from \(\pi(a|s;\theta)\), and the critic estimates the value function \(V(s;w)\) (or \(Q(s,a;w)\)). The TD error \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\) is a one-step estimate of the advantage; it is biased (because \(V\) is approximate) but has much lower variance than \(G_t\). The actor is updated along \(\nabla_\theta \log \pi(a_t|s_t)\,\delta_t\); the critic is updated to minimize \((r_t + \gamma V(s_{t+1}) - V(s_t))^2\). In robot control and game AI, actor-critic allows online, step-by-step updates instead of waiting for the end of an episode, which speeds up learning. ...
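The update steps above can be sketched with linear function approximation. This is a minimal illustration, not the chapter's implementation: the feature vectors, learning rates, and the single random transition are hypothetical stand-ins for a real environment and neural networks.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 2
theta = np.zeros((n_features, n_actions))  # actor parameters
w = np.zeros(n_features)                   # critic parameters
alpha_actor, alpha_critic, gamma = 0.01, 0.1, 0.99

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def actor_critic_step(phi_s, phi_s_next, r, done):
    """One online actor-critic update using the TD error as the advantage."""
    global theta, w
    probs = softmax(phi_s @ theta)
    a = rng.choice(n_actions, p=probs)
    # TD error: delta = r + gamma * V(s') - V(s)
    v_s = w @ phi_s
    v_next = 0.0 if done else w @ phi_s_next
    delta = r + gamma * v_next - v_s
    # Critic: semi-gradient step on the squared TD error
    w += alpha_critic * delta * phi_s
    # Actor: policy gradient, grad log pi(a|s) for a linear-softmax policy
    grad_log_pi = np.outer(phi_s, -probs)
    grad_log_pi[:, a] += phi_s
    theta += alpha_actor * delta * grad_log_pi
    return delta

# One illustrative update on random state features
phi, phi_next = rng.normal(size=n_features), rng.normal(size=n_features)
delta = actor_critic_step(phi, phi_next, r=1.0, done=False)
```

Because both parameter vectors start at zero, the first TD error equals the reward; after one call, both actor and critic have moved, which is exactly the "online, step-by-step" property noted above.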

March 10, 2026 · 3 min · 577 words · codefrydev

Chapter 37: Asynchronous Advantage Actor-Critic (A3C)

Learning objectives: Implement A3C: multiple worker processes, each running its own environment and asynchronously updating a global shared network. Understand the trade-off: A3C can be faster on multi-core CPUs (no synchronization wait) but is often less stable than A2C due to asynchronous gradient updates. Compare training speed (wall clock and/or sample efficiency) of A3C vs A2C on CartPole. Concept and real-world RL: A3C (Asynchronous Advantage Actor-Critic) runs multiple workers in parallel, each collecting experience and pushing gradient updates to a global network. Workers do not wait for each other, so gradients are asynchronous and potentially stale. In game AI and early deep RL, A3C was popular for leveraging many CPU cores; in practice, A2C (synchronous) or PPO often gives more stable and reproducible results. The idea of parallel environments and shared parameters remains central; the main difference is synchronous (A2C) vs asynchronous (A3C) updates. ...
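The asynchronous-update pattern can be sketched in a few lines. This is a toy sketch under stated assumptions: the quadratic "loss" and random targets stand in for a real actor-critic loss on rollouts, and Python threads stand in for worker processes; the point is only that each worker reads a possibly stale snapshot of the shared parameters and applies its gradient without waiting.

```python
import threading
import numpy as np

global_params = np.zeros(4)  # shared "global network" parameters
lr = 0.1

def worker(seed, n_updates=50):
    rng = np.random.default_rng(seed)
    for _ in range(n_updates):
        # Snapshot of the (possibly stale) global parameters
        local = global_params.copy()
        target = rng.normal(size=4)
        grad = local - target  # gradient of 0.5 * ||local - target||^2
        # Asynchronous, Hogwild-style update: no lock, no waiting
        global_params[:] = global_params - lr * grad

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

An A2C variant of the same loop would instead gather all workers' gradients at a barrier, average them, and apply a single synchronous update, which is exactly the sync-vs-async distinction drawn above.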

March 10, 2026 · 3 min · 556 words · codefrydev

Chapter 57: Dreamer and Latent Imagination

Learning objectives: Implement a simplified Dreamer-style algorithm: train an RSSM-like model on collected trajectories, then roll out in latent space to train an actor-critic. Understand the imagination phase: no real environment steps; only latent rollouts for policy updates. Relate to robot control and sample-efficient RL. Concept and real-world RL: Dreamer learns a recurrent state-space model (RSSM) in latent space: encode the observation to a latent, predict the next latent given an action, and predict the reward and a continuation (discount) flag. The actor-critic is trained on imagined rollouts (latent only), so many gradient steps use no real environment interaction. In robot navigation and game AI, this yields high sample efficiency. The key is training the model and the policy on the same data so the latent space is useful for control. ...
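The imagination phase can be sketched as a rollout of the learned dynamics with no environment calls at all. In this sketch the "learned" model is a fixed linear system and the actor is a fixed function, both hypothetical stand-ins; a real Dreamer uses a trained RSSM and optimizes the actor-critic on these imagined trajectories.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, action_dim, horizon, gamma = 8, 2, 15, 0.99

# Stand-ins for learned model components (fixed random matrices here)
A = 0.9 * np.eye(latent_dim)                               # latent transition
B = rng.normal(scale=0.1, size=(latent_dim, action_dim))   # action influence
reward_head = rng.normal(size=latent_dim)                  # linear reward predictor

def policy(z):
    """Hypothetical actor: a fixed bounded policy in latent space."""
    return np.tanh(z[:action_dim])

def imagine(z0):
    """Roll out entirely in latent space; no real env steps are taken."""
    z, ret = z0, 0.0
    for t in range(horizon):
        a = policy(z)
        z = A @ z + B @ a                         # predicted next latent
        ret += (gamma ** t) * (reward_head @ z)   # predicted (discounted) reward
    return ret, z

imagined_return, z_final = imagine(rng.normal(size=latent_dim))
```

The imagined discounted return is what the actor would be trained to increase and the critic to predict; every gradient step over such rollouts is "free" in terms of real environment samples.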

March 10, 2026 · 3 min · 464 words · codefrydev

Function Approximation and Deep RL

This page covers the function approximation and deep RL concepts you need for the preliminary assessment: why we need function approximation (FA), the policy gradient update, exploration in DQN, experience replay, and the advantage of actor-critic. Why this matters for RL: In large or continuous state spaces we cannot store a value per state; we use a parameterized function (e.g. a neural network) to approximate values or policies. That leads to policy gradient methods (maximize return) and value-based methods with FA (e.g. DQN). DQN uses experience replay and exploration (e.g. ε-greedy); actor-critic combines a policy (actor) and a value function (critic) for lower-variance policy gradients. You need to understand why FA is necessary and how these pieces fit together. ...
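Two of the DQN ingredients named above fit in a few lines. This is a minimal sketch, not a full DQN: the Q-values are a hypothetical list rather than a neural network's output, and the buffer stores plain tuples.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience replay: old transitions are evicted."""
    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random minibatch breaks the temporal correlation of rollouts
        return random.sample(self.buffer, batch_size)

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon explore uniformly, otherwise exploit argmax Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

buf = ReplayBuffer()
for i in range(100):
    buf.push(i, i % 2, 1.0, i + 1, False)  # hypothetical transitions
batch = buf.sample(32)
action = epsilon_greedy([0.1, 0.5, 0.2], epsilon=0.0)  # greedy: picks index 1
```

Sampling uniformly from the buffer rather than training on consecutive steps is what decorrelates DQN's updates; ε-greedy supplies the exploration the page lists alongside it.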

March 10, 2026 · 7 min · 1400 words · codefrydev