Volume 4: Policy Gradients
Chapters 31–40 — Policy-based methods, REINFORCE, variance reduction, actor-critic, A2C, A3C, continuous actions, DDPG, TD3.
When a stochastic policy is essential, and why a deterministic policy fails.
Derive the policy gradient theorem for a one-step MDP.
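The one-step derivation rests on the log-derivative trick; a sketch of the chain of equalities for a finite action set:

```latex
\nabla_\theta J(\theta)
  = \nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}\bigl[r(a)\bigr]
  = \sum_a r(a)\, \nabla_\theta \pi_\theta(a)
  = \sum_a r(a)\, \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)
  = \mathbb{E}_{a \sim \pi_\theta}\bigl[r(a)\, \nabla_\theta \log \pi_\theta(a)\bigr]
```

The middle step uses \(\nabla_\theta \pi_\theta(a) = \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)\), which turns the gradient of an expectation back into an expectation we can sample.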
REINFORCE on CartPole with a softmax policy; note the high gradient variance.
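A minimal sketch of the REINFORCE update with a softmax policy, shown on a two-armed bandit instead of CartPole (so it runs without gym); all names and constants here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Toy environment standing in for CartPole: arm 1 pays 1.0, arm 0 pays 0.0.
def pull(arm):
    return 1.0 if arm == 1 else 0.0

theta = np.zeros(2)   # policy parameters: one logit per action
alpha = 0.1           # learning rate (illustrative)

for _ in range(2000):
    p = softmax(theta)
    a = rng.choice(2, p=p)
    r = pull(a)
    # grad log pi(a) for a softmax policy is one_hot(a) - p
    grad_log = -p
    grad_log[a] += 1.0
    theta += alpha * r * grad_log   # REINFORCE: return times grad log pi

probs = softmax(theta)  # the policy should now strongly prefer arm 1
```

The same score-function update applies per time step on CartPole, with the sampled episode return in place of `r`.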
Add a state-value baseline to REINFORCE; compare gradient variance with and without it.
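A quick numerical check of the baseline's effect, on a toy bandit with a fixed policy (a sketch, not the chapter's code): subtracting a baseline leaves the gradient estimator's mean unchanged but shrinks its variance.

```python
import numpy as np

rng = np.random.default_rng(1)

p = np.array([0.5, 0.5])           # fixed softmax policy over two arms
rewards = np.array([10.0, 11.0])   # both arms pay well: lots of common offset

def grad_log(a):
    g = -p                         # grad log pi(a) = one_hot(a) - p
    g = g.copy()
    g[a] += 1.0
    return g

n = 20000
acts = rng.choice(2, size=n, p=p)
baseline = rewards.mean()          # value baseline (a constant here)

g_plain = np.array([rewards[a] * grad_log(a) for a in acts])
g_base = np.array([(rewards[a] - baseline) * grad_log(a) for a in acts])

# Same mean (the baseline term has zero expectation), far smaller variance.
mean_plain, mean_base = g_plain[:, 0].mean(), g_base[:, 0].mean()
var_plain, var_base = g_plain[:, 0].var(), g_base[:, 0].var()
```

The baseline works because `E[b * grad log pi] = b * grad sum_a pi(a) = 0`.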
Sketch a two-network actor-critic; pseudocode for the TD-error updates.
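One actor-critic step can be sketched as follows, in tabular form so it runs standalone (the chapter's version uses two networks; all names here are illustrative):

```python
import numpy as np

n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))   # actor: softmax logits per state
V = np.zeros(n_states)                    # critic: state-value estimates
alpha_actor, alpha_critic, gamma = 0.1, 0.2, 0.99

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def actor_critic_step(s, a, r, s_next, done):
    # TD error computed from the critic
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]
    # critic update: move V(s) toward the TD target
    V[s] += alpha_critic * delta
    # actor update: policy gradient scaled by the TD error
    p = softmax(theta[s])
    grad_log = -p
    grad_log[a] += 1.0
    theta[s] += alpha_actor * delta * grad_log

actor_critic_step(s=0, a=1, r=1.0, s_next=1, done=False)
```

With neural networks, the same `delta` multiplies `grad log pi(a|s)` for the actor and drives a squared-error loss `delta**2` for the critic.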
A2C on CartPole with the TD error as the advantage; synchronous multi-environment rollouts.
A3C with multiprocessing workers; compare training speed against A2C.
A policy network for Pendulum: Gaussian mean and log-std heads; computing the action log-probability.
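The log-probability term the policy gradient needs can be written in closed form for a diagonal Gaussian; a small sketch (the network would output `mean` and `log_std`, here passed in directly):

```python
import numpy as np

def gaussian_log_prob(x, mean, log_std):
    """Log-density of a diagonal Gaussian policy: sum over action dims of
    -0.5*log(2*pi) - log_std - 0.5*((x - mean)/std)^2."""
    std = np.exp(log_std)
    return np.sum(
        -0.5 * np.log(2.0 * np.pi) - log_std - 0.5 * ((x - mean) / std) ** 2
    )

# Standard normal evaluated at its mean: log(1/sqrt(2*pi)) ~ -0.9189
lp = gaussian_log_prob(np.array([0.0]), np.array([0.0]), np.array([0.0]))
```

Parameterizing the spread as `log_std` rather than `std` keeps the standard deviation positive without a constraint.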
DDPG on Pendulum with Ornstein-Uhlenbeck exploration noise and target networks.
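The Ornstein-Uhlenbeck process used for exploration is a mean-reverting random walk; a self-contained sketch with commonly used (but here illustrative) constants:

```python
import numpy as np

class OUNoise:
    """Temporally correlated exploration noise for DDPG.
    Update rule: dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(dim, mu)
        self.rng = np.random.default_rng(seed)

    def sample(self):
        dx = (
            self.theta * (self.mu - self.x) * self.dt
            + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape)
        )
        self.x = self.x + dx
        return self.x

noise = OUNoise(dim=1)
samples = np.array([noise.sample() for _ in range(1000)])
```

Successive samples are correlated, which tends to produce smoother exploratory trajectories on control tasks than independent Gaussian noise; the actor's action is `clip(mu(s) + noise.sample(), low, high)`.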
TD3: clipped double Q-learning, delayed policy updates, and target policy smoothing.
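Two of the three TD3 tricks fit in a few lines; a hedged sketch where `mu_targ` stands in for the target actor's output and `q1_next`/`q2_next` for the two target critics evaluated at the smoothed action (all names illustrative):

```python
import numpy as np

def smoothed_target_action(mu_targ, noise_std=0.2, noise_clip=0.5,
                           act_limit=2.0, rng=np.random.default_rng(0)):
    # target policy smoothing: add clipped Gaussian noise, then clip to bounds
    eps = np.clip(noise_std * rng.standard_normal(mu_targ.shape),
                  -noise_clip, noise_clip)
    return np.clip(mu_targ + eps, -act_limit, act_limit)

def td3_backup(r, done, q1_next, q2_next, gamma=0.99):
    # clipped double Q: bootstrap from the minimum of the two target critics
    # to curb the overestimation a single critic accumulates
    return r + gamma * (1.0 - done) * np.minimum(q1_next, q2_next)

a_next = smoothed_target_action(np.array([0.0]))
y = td3_backup(np.array([1.0]), np.array([0.0]),
               np.array([2.0]), np.array([3.0]))
```

The third trick, delayed policy updates, is just scheduling: update the actor and targets once per several critic updates.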
Review Volume 4 (Policy Gradients, Actor-Critic, DDPG, TD3) and preview Volume 5 (PPO, TRPO, SAC — stable, scalable policy optimization).