Volume 4: Policy Gradients
Chapters 31–40 — Policy-based methods, REINFORCE, variance reduction, actor-critic, A2C, A3C, continuous actions, DDPG, TD3.
When a stochastic policy is essential, and why a deterministic policy fails.
Derive the policy gradient theorem for a one-step MDP.
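The one-step derivation rests on the log-derivative trick; a sketch of the chain of equalities for a finite action set:

```latex
\nabla_\theta J(\theta)
  = \nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}\bigl[r(a)\bigr]
  = \sum_a r(a)\, \nabla_\theta \pi_\theta(a)
  = \sum_a r(a)\, \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)
  = \mathbb{E}_{a \sim \pi_\theta}\bigl[r(a)\, \nabla_\theta \log \pi_\theta(a)\bigr]
```

The middle step uses \(\nabla_\theta \pi_\theta(a) = \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)\), which turns the gradient of an expectation back into an expectation we can sample.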
REINFORCE on CartPole with a softmax policy; note the high gradient variance.
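A minimal sketch of the REINFORCE update with a softmax policy, shown on a two-armed bandit instead of CartPole (so it runs without gym); all names and constants here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Toy environment standing in for CartPole: arm 1 pays 1.0, arm 0 pays 0.0.
def pull(arm):
    return 1.0 if arm == 1 else 0.0

theta = np.zeros(2)   # policy parameters: one logit per action
alpha = 0.1           # learning rate (illustrative)

for _ in range(2000):
    p = softmax(theta)
    a = rng.choice(2, p=p)
    r = pull(a)
    # grad log pi(a) for a softmax policy is one_hot(a) - p
    grad_log = -p
    grad_log[a] += 1.0
    theta += alpha * r * grad_log   # REINFORCE: return times grad log pi

probs = softmax(theta)  # the policy should now strongly prefer arm 1
```

The same score-function update applies per time step on CartPole, with the sampled episode return in place of `r`.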
Add a state-value baseline to REINFORCE; compare gradient variance with and without it.
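A quick numerical check of the baseline's effect, on a toy bandit with a fixed policy (a sketch, not the chapter's code): subtracting a baseline leaves the gradient estimator's mean unchanged but shrinks its variance.

```python
import numpy as np

rng = np.random.default_rng(1)

p = np.array([0.5, 0.5])           # fixed softmax policy over two arms
rewards = np.array([10.0, 11.0])   # both arms pay well: lots of common offset

def grad_log(a):
    g = -p                         # grad log pi(a) = one_hot(a) - p
    g = g.copy()
    g[a] += 1.0
    return g

n = 20000
acts = rng.choice(2, size=n, p=p)
baseline = rewards.mean()          # value baseline (a constant here)

g_plain = np.array([rewards[a] * grad_log(a) for a in acts])
g_base = np.array([(rewards[a] - baseline) * grad_log(a) for a in acts])

# Same mean (the baseline term has zero expectation), far smaller variance.
mean_plain, mean_base = g_plain[:, 0].mean(), g_base[:, 0].mean()
var_plain, var_base = g_plain[:, 0].var(), g_base[:, 0].var()
```

The baseline works because `E[b * grad log pi] = b * grad sum_a pi(a) = 0`.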
Sketch a two-network actor-critic; pseudocode for the TD-error updates.
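One actor-critic step can be sketched as follows, in tabular form so it runs standalone (the chapter's version uses two networks; all names here are illustrative):

```python
import numpy as np

n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))   # actor: softmax logits per state
V = np.zeros(n_states)                    # critic: state-value estimates
alpha_actor, alpha_critic, gamma = 0.1, 0.2, 0.99

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def actor_critic_step(s, a, r, s_next, done):
    # TD error computed from the critic
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]
    # critic update: move V(s) toward the TD target
    V[s] += alpha_critic * delta
    # actor update: policy gradient scaled by the TD error
    p = softmax(theta[s])
    grad_log = -p
    grad_log[a] += 1.0
    theta[s] += alpha_actor * delta * grad_log

actor_critic_step(s=0, a=1, r=1.0, s_next=1, done=False)
```

With neural networks, the same `delta` multiplies `grad log pi(a|s)` for the actor and drives a squared-error loss `delta**2` for the critic.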
A2C on CartPole with the TD error as the advantage; synchronous multi-environment rollouts.
A3C with multiprocessing workers; compare training speed against A2C.
A policy network for Pendulum: Gaussian mean and log-std heads; computing the action log-probability.
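The log-probability term the policy gradient needs can be written in closed form for a diagonal Gaussian; a small sketch (the network would output `mean` and `log_std`, here passed in directly):

```python
import numpy as np

def gaussian_log_prob(x, mean, log_std):
    """Log-density of a diagonal Gaussian policy: sum over action dims of
    -0.5*log(2*pi) - log_std - 0.5*((x - mean)/std)^2."""
    std = np.exp(log_std)
    return np.sum(
        -0.5 * np.log(2.0 * np.pi) - log_std - 0.5 * ((x - mean) / std) ** 2
    )

# Standard normal evaluated at its mean: log(1/sqrt(2*pi)) ~ -0.9189
lp = gaussian_log_prob(np.array([0.0]), np.array([0.0]), np.array([0.0]))
```

Parameterizing the spread as `log_std` rather than `std` keeps the standard deviation positive without a constraint.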
DDPG on Pendulum with Ornstein-Uhlenbeck exploration noise and target networks.
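The Ornstein-Uhlenbeck process used for exploration is a mean-reverting random walk; a self-contained sketch with commonly used (but here illustrative) constants:

```python
import numpy as np

class OUNoise:
    """Temporally correlated exploration noise for DDPG.
    Update rule: dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(dim, mu)
        self.rng = np.random.default_rng(seed)

    def sample(self):
        dx = (
            self.theta * (self.mu - self.x) * self.dt
            + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape)
        )
        self.x = self.x + dx
        return self.x

noise = OUNoise(dim=1)
samples = np.array([noise.sample() for _ in range(1000)])
```

Successive samples are correlated, which tends to produce smoother exploratory trajectories on control tasks than independent Gaussian noise; the actor's action is `clip(mu(s) + noise.sample(), low, high)`.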
TD3: clipped double Q-learning, delayed policy updates, and target policy smoothing.
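Two of the three TD3 tricks fit in a few lines; a hedged sketch where `mu_targ` stands in for the target actor's output and `q1_next`/`q2_next` for the two target critics evaluated at the smoothed action (all names illustrative):

```python
import numpy as np

def smoothed_target_action(mu_targ, noise_std=0.2, noise_clip=0.5,
                           act_limit=2.0, rng=np.random.default_rng(0)):
    # target policy smoothing: add clipped Gaussian noise, then clip to bounds
    eps = np.clip(noise_std * rng.standard_normal(mu_targ.shape),
                  -noise_clip, noise_clip)
    return np.clip(mu_targ + eps, -act_limit, act_limit)

def td3_backup(r, done, q1_next, q2_next, gamma=0.99):
    # clipped double Q: bootstrap from the minimum of the two target critics
    # to curb the overestimation a single critic accumulates
    return r + gamma * (1.0 - done) * np.minimum(q1_next, q2_next)

a_next = smoothed_target_action(np.array([0.0]))
y = td3_backup(np.array([1.0]), np.array([0.0]),
               np.array([2.0]), np.array([3.0]))
```

The third trick, delayed policy updates, is just scheduling: update the actor and targets once per several critic updates.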
Review Volume 4 (Policy Gradients, Actor-Critic, DDPG, TD3) and preview Volume 5 (PPO, TRPO, SAC — stable, scalable policy optimization).