Chapter 31: Introduction to Policy-Based Methods

Learning objectives

- Explain when a stochastic policy (outputting a distribution over actions) is essential versus when a deterministic policy suffices.
- Give a real-world scenario where a deterministic policy would fail (e.g. games with hidden information, adversarial settings).
- Relate stochastic policies to exploration, and to game AI or recommendation settings where diversity matters.

Concept and real-world RL

Policy-based methods directly parameterize and optimize the policy \(\pi(a|s;\theta)\) instead of learning a value function and deriving actions from it. A stochastic policy outputs a probability distribution over actions; a deterministic policy always picks the same action in a given state. In game AI, when the opponent can observe or anticipate your move (e.g. poker, rock-paper-scissors), a deterministic policy is exploitable: the opponent always knows what you will do. A stochastic policy keeps the opponent uncertain and is essential for mixed strategies. In recommendation, showing a deterministic "best" item every time can create filter bubbles; stochastic policies (or sampling from a distribution) encourage exploration and diversity. For robot navigation in partially observable or noisy settings, randomness can help escape local minima and handle uncertainty. ...
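The exploitability point can be sketched in a toy rock-paper-scissors loop. This is illustrative code, not from the chapter; the `exploit` opponent (which counters our previous move) and both policies are hypothetical:

```python
import random

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def deterministic_policy(state):
    return "rock"  # always the same action in a given state

def stochastic_policy(state):
    return random.choice(ACTIONS)  # uniform mixed strategy

def exploit(opponent_last_action):
    # Opponent plays whatever beats our previous move.
    for a in ACTIONS:
        if BEATS[a] == opponent_last_action:
            return a

def play(policy, rounds=10000):
    wins = 0
    last = policy(None)
    for _ in range(rounds):
        my_action = policy(None)
        opp_action = exploit(last)  # opponent anticipates our last move
        if BEATS[my_action] == opp_action:
            wins += 1
        last = my_action
    return wins / rounds
```

Against this exploiting opponent the deterministic policy never wins a single round, while the uniform mixed strategy still wins about one third of the time.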

March 10, 2026 · 3 min · 547 words · codefrydev

Chapter 32: The Policy Objective Function

Learning objectives

- Write the policy gradient theorem for a simple one-step MDP: the gradient of expected reward with respect to the policy parameters.
- Show that \(\nabla_\theta \mathbb{E}[R] = \mathbb{E}[\nabla_\theta \log \pi(a|s;\theta) \, Q^\pi(s,a)]\) (or the equivalent for one step).
- Recognize why this form is useful: we can estimate the expectation from samples (trajectories) without knowing the transition model.

Concept and real-world RL

In policy gradient methods we maximize the expected return \(J(\theta) = \mathbb{E}_\pi[G]\) by gradient ascent on \(\theta\). The policy gradient theorem says that \(\nabla_\theta J\) can be written as an expectation over states and actions under \(\pi\), involving \(\nabla_\theta \log \pi(a|s;\theta)\) and the return (or Q). For a one-step MDP (one state, one action, one reward), the derivation is simple: \(J = \sum_a \pi(a|s) \, r(s,a)\), so \(\nabla_\theta J = \sum_a \nabla_\theta \pi(a|s) \, r(s,a)\). Using the log-derivative trick \(\nabla \pi = \pi \nabla \log \pi\), we get \(\nabla_\theta J = \mathbb{E}[\nabla_\theta \log \pi(a|s) \, Q(s,a)]\). In robot control or game AI, we rarely have the full model; this identity lets us estimate the gradient from sampled actions and rewards alone. ...
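The one-step identity can be checked numerically. A minimal sketch, assuming a softmax policy over three actions with a made-up reward vector: the Monte Carlo estimate of \(\mathbb{E}[\nabla_\theta \log \pi(a) \, r(a)]\) should match the analytic gradient \(\pi(k)(r(k) - J)\).

```python
import math
import random

rewards = [1.0, 2.0, 0.5]   # illustrative one-step rewards r(s, a)
theta = [0.0, 0.0, 0.0]     # one logit per action (softmax policy)

def probs(theta):
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [p / s for p in z]

def analytic_grad(theta):
    # grad_k J = sum_a r(a) grad_k pi(a) = pi(k) (r(k) - J) for a softmax.
    p = probs(theta)
    J = sum(pi * r for pi, r in zip(p, rewards))
    return [p[k] * (rewards[k] - J) for k in range(3)]

def sampled_grad(theta, n=100000):
    # Monte Carlo estimate of E[ grad log pi(a) * r(a) ],
    # using grad_k log pi(a) = 1{a = k} - pi(k) for a softmax policy.
    p = probs(theta)
    g = [0.0, 0.0, 0.0]
    for _ in range(n):
        a = random.choices(range(3), weights=p)[0]
        for k in range(3):
            g[k] += ((1.0 if a == k else 0.0) - p[k]) * rewards[a]
    return [x / n for x in g]
```

The sampled estimate converges to the analytic gradient as `n` grows, without ever using the reward table inside the estimator beyond the sampled outcomes.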

March 10, 2026 · 3 min · 585 words · codefrydev

Chapter 41: The Problem with Standard Policy Gradients

Learning objectives

- Demonstrate how a too-large step size in policy gradient updates can cause policy collapse (e.g. one action gets probability near 1 too quickly) and loss of exploration.
- Visualize policy probabilities over time in a simple bandit problem under different learning rates.
- Relate this to the motivation for trust-region and clipped methods (e.g. PPO, TRPO).

Concept and real-world RL

The standard policy gradient update \(\theta \leftarrow \theta + \alpha \nabla_\theta J\) can be unstable: a single bad batch or a large step can make the policy assign near-zero probability to previously good actions (policy collapse). In a multi-armed bandit (or a simple MDP) this is easy to see: with a large \(\alpha\), the policy can become deterministic too fast and get stuck. In robot control and game AI we want to avoid catastrophic updates; PPO (clipped objective) and TRPO (KL constraint) limit how much the policy can change per update. This chapter illustrates the problem in a minimal setting. ...
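A minimal sketch of the experiment, under assumed settings (two Gaussian-reward arms, REINFORCE on softmax logits; the reward means and `steps` are illustrative): running `run` with a small versus a large `alpha` and plotting the returned probabilities shows the collapse.

```python
import math
import random

random.seed(0)

def run(alpha, steps=200):
    # Two-armed bandit: arm 1 is slightly better on average, but noisy.
    theta = [0.0, 0.0]  # softmax logits
    for _ in range(steps):
        z = [math.exp(t) for t in theta]
        p = [x / sum(z) for x in z]
        a = random.choices([0, 1], weights=p)[0]
        r = random.gauss(0.0 if a == 0 else 0.2, 1.0)
        # REINFORCE update: theta_k += alpha * grad_k log pi(a) * r,
        # with grad_k log pi(a) = 1{a = k} - p_k for a softmax policy.
        for k in range(2):
            theta[k] += alpha * ((1.0 if a == k else 0.0) - p[k]) * r
    z = [math.exp(t) for t in theta]
    return [x / sum(z) for x in z]
```

With a large `alpha` (say 5.0), a few lucky or unlucky rewards swing the logits so far that one arm's probability saturates near 1 and exploration effectively stops; with a small `alpha` the distribution moves gradually toward the better arm.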

March 10, 2026 · 3 min · 563 words · codefrydev

Chapter 77: Generative Adversarial Imitation Learning (GAIL)

Learning objectives

- Implement GAIL: train a discriminator D(s, a) to distinguish state-action pairs drawn from the expert versus those from the current policy, and use the discriminator output (or log D) as the reward for a policy gradient method.
- Train the policy to maximize the discriminator reward (i.e. to fool the discriminator) while the discriminator tries to tell expert from agent.
- Test on a simple task (e.g. CartPole or MuJoCo) and compare imitation quality with behavioral cloning.
- Explain the connection to GANs: the policy is the generator, and the discriminator provides the learning signal.
- Relate GAIL to robot navigation and game AI, where we have expert demonstrations and want to match the expert distribution without hand-designed rewards.

Concept and real-world RL ...
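The reward-shaping step can be sketched in isolation. This is not the full GAIL training loop, just the surrogate reward: assuming a discriminator whose output D(s, a) is pushed toward 1 on expert pairs and toward 0 on agent pairs, one common choice rewards the policy for pairs the discriminator mistakes for expert behavior.

```python
import math

def sigmoid(x):
    # Squash a discriminator logit into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def gail_reward(d_out, eps=1e-8):
    # Surrogate reward from the discriminator output D(s, a):
    # high when D is near 1 (the pair "looks expert"), low when D is near 0.
    return -math.log(1.0 - d_out + eps)
```

This reward is then fed to an ordinary policy gradient method in place of an environment reward, which is exactly the generator/discriminator split the GAN analogy describes.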

March 10, 2026 · 4 min · 704 words · codefrydev

Function Approximation and Deep RL

This page covers the function approximation and deep RL concepts you need for the preliminary assessment: why we need function approximation, the policy gradient update, exploration in DQN, experience replay, and the advantage of actor-critic.

Why this matters for RL

In large or continuous state spaces we cannot store a value per state; we use a parameterized function (e.g. a neural network) to approximate values or policies. This leads to policy gradient methods (maximize return) and value-based methods with function approximation (e.g. DQN). DQN uses experience replay and exploration (e.g. ε-greedy); actor-critic combines a policy (the actor) and a value function (the critic) for lower-variance policy gradients. You need to understand why function approximation is necessary and how these pieces fit together. ...
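Of the pieces listed above, experience replay is the easiest to make concrete. A minimal sketch of the kind of buffer DQN uses (the class and its capacity are illustrative, not from the page): store transitions as they arrive, evict the oldest when full, and sample decorrelated minibatches for updates.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        # deque with maxlen drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random minibatch breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly from past experience is what lets DQN reuse data and stabilize updates compared with learning from each transition once, in order.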

March 10, 2026 · 7 min · 1400 words · codefrydev

Phase 4 Deep RL Quiz

Use this quiz after completing Volumes 3–5 (or the Phase 4 coding challenges). If you can answer at least 9 of 12 correctly, you are ready for Phase 5 and Volume 6.

1. Function approximation

Q: Why is function approximation necessary in RL for large or continuous state spaces?

A: Tabular methods store one value per state (or state-action pair), and the number of states can be huge or infinite. Function approximation uses a parameterized function (e.g. a neural network) so that a fixed number of parameters represents values for all states and generalizes from seen to unseen states. ...
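The quiz answer on function approximation can be made concrete with a tiny sketch: a linear value function with three parameters covers an unbounded (real-valued) state space. The polynomial features here are hand-picked for illustration only.

```python
def features(state):
    # Fixed-size feature vector phi(s) for any real-valued state.
    x = float(state)
    return [1.0, x, x * x]

def value(theta, state):
    # Linear function approximation: v(s) = theta . phi(s).
    # Three parameters assign a value to every state, seen or unseen.
    return sum(t * f for t, f in zip(theta, features(state)))
```

Any state, including ones never visited during training, gets a value from the same three parameters; that generalization is exactly what a tabular method cannot provide.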

March 10, 2026 · 4 min · 814 words · codefrydev