Policy Gradient
Why function approximation is needed, the policy gradient update, exploration in DQN, experience replay, and actor-critic — with explanations.
10–12 questions on DQN, policy gradients, PPO, experience replay, and target networks. Solutions included.
When a stochastic policy is essential, and why a deterministic one fails.
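The classic illustration is an adversarial game such as rock-paper-scissors: every deterministic policy can be exploited by a best-responding opponent, while the uniform stochastic policy cannot. A minimal sketch of that comparison (the payoff matrix and worst-case evaluation are assumptions for illustration):

```python
import numpy as np

# Payoff matrix for us (the row player) in rock-paper-scissors:
# rows = our action, cols = opponent action (0=rock, 1=paper, 2=scissors).
PAYOFF = np.array([
    [ 0, -1,  1],
    [ 1,  0, -1],
    [-1,  1,  0],
], dtype=float)

def worst_case_value(policy):
    """Expected payoff of our policy against a best-responding opponent."""
    expected_per_opponent_action = policy @ PAYOFF  # shape (3,)
    return expected_per_opponent_action.min()

# Every deterministic policy is exploitable: worst case is -1.
for a in range(3):
    print(a, worst_case_value(np.eye(3)[a]))  # -1.0 for each action

# The uniform stochastic policy cannot be exploited: worst case is 0.
print(worst_case_value(np.ones(3) / 3))  # 0.0
```

Because policy gradient methods parameterize a distribution over actions, they can represent and learn such stochastic optima directly, which greedy value-based policies cannot.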
Derive the policy gradient theorem for a one-step MDP.
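As a sketch of what the exercise asks for: in a one-step MDP with a fixed start state $s$, action $a \sim \pi_\theta(a \mid s)$, and reward $r(s, a)$, the derivation runs as follows.

$$
J(\theta) = \mathbb{E}_{a \sim \pi_\theta}\!\left[r(s, a)\right] = \sum_a \pi_\theta(a \mid s)\, r(s, a)
$$

$$
\nabla_\theta J(\theta)
= \sum_a \nabla_\theta \pi_\theta(a \mid s)\, r(s, a)
= \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, r(s, a)
= \mathbb{E}_{a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, r(s, a)\right],
$$

where the middle step uses the log-derivative trick $\nabla_\theta \pi_\theta = \pi_\theta \, \nabla_\theta \log \pi_\theta$. The full policy gradient theorem extends this one-step identity to multi-step trajectories.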
How a large step size causes policy collapse in a bandit problem; visualize the action probabilities.
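A minimal sketch of the effect, assuming a two-armed Gaussian bandit and a REINFORCE update on softmax preferences (the arm means, step sizes, and horizon are illustrative choices, not from the exercise):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def run(step_size, n_steps=2000, seed=0):
    """REINFORCE on a 2-armed bandit; arm 1 is better on average."""
    rng = np.random.default_rng(seed)
    means = np.array([0.0, 1.0])  # true mean rewards (assumed setup)
    theta = np.zeros(2)           # softmax preferences
    for _ in range(n_steps):
        p = softmax(theta)
        a = rng.choice(2, p=p)
        r = means[a] + rng.normal()  # noisy reward sample
        grad = -p
        grad[a] += 1.0               # grad of log pi(a) for a softmax policy
        theta += step_size * r * grad
    return softmax(theta)

print(run(0.1))   # small step: gradually concentrates on the better arm
print(run(50.0))  # huge step: one noisy sample can saturate the softmax
```

With the huge step size, a single noisy reward pushes the preferences so far apart that the softmax saturates and the gradient vanishes: the policy collapses onto one arm, sometimes the wrong one, and stops learning.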
Train a discriminator to distinguish expert behavior from agent behavior; use its output as the reward signal for policy gradient.
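A minimal sketch of the reward-from-discriminator idea, assuming synthetic 1-D features and a logistic-regression discriminator for illustration (GAIL applies the same idea with neural networks):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D "state-action" features: expert samples cluster near +2,
# agent samples near -2 (a synthetic setup assumed for illustration).
expert = rng.normal(2.0, 0.5, size=(200, 1))
agent = rng.normal(-2.0, 0.5, size=(200, 1))

X = np.vstack([expert, agent])
X = np.hstack([X, np.ones((len(X), 1))])           # add a bias column
y = np.concatenate([np.ones(200), np.zeros(200)])  # label 1 = expert

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Train the discriminator by gradient ascent on the log-likelihood.
w = np.zeros(2)
for _ in range(500):
    p = sigmoid(X @ w)
    w += 0.1 * X.T @ (y - p) / len(y)

def reward(features):
    """Discriminator-based reward: high where D believes 'expert'."""
    d = sigmoid(np.array([features, 1.0]) @ w)
    return -np.log(1.0 - d + 1e-8)

print(reward(2.0), reward(-2.0))  # expert-like samples earn higher reward
```

The policy gradient learner then maximizes this reward, so it is pushed toward behavior the discriminator cannot tell apart from the expert's; in practice the discriminator and policy are retrained in alternation.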
Review Volume 3 (DQN and variants) and preview Volume 4 (Policy Gradients): the transition from value-based to policy-based methods.