Volume 4 Recap Quiz (5 questions)

Q1. Write the REINFORCE gradient estimator. What is its key weakness?

∇_θ J(θ) ≈ (1/N) Σ_τ Σ_t ∇_θ log π(a_t|s_t; θ) · G_t

where G_t is the Monte Carlo return from step t. Key weakness: high variance. G_t accumulates reward noise over the entire episode, making gradient estimates noisy — learning is slow and unstable without a baseline.
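The estimator above can be sketched in NumPy for a softmax policy on a toy 3-armed bandit (the arm reward means, noise scale, and sample count are all illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)                      # softmax logits, one per arm
true_means = np.array([0.1, 0.5, 0.9])   # illustrative bandit reward means

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Monte Carlo estimate of the policy gradient over N sampled pulls.
# For a one-step bandit, the return G_t is just the immediate reward.
N = 1000
grad = np.zeros_like(theta)
for _ in range(N):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)
    G = rng.normal(true_means[a], 1.0)   # noisy reward = return
    score = -pi                          # ∇_θ log π(a|θ) for a softmax policy
    score[a] += 1.0
    grad += score * G
grad /= N
print(grad)  # points toward raising the logit of the high-reward arm
```

Running this repeatedly shows the weakness directly: the estimates scatter widely around the true gradient because the reward noise multiplies the score function unchecked.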

Q2. What is the advantage function A(s,a), and how does it reduce variance?
A(s,a) = Q(s,a) − V(s). It measures how much better action a is compared to the average action in state s. Using A instead of raw returns centers the signal around zero — good actions get positive gradient, bad actions get negative gradient, and the common “tide” (V(s)) is subtracted out. This reduces variance without introducing bias: any baseline that depends only on the state leaves the gradient estimator unbiased, and V(s) is a near-optimal choice of baseline.

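The zero-centering is easy to verify numerically: if V(s) is the policy-weighted average of Q(s,·), the expected advantage under the policy is exactly zero (the Q-values below are made up for illustration):

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])   # policy over 3 actions in state s
Q = np.array([1.0, 2.0, 4.0])    # illustrative action values Q(s, a)
V = pi @ Q                       # V(s) = E_{a~π}[Q(s, a)]
A = Q - V                        # advantage A(s, a) = Q(s, a) - V(s)

print(A)        # positive for better-than-average actions, negative otherwise
print(pi @ A)   # expected advantage under π is exactly 0
```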
Q3. In actor-critic, what do the actor and critic each do?
  • Actor: the policy π(a|s; θ) — selects actions and is updated via policy gradient.
  • Critic: estimates V(s; w) or Q(s,a; w) — provides the advantage/baseline signal to reduce variance in the actor’s gradient.

The critic uses TD learning (bootstrapping), so the actor no longer needs to wait for full episode returns. This enables online (step-by-step) updates.
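A single online actor-critic step, with a tabular critic and a softmax actor, might look like the sketch below (the transition model, reward rule, and learning rates are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))   # actor: softmax logits per state
V = np.zeros(n_states)                    # critic: tabular state values
gamma, alpha_actor, alpha_critic = 0.99, 0.1, 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(s):
    """One online actor-critic update from state s (toy dynamics)."""
    pi = softmax(theta[s])
    a = rng.choice(n_actions, p=pi)
    r = float(a == 1)                  # illustrative: action 1 earns reward 1
    s_next = rng.integers(n_states)    # illustrative: random next state
    # Critic: TD error bootstraps from V(s') -- no full-episode return needed.
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha_critic * delta
    # Actor: policy gradient, using the TD error as the advantage signal.
    score = -pi
    score[a] += 1.0
    theta[s] += alpha_actor * delta * score
    return s_next

s = 0
for _ in range(2000):
    s = step(s)
print(softmax(theta[0]))  # probability of the rewarded action has grown
```

Note that every update uses only (s, a, r, s′) — this is exactly the step-by-step learning the text describes.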

Q4. How does DDPG extend actor-critic to continuous actions?
DDPG (Deep Deterministic Policy Gradient) uses a deterministic policy μ(s; θ) that outputs a single action (not a distribution). The gradient becomes: ∇_θ J ≈ ∇_a Q(s,a)|_{a=μ(s)} · ∇_θ μ(s; θ). This avoids sampling from a distribution. It also uses experience replay and target networks (from DQN) to stabilize training. TD3 extends this with twin critics and delayed policy updates.
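The deterministic chain rule can be illustrated with a linear actor μ(s; θ) = θ·s and a hand-written quadratic critic Q(s,a) = −(a − 2s)², both invented for this sketch so that ∇_a Q is available in closed form:

```python
import numpy as np

theta = 0.0                       # parameter of the deterministic actor

def mu(s, th):                    # actor: μ(s; θ) = θ * s
    return th * s

def dQ_da(s, a):                  # critic gradient ∇_a Q for Q(s,a) = -(a - 2s)^2
    return -2.0 * (a - 2.0 * s)

# Deterministic policy gradient ascent:
#   ∇_θ J = ∇_a Q|_{a=μ(s)} · ∇_θ μ(s; θ),  with ∇_θ μ = s for this linear actor.
states = np.array([0.5, 1.0, 1.5, 2.0])
for _ in range(200):
    a = mu(states, theta)
    grad = np.mean(dQ_da(states, a) * states)
    theta += 0.05 * grad

print(theta)  # converges near 2.0, the θ that maximizes Q at every state
```

No action is ever sampled: the gradient flows from the critic straight through the actor, which is what makes the approach workable in continuous action spaces.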
Q5. What is the main practical problem with vanilla REINFORCE/actor-critic?
Large policy updates destabilize training. A bad gradient step can move the policy far into a poor region — and because the new policy collects different data, recovery is slow or impossible. This motivates Volume 5: PPO/TRPO explicitly constrain how much the policy can change per update, giving much more stable training at scale.

What Changes in Volume 5

| Aspect | Volume 4 (Basic Policy Gradient) | Volume 5 (Stable Policy Optimization) |
| --- | --- | --- |
| Update constraint | None — step size chosen by hand | PPO: ratio clip; TRPO: KL constraint |
| Variance reduction | Baseline / advantage | GAE (λ-weighted advantage) |
| Off-policy support | Limited (DDPG/TD3) | SAC: maximum entropy, off-policy |
| Sample efficiency | Low (on-policy, discard after update) | Moderate (PPO epochs; SAC replay) |
| Entropy | Not explicit | SAC maximises entropy for exploration |

The big insight: Controlling the size of policy updates via clipping (PPO) or trust-region constraints (TRPO) makes training dramatically more stable. GAE smoothly interpolates between TD estimates (low variance, biased) and Monte Carlo returns (high variance, unbiased).
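That interpolation is easy to see in code. The recursion below is the standard GAE formulation (A_t = δ_t + γλ·A_{t+1}, with δ_t = r_t + γV_{t+1} − V_t), shown on a made-up trajectory: λ = 0 reduces it to the one-step TD error, λ = 1 to the Monte Carlo return minus the value estimate:

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """λ-weighted advantages: A_t = δ_t + γλ A_{t+1}, δ_t = r_t + γ V_{t+1} - V_t."""
    adv = np.zeros(len(rewards))
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv

rewards = np.array([1.0, 0.0, 1.0, 1.0])   # illustrative trajectory
values  = np.array([0.5, 0.4, 0.6, 0.7])   # critic estimates V(s_t)
print(gae(rewards, values, last_value=0.0, lam=0.0))  # pure TD errors δ_t
print(gae(rewards, values, last_value=0.0, lam=1.0))  # MC return minus V(s_t)
```

Intermediate λ values (0.95 is a common default) trade a little bias for a large variance reduction.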


Bridge Exercise: REINFORCE Variance on a Bandit

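A minimal version of the exercise, assuming a 2-armed Gaussian bandit and a fixed uniform softmax policy (all numbers below are illustrative choices, not prescribed by the exercise): estimate the REINFORCE gradient many times with and without a value baseline and compare the empirical variances.

```python
import numpy as np

rng = np.random.default_rng(42)
means = np.array([0.0, 1.0])   # illustrative arm reward means
pi = np.array([0.5, 0.5])      # fixed softmax policy (uniform logits)
baseline = pi @ means          # V = expected reward under π

def grad_estimate(batch, use_baseline):
    """One REINFORCE gradient estimate from `batch` bandit pulls."""
    g = np.zeros(2)
    for _ in range(batch):
        a = rng.choice(2, p=pi)
        r = rng.normal(means[a], 1.0)
        if use_baseline:
            r -= baseline          # subtract V: still unbiased, lower variance
        score = -pi                # ∇_θ log π(a|θ) for a softmax policy
        score = score.copy()
        score[a] += 1.0
        g += score * r
    return g / batch

plain = np.array([grad_estimate(10, False) for _ in range(2000)])
based = np.array([grad_estimate(10, True) for _ in range(2000)])
print("mean (no baseline):  ", plain.mean(axis=0))  # the two means agree
print("mean (with baseline):", based.mean(axis=0))  # (baseline adds no bias)
print("var  (no baseline):  ", plain.var(axis=0).sum())
print("var  (with baseline):", based.var(axis=0).sum())
```

Things to try: increase the gap between the arm means (the baseline helps more), or shift both means up by a constant (without a baseline the variance explodes; with one it is unchanged).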

Next: Volume 5: PPO, TRPO & SAC