Volume 5 Recap Quiz (5 questions)
Q1. What does PPO's clipped surrogate objective do, and why is it useful?
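PPO forms the probability ratio r_t(θ) = π(a_t|s_t;θ) / π(a_t|s_t;θ_old) and clips it to [1−ε, 1+ε] inside the objective: L_CLIP = E[min(r_t A_t, clip(r_t, 1−ε, 1+ε) A_t)]. Once the ratio leaves that interval in the direction that would increase the objective, the gradient vanishes, so the policy has no incentive to move far from the old policy in a single update. This gives a trust-region-like effect without solving a constrained optimization problem (unlike TRPO). It's simpler and cheaper per update, and works well in practice; ε = 0.2 is a common default.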
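As a minimal sketch of the clipped objective above, the function below evaluates L_CLIP on a small batch with NumPy. The function name `ppo_clip_objective` and the toy numbers are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective (to be maximised), averaged over a batch."""
    ratio = np.exp(logp_new - logp_old)                  # r_t(theta)
    unclipped = ratio * advantages                       # r_t * A_t
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Elementwise min makes the objective a pessimistic bound on the unclipped one.
    return np.mean(np.minimum(unclipped, clipped))

# Toy batch with made-up numbers: ratios are 1.8, 1.0 and 0.2.
logp_old = np.log(np.array([0.5, 0.5, 0.5]))
logp_new = np.log(np.array([0.9, 0.5, 0.1]))
adv = np.array([1.0, 1.0, -1.0])

value = ppo_clip_objective(logp_new, logp_old, adv)
print(value)
```

In the first sample the ratio 1.8 is clipped to 1.2, so pushing the probability of that action even higher earns nothing extra. In the third sample the clipped term (0.8 × −1) is the smaller one, so the objective is flat there as well, because the action is already far less likely than under the old policy.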
Q2. What is Generalized Advantage Estimation (GAE) and what does λ control?
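GAE estimates the advantage as Â_t^{GAE(γ,λ)} = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}, where δ_t = r_t + γV(s_{t+1}) − V(s_t) is the TD error (here r_t is the reward, not the probability ratio from Q1). λ interpolates between two extremes: λ=0 gives the pure 1-step TD advantage (low variance, high bias from the value estimate); λ=1 gives the Monte Carlo advantage (high variance, low bias). Values like λ=0.95 give a good bias-variance tradeoff in practice.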
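The sum above is usually computed with a backward recursion, A_t = δ_t + γλ A_{t+1}. The sketch below assumes a single trajectory with no episode termination in the middle, and a `values` array that includes one extra bootstrap entry for the final state; `compute_gae` and the sample numbers are illustrative.

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE for one trajectory. `values` has length T + 1 (bootstrap value last)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        running = delta + gamma * lam * running                  # A_t = delta_t + gamma*lam*A_{t+1}
        adv[t] = running
    return adv

rewards = np.array([1.0, 1.0, 1.0])
values = np.array([0.5, 0.5, 0.5, 0.0])   # made-up value estimates

adv_td = compute_gae(rewards, values, gamma=0.9, lam=0.0)   # 1-step TD errors
adv_mc = compute_gae(rewards, values, gamma=0.9, lam=1.0)   # discounted return minus V(s_t)
print(adv_td, adv_mc)
```

With λ=0 each advantage is just the local TD error; with λ=1 the recursion telescopes into the discounted return minus the value baseline.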
Q3. What is the maximum entropy objective in SAC?
SAC maximises J(π) = E[Σ_t (r_t + α · H(π(·|s_t)))], where H is the entropy of the policy and α is a temperature parameter that sets how much entropy is worth relative to reward. The entropy bonus encourages exploration and prevents premature convergence to deterministic policies. SAC is off-policy (it reuses past transitions from a replay buffer), which typically makes it more sample-efficient than PPO, while the entropy regularization makes it more robust.
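To see what the temperature α does, consider a simplified one-step problem with a few discrete actions (an assumption made for illustration; SAC itself is usually applied to continuous actions). Maximising E_π[Q] + α·H(π) over distributions has the closed-form solution π ∝ exp(Q/α). The helper names below are hypothetical.

```python
import numpy as np

def soft_optimal_policy(q_values, alpha):
    """Maximiser of E_pi[Q] + alpha * H(pi): a softmax over Q / alpha."""
    z = q_values / alpha
    z = z - z.max()                 # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    p = p[p > 0]                    # treat 0 * log(0) as 0
    return -np.sum(p * np.log(p))

def soft_objective(p, q_values, alpha):
    return np.dot(p, q_values) + alpha * entropy(p)

q = np.array([1.0, 1.5, 2.0])       # made-up action values
greedy = np.array([0.0, 0.0, 1.0])  # deterministic argmax policy

for alpha in (0.01, 0.5, 100.0):
    pi = soft_optimal_policy(q, alpha)
    print(alpha, pi.round(3), entropy(pi))
```

Small α recovers a nearly greedy policy, while large α pushes the policy toward uniform. For any α > 0 the softmax policy scores at least as high on the entropy-augmented objective as the greedy one.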
Q4. How does TRPO differ from PPO in enforcing the trust region?
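TRPO solves a constrained optimization: maximize E[r_t A_t] subject to E[KL(π_old || π_new)] ≤ δ. Each update needs the natural gradient direction F⁻¹g, where F is the Fisher information matrix and g is the policy gradient; TRPO obtains it with conjugate gradient on Fisher-vector products (so F is never formed explicitly), followed by a backtracking line search. That second-order machinery is expensive and awkward to implement. PPO approximates the same idea with first-order optimization and the simple clip trick. In practice PPO performs comparably to TRPO on standard benchmarks while being far simpler to implement.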
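The expensive part of TRPO can be sketched as follows: conjugate gradient solves F x = g using only matrix-vector products, and the step is then scaled so that the quadratic approximation of the KL divergence equals δ. The small matrix `F` below is a made-up stand-in for the Fisher matrix; in a real implementation the product would be computed by automatic differentiation, and a line search would follow.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Solve F x = g given only a function fvp(v) = F @ v (F symmetric positive definite)."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x (x starts at zero)
    p = g.copy()          # search direction
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

F = np.array([[2.0, 0.5], [0.5, 1.0]])   # hypothetical Fisher matrix (toy example)
g = np.array([1.0, 1.0])                 # hypothetical policy gradient
delta = 0.01                             # KL budget

x = conjugate_gradient(lambda v: F @ v, g)      # natural gradient direction
step = np.sqrt(2.0 * delta / (x @ (F @ x))) * x # largest step with 0.5 * s^T F s = delta
print(step, 0.5 * step @ F @ step)
```

PPO skips all of this: it needs only the ordinary gradient of the clipped objective.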
Q5. When would you choose SAC over PPO?
SAC is preferable when: (1) the environment is expensive to simulate (SAC is more sample-efficient via replay); (2) the action space is continuous, which is the setting standard SAC is formulated for; (3) you want robustness to hyperparameters (entropy auto-tuning). PPO is preferable when: (1) the environment is fast to simulate (parallelism compensates for sample inefficiency); (2) the action space is discrete (PPO handles both discrete and continuous actions); (3) you want simplicity and proven stability across diverse tasks.
What Changes in Volume 6
| | Volume 5 (Model-Free) | Volume 6 (Model-Based) |
|---|---|---|
| Environment model | Black-box: just sample (s,a,r,s') | Learn ŝ' = f(s,a) and r̂ = r(s,a) |
| Data efficiency | Moderate to low | High — plan using the model |
| Planning | None | MCTS, Dyna-Q, shooting methods |
| Risk | Only real experience used | Model error compounds (hallucinations) |
| Best for | Fast simulators, complex rewards | Real-world, expensive interactions |
The big insight: If you know (or can learn) how the world works, you can plan ahead rather than only react. MCTS uses a model to search the game tree (AlphaGo). Dyna-Q uses a model to generate synthetic transitions. But learned models are imperfect — compounding errors over long rollouts is the central challenge.
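The compounding-error problem can be illustrated with a toy scalar system. The dynamics and the 2% model error below are made-up assumptions chosen only to show the effect: a model that is almost right for one step drifts far from reality when it is rolled out on its own predictions.

```python
import numpy as np

def rollout(step_fn, s0, horizon):
    """Apply step_fn repeatedly, feeding each prediction back in as the next input."""
    states = [s0]
    for _ in range(horizon):
        states.append(step_fn(states[-1]))
    return np.array(states)

def true_step(s):
    return 1.00 * s        # real environment (toy linear dynamics)

def model_step(s):
    return 1.02 * s        # hypothetical learned model with 2% one-step error

real = rollout(true_step, 1.0, 50)
imagined = rollout(model_step, 1.0, 50)
error = np.abs(imagined - real)

for h in (1, 10, 50):
    print(h, error[h])
```

The error after 50 imagined steps is much larger than 50 times the one-step error, which is one reason model-based methods often keep imagined rollouts short.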
Bridge Exercise: When Would You Use a Model?
Next: Volume 6: Model-Based RL