Volume 3 Recap Quiz (5 questions)

Q1. What two techniques does DQN add to basic Q-learning to stabilize training?
  1. Experience replay: store transitions in a buffer, sample random mini-batches (breaks temporal correlation).
  2. Target network: a separate copy of Q, updated less frequently — prevents the target from shifting every step.
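The two stabilizers above can be sketched in a few lines of plain Python. The `ReplayBuffer` and `sync_target` names, and the dict-based parameter copy, are our own illustrations, not any specific library's API:

```python
import random
from collections import deque

class ReplayBuffer:
    """Store transitions; sample random mini-batches to break temporal correlation."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling: consecutive transitions rarely end up together.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

def sync_target(online_params, target_params):
    """Copy online parameters into the (otherwise frozen) target network.

    Called only every K steps, so the bootstrap target stays fixed in between.
    """
    target_params.clear()
    target_params.update(online_params)

buffer = ReplayBuffer(capacity=10_000)
for t in range(100):
    buffer.push(t, t % 4, 1.0, t + 1, False)  # dummy transitions
batch = buffer.sample(32)
```

With real networks, `sync_target` would copy weight tensors instead of dict entries, but the schedule (copy every K steps, freeze in between) is the whole idea.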
Q2. What is the 'deadly triad'?
The combination of function approximation + bootstrapping + off-policy learning. Each alone is fine; together they can cause divergence in Q-learning with neural networks.
Q3. What does Dueling DQN decompose, and why?
Q(s,a) = V(s) + A(s,a). V(s) captures the value of the state regardless of action; A(s,a) captures the relative value of each action (in practice the mean advantage is subtracted from A, since otherwise the split into V and A is not unique). This helps when many actions have similar Q-values: the network can learn V accurately even when action effects are small.
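The combination step can be sketched in plain Python. As in the Dueling DQN paper, the mean advantage is subtracted before summing so the V/A split is identifiable (the helper name and the numbers are illustrative):

```python
def dueling_q(v, advantages):
    """Combine the V stream and A stream into Q-values.

    Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a')
    Without the mean subtraction, adding a constant to V and subtracting it
    from every A(s,a) would give the same Q, so V and A would be unidentifiable.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [v + a - mean_adv for a in advantages]

q = dueling_q(v=2.0, advantages=[0.5, -0.5, 0.0])
# After the subtraction, the advantages average to zero, so the mean of the
# Q-values equals V(s): the state's value is carried entirely by the V stream.
```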
Q4. Why does Double DQN reduce overestimation?
Standard DQN uses the same network to select and evaluate the action: max_{a'} Q(s',a'; θ⁻). This overestimates Q because the max of noisy estimates is biased upward. Double DQN separates these: use the online network to SELECT the action (argmax), use the target network to EVALUATE it. This decorrelates selection and evaluation.
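The two targets can be compared side by side for a single transition, with toy Q-value lists standing in for the online and target networks (all numbers are made up for illustration):

```python
GAMMA = 0.99

def dqn_target(reward, q_target_next):
    # Standard DQN: the target net both selects and evaluates,
    # so the max over its noisy estimates is biased upward.
    return reward + GAMMA * max(q_target_next)

def double_dqn_target(reward, q_online_next, q_target_next):
    # Double DQN: the online net SELECTS the action,
    # the target net EVALUATES that same action.
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + GAMMA * q_target_next[a_star]

q_online = [1.0, 3.0, 2.0]   # online net prefers action 1
q_target = [1.5, 2.0, 2.5]   # target net's independent evaluation
print(dqn_target(0.0, q_target))                    # γ · max = ~2.475
print(double_dqn_target(0.0, q_online, q_target))   # γ · Q_target[a*] = ~1.98
```

Because the evaluated value is no longer the max of the evaluating network's own estimates, the Double DQN target here is lower, which is exactly the overestimation reduction the answer describes.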
Q5. What is the main limitation of DQN for continuous action spaces?
DQN requires computing max_a Q(s,a) over all actions (at every step). For continuous actions (e.g. joint torques), this max is intractable — infinite actions. Volume 4 introduces policy gradient methods that can directly output continuous actions.
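One way to see why discretizing the actions doesn't rescue DQN: the discrete action set grows exponentially with the number of action dimensions. A quick sketch (the 7-joint arm and the bin counts are illustrative numbers, not from a benchmark):

```python
def n_discrete_actions(n_dims, bins_per_dim):
    """Size of the action grid if each continuous dimension is split into bins."""
    return bins_per_dim ** n_dims

# A 7-joint arm, with each joint torque discretized:
for bins in (3, 10, 100):
    print(bins, n_discrete_actions(7, bins))
# Even a coarse 10-way split per joint yields 10^7 discrete actions,
# and DQN must take an argmax over all of them at every single step.
```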

What Changes in Volume 4

|                       | Volume 3 (Value-based)      | Volume 4 (Policy-based)     |
|-----------------------|-----------------------------|-----------------------------|
| What is parameterized | Q(s,a; θ)                   | π(a\|s; θ) directly         |
| Action space          | Discrete (DQN takes argmax) | Discrete or continuous      |
| Update signal         | TD error (supervised-like)  | Policy gradient (REINFORCE) |
| On/off-policy         | Off-policy (replay)         | Mostly on-policy            |
| Sample efficiency     | Better (replay buffer)      | Worse (needs fresh data)    |

The big insight: Instead of learning Q and deriving the policy, directly parameterise π and optimise it. This enables continuous control (robotics) and naturally stochastic policies (games with hidden info).
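As a minimal sketch of "parameterise π directly", here is a linear softmax policy over discrete actions in plain Python. The θ matrix and state features are toy values; the point is that the parameterized object is a distribution we sample from, not a Q-table we argmax over:

```python
import math
import random

def softmax_policy(theta, state_features):
    """π(a|s; θ): one logit per action from a linear score, then softmax."""
    logits = [sum(w * f for w, f in zip(row, state_features)) for row in theta]
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]           # a proper probability distribution

theta = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.0]]  # 3 actions, 2 state features
probs = softmax_policy(theta, [1.0, 2.0])
action = random.choices(range(3), weights=probs)[0]  # sample, don't argmax
```

A Gaussian head over continuous torques works the same way: the network outputs distribution parameters (mean, std) and the action is a sample, which is what makes continuous control and stochastic play possible.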


Bridge Exercise: Why DQN Fails on Continuous Actions

Try it — edit and run (Shift+Enter)
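One possible starter cell for this exercise (our sketch, not an official solution): a toy continuous-action Q for a single fixed state, where the inner max_a Q(s,a) that DQN needs can only be approximated by brute-force sampling.

```python
import random

random.seed(0)  # make the cell reproducible; change or remove and re-run

def q(a):
    """Toy Q(s, a) for one fixed state: peaked at a = 0.7."""
    return -(a - 0.7) ** 2

# Over a continuous action range there is nothing to enumerate, so the best
# we can do inside the Bellman target is search over sampled candidates:
candidates = [random.uniform(-1, 1) for _ in range(1000)]
a_best = max(candidates, key=q)
print(a_best, q(a_best))  # close to 0.7, but only by brute-force sampling,
# and this inner search would have to run at every single training step.
```

Try shrinking the number of candidates or raising the action dimensionality: the approximation degrades or the cost explodes, which is exactly the gap policy gradient methods close.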

Next: Volume 4: Policy Gradients