Chapter 39: Deep Deterministic Policy Gradient (DDPG)

Learning objectives

Implement DDPG: deterministic policy (actor) plus Q-function (critic), with target networks and a replay buffer.
Use Ornstein-Uhlenbeck (OU) noise (or Gaussian noise) on the action for exploration in continuous spaces.
Train on Pendulum-v1 (or similar) and plot the learning curve.

Concept and real-world RL

DDPG is an actor-critic method for continuous actions: the actor outputs a single action \(\mu(s)\) (no distribution), and the critic learns \(Q(s,a)\). The policy is updated to maximize \(Q(s, \mu(s))\) (gradient through Q into the actor). Target networks and replay buffer stabilize learning (like DQN). Exploration comes from adding noise to the action (e.g. OU noise for temporally correlated exploration, or simple Gaussian). In robot control (Pendulum, MuJoCo), DDPG is a baseline for continuous tasks; TD3 and SAC improve on it with clipped double Q and stochastic policies.

Where you see this in practice: DDPG is used in robotics, simulated control, and trading. TD3 and SAC are often preferred for better stability and sample efficiency.

Illustration (DDPG learning curve): On Pendulum, episode return typically improves as the actor and critic learn. The chart below shows a typical curve (return every 10 episodes).

Exercise: Implement DDPG for the Pendulum-v1 environment. Use actor and critic networks with target networks and a replay buffer. Add Ornstein-Uhlenbeck noise for exploration. Plot the learning curve.

Professor’s hints

Replay buffer: store (s, a, r, s’, done). Sample mini-batches for training. Buffer size 1e5–1e6.
Critic: input (s, a), output scalar Q(s,a). Loss = MSE between Q(s,a) and target \(y = r + \gamma (1-\mathrm{done}) Q_{target}(s’, \mu_{target}(s’))\). Use target networks for \(Q_{target}\) and \(\mu_{target}\).
Actor: input s, output action \(\mu(s)\). Loss = -\(Q(s, \mu(s))\) (maximize Q). Gradient flows from Q through \(\mu(s)\) into the actor.
OU noise: \(d x = \theta (\mu - x) dt + \sigma dW\); in discrete form, \(x_{t+1} = x_t + \theta (\mu - x_t) + \sigma N(0,1)\). Add \(x\) to the action. Or use Gaussian noise for simplicity.
Update targets: soft update \(\theta’ \leftarrow \tau \theta’ + (1-\tau)\theta\) every step, or hard update every N steps.

Common pitfalls

Forgetting to clamp actions: Pendulum may expect actions in \([-2, 2]\); after adding noise, clip to the env’s action space.
No gradient through target: When computing \(y\), use \(\mu_{target}(s’)\) and \(Q_{target}(s’, a’)\) with .detach() so the target is not updated by the current batch.

Worked solution (warm-up: why DDPG needs replay and target)

Warm-up: Replay buffer: to decorrelate consecutive samples (i.i.d. batches) and reuse experience for sample efficiency, same as DQN. Target networks: to stabilize the TD target \(y = r + \gamma Q_{target}(s’, \mu_{target}(s’))\); without a target, the label would change every update and learning would be unstable. DDPG uses soft updates (\(\tau\)) to slowly track the online network.

Extra practice

Warm-up: Why does DDPG need a replay buffer and target networks? (Same reasons as DQN: decorrelate samples and stabilize targets.)
Coding: Implement DDPG for Pendulum. Plot episode return every 10 episodes. How many episodes until you reach an average return of -200 or better?
Challenge: Replace OU noise with Gaussian noise (zero mean, constant std). Compare learning speed and final performance. Then try decaying the noise std over time (exploration schedule).
Variant: Try different OU noise parameters: \(\theta=0.1\) (slow decay) vs \(\theta=0.5\) (fast decay). Does slower mean reversion improve exploration early in training?
Debug: The code below forgets to detach the target networks when computing the Bellman target \(y\), causing incorrect gradient flow into the target network parameters. Fix it.

Try it — edit and run (Shift+Enter)

Conceptual: Why does DDPG’s deterministic actor enable the policy gradient to be computed analytically as \(\nabla_{\theta^\mu} J \approx \nabla_a Q(s,a)|{a=\mu(s)} \nabla{\theta^\mu} \mu(s)\)? Why can’t we use this formula with a stochastic policy?
Recall: Describe the DDPG soft target update rule \(\theta’ \leftarrow \tau\theta + (1-\tau)\theta’\) from memory and explain what \(\tau\) controls.