Chapter 39: Deep Deterministic Policy Gradient (DDPG)
Learning objectives

- Implement DDPG: a deterministic policy (actor) plus a Q-function (critic), with target networks and a replay buffer.
- Use Ornstein-Uhlenbeck (OU) noise (or Gaussian noise) on the action for exploration in continuous spaces.
- Train on Pendulum-v1 (or similar) and plot the learning curve.

Concept and real-world RL

DDPG is an actor-critic method for continuous actions: the actor outputs a single action \(\mu(s)\) (no distribution), and the critic learns \(Q(s,a)\). The policy is updated to maximize \(Q(s, \mu(s))\), with the gradient flowing through the critic into the actor. Target networks and a replay buffer stabilize learning, as in DQN. Exploration comes from adding noise to the action: OU noise for temporally correlated exploration, or simple Gaussian noise. In robot control (Pendulum, MuJoCo), DDPG is a baseline for continuous tasks; TD3 and SAC improve on it with clipped double Q-learning and stochastic policies, respectively.

...
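The OU noise mentioned above is a mean-reverting random walk, so successive samples are correlated in time. A minimal NumPy sketch (the parameter values `theta=0.15`, `sigma=0.2` are common defaults, not taken from this chapter):

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process for temporally correlated exploration noise."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.mu = mu * np.ones(size)      # long-run mean the process reverts to
        self.theta = theta                # mean-reversion speed
        self.sigma = sigma                # noise scale
        self.dt = dt                      # time step
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Reset the internal state to the mean (call at episode start)."""
        self.x = self.mu.copy()

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape))
        self.x = self.x + dx
        return self.x.copy()
```

At act time the noise is simply added to the deterministic action and clipped to the action bounds, e.g. `a = np.clip(mu_s + noise.sample(), -2.0, 2.0)` for Pendulum-v1.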