Chapter 40: Twin Delayed DDPG (TD3)

Learning objectives

Implement TD3's improvements over DDPG: two critics (clipped double Q-learning), delayed policy updates (update the actor less often than the critics), and target policy smoothing (add noise to the target action). Compare performance against vanilla DDPG on a continuous control task (e.g. HalfCheetah if feasible, or Pendulum / BipedalWalker).

Concept and real-world RL

TD3 (Twin Delayed DDPG) addresses DDPG's overestimation and instability with three changes:

(1) Two Q-networks: use the minimum of the two target Q-values when computing the TD target (in the spirit of Double DQN), which reduces overestimation bias.

(2) Delayed policy updates: update the actor only every \(d\) critic updates, so the critics are more accurate before the actor is trained against them.

(3) Target policy smoothing: add small, clipped Gaussian noise to \(\mu_{target}(s')\) when computing the target, so the target value is less sensitive to the exact action.

Together these give the TD target \(y = r + \gamma \min_{i=1,2} Q_{\theta'_i}\big(s', \mu_{target}(s') + \epsilon\big)\), with \(\epsilon \sim \operatorname{clip}(\mathcal{N}(0, \sigma), -c, c)\). In robot control and simulated benchmarks (HalfCheetah, Hopper), TD3 often achieves better and more stable performance than DDPG. ...
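The three changes above can be sketched as a single TD3 update step. This is a minimal illustration in PyTorch, not the chapter's reference implementation: the tiny MLP sizes, the stand-in random batch, and the hyperparameters (gamma, tau, policy_noise, noise_clip, policy_delay) are assumptions chosen for readability, roughly following common TD3 defaults.

```python
# Minimal TD3 update sketch (illustrative, untuned): clipped double Q,
# delayed actor updates, and target policy smoothing in one function.
import torch
import torch.nn as nn

torch.manual_seed(0)
obs_dim, act_dim, max_action = 3, 1, 2.0

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

actor, actor_target = mlp(obs_dim, act_dim), mlp(obs_dim, act_dim)
actor_target.load_state_dict(actor.state_dict())

# (1) Twin critics: each maps concat(s, a) -> Q(s, a)
critics = [mlp(obs_dim + act_dim, 1) for _ in range(2)]
critic_targets = [mlp(obs_dim + act_dim, 1) for _ in range(2)]
for c, ct in zip(critics, critic_targets):
    ct.load_state_dict(c.state_dict())

critic_opt = torch.optim.Adam([p for c in critics for p in c.parameters()], lr=1e-3)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

gamma, tau, policy_delay = 0.99, 0.005, 2   # assumed defaults
policy_noise, noise_clip = 0.2, 0.5

def td3_update(step, s, a, r, s2, done):
    with torch.no_grad():
        # (3) Target policy smoothing: clipped Gaussian noise on the target action
        eps = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a2 = (max_action * torch.tanh(actor_target(s2)) + eps).clamp(-max_action, max_action)
        # (1) Clipped double Q: take the minimum of the two target critics
        q_next = torch.min(*[ct(torch.cat([s2, a2], dim=1)) for ct in critic_targets])
        target = r + gamma * (1 - done) * q_next
    critic_loss = sum(((c(torch.cat([s, a], dim=1)) - target) ** 2).mean() for c in critics)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # (2) Delayed updates: actor and targets move only every policy_delay steps
    if step % policy_delay == 0:
        pi = max_action * torch.tanh(actor(s))
        actor_loss = -critics[0](torch.cat([s, pi], dim=1)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        # Polyak averaging of all target networks
        for net, tgt in [(actor, actor_target)] + list(zip(critics, critic_targets)):
            for p, tp in zip(net.parameters(), tgt.parameters()):
                tp.data.mul_(1 - tau).add_(tau * p.data)
    return critic_loss.item()

# One update on a random batch of 32 transitions (a real agent would
# sample these from a replay buffer filled by environment interaction)
b = 32
s = torch.randn(b, obs_dim); a = torch.rand(b, act_dim) * 2 * max_action - max_action
r = torch.randn(b, 1); s2 = torch.randn(b, obs_dim); done = torch.zeros(b, 1)
loss = td3_update(step=0, s=s, a=a, r=r, s2=s2, done=done)
```

In a full agent this function would run once per environment step with batches drawn from a replay buffer; the vanilla-DDPG baseline for comparison is recovered by using one critic, `policy_delay = 1`, and `policy_noise = 0`.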

March 10, 2026 · 3 min · 555 words · codefrydev