Chapter 38: Continuous Action Spaces

Learning objectives

- Design a policy network for continuous actions that outputs the mean and log-standard deviation of a Gaussian (or similar) distribution.
- Sample actions from the distribution and compute the log-probability \(\log \pi(a|s)\) for use in policy gradient updates.
- Apply this to an environment with continuous actions (e.g. Pendulum-v1).

Concept and real-world RL

For continuous actions (e.g. torque, throttle), we cannot use a softmax over a finite set. Instead we use a continuous distribution, often a Gaussian: \(\pi(a|s) = \mathcal{N}(a; \mu(s), \sigma(s)^2)\). The policy network outputs \(\mu(s)\) and \(\log \sigma(s)\) (the log-std, for numerical stability); we sample \(a = \mu + \sigma \cdot z\) with \(z \sim \mathcal{N}(0,1)\). The log-probability is \(\log \pi(a|s) = -\frac{1}{2}\left(\log(2\pi) + 2\log\sigma + \frac{(a-\mu)^2}{\sigma^2}\right)\). In robot control (e.g. Pendulum, MuJoCo), actions are continuous; the same Gaussian policy is used in REINFORCE, actor-critic, and PPO for continuous control. ...
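The reparameterized sampling step and the log-probability formula above can be sketched in NumPy (function name and array shapes are illustrative, not from the chapter; per-dimension log-probs are summed, treating action dimensions as independent):

```python
import numpy as np

def gaussian_policy_sample(mu, log_std, rng):
    """Sample a = mu + sigma * z with z ~ N(0, 1) and return (a, log pi(a|s))."""
    sigma = np.exp(log_std)  # log-std parameterization keeps sigma positive
    z = rng.standard_normal(mu.shape)
    a = mu + sigma * z
    # log pi(a|s) = -1/2 * (log(2*pi) + 2*log(sigma) + (a - mu)^2 / sigma^2), per dim
    log_prob = -0.5 * (np.log(2 * np.pi) + 2 * log_std + ((a - mu) / sigma) ** 2)
    return a, log_prob.sum()  # sum over independent action dimensions
```

In a policy gradient update, `log_prob` would be computed with autodiff (e.g. `torch.distributions.Normal(mu, sigma).log_prob(a)`) so gradients flow into the network producing `mu` and `log_std`.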

March 10, 2026 · 3 min · 533 words · codefrydev

Chapter 39: Deep Deterministic Policy Gradient (DDPG)

Learning objectives

- Implement DDPG: a deterministic policy (actor) plus a Q-function (critic), with target networks and a replay buffer.
- Use Ornstein-Uhlenbeck (OU) noise (or Gaussian noise) on the action for exploration in continuous spaces.
- Train on Pendulum-v1 (or similar) and plot the learning curve.

Concept and real-world RL

DDPG is an actor-critic method for continuous actions: the actor outputs a single action \(\mu(s)\) (no distribution), and the critic learns \(Q(s,a)\). The policy is updated to maximize \(Q(s, \mu(s))\), with the gradient flowing through Q into the actor. Target networks and a replay buffer stabilize learning (as in DQN). Exploration comes from adding noise to the action (e.g. OU noise for temporally correlated exploration, or simple Gaussian noise). In robot control (Pendulum, MuJoCo), DDPG is a baseline for continuous tasks; TD3 and SAC improve on it with clipped double Q-learning and stochastic policies, respectively. ...
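Two small pieces of DDPG can be sketched concretely: the OU exploration noise and the soft (Polyak) target-network update. This is a minimal NumPy sketch with commonly used hyperparameter defaults (theta, sigma, dt, tau are illustrative assumptions, not values fixed by the chapter):

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise added to actions."""
    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(dim)
        self.rng = np.random.default_rng(seed)

    def reset(self):
        """Reset the process state at the start of each episode."""
        self.x[:] = 0.0

    def sample(self):
        # dx = -theta * x * dt + sigma * sqrt(dt) * dW  (mean-reverting toward 0)
        dx = (-self.theta * self.x * self.dt
              + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape))
        self.x = self.x + dx
        return self.x.copy()

def soft_update(online, target, tau=0.005):
    """Polyak averaging for target networks: target <- tau * online + (1 - tau) * target."""
    return tau * online + (1 - tau) * target
```

During rollout the executed action would be `np.clip(actor(s) + noise.sample(), low, high)`; after each gradient step, `soft_update` is applied to both the target actor and target critic parameters.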

March 10, 2026 · 3 min · 524 words · codefrydev