Chapter 38: Continuous Action Spaces

Learning objectives

- Design a policy network for continuous actions that outputs the mean and log-standard deviation of a Gaussian (or similar) distribution.
- Sample actions from the distribution and compute the log-probability \(\log \pi(a|s)\) for use in policy gradient updates.
- Apply this to an environment with continuous actions (e.g. Pendulum-v1).

Concept and real-world RL

For continuous actions (e.g. torque, throttle), we cannot use a softmax over a finite set. Instead we use a continuous distribution, often a Gaussian: \(\pi(a|s) = \mathcal{N}(a; \mu(s), \sigma(s)^2)\). The policy network outputs \(\mu(s)\) and \(\log \sigma(s)\) (the log-std, for numerical stability); we sample \(a = \mu + \sigma \cdot z\) with \(z \sim \mathcal{N}(0,1)\). The log-probability is \(\log \pi(a|s) = -\frac{1}{2}\left(\log(2\pi) + 2\log\sigma + \frac{(a-\mu)^2}{\sigma^2}\right)\). In robot control (e.g. Pendulum, MuJoCo), actions are continuous; the same Gaussian policy is used in REINFORCE, actor-critic, and PPO for continuous control. ...
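The sampling and log-probability formulas above can be sketched in NumPy. This is a minimal illustration of the math only (the function names and the toy Pendulum-like dimensions are my own, not from the chapter); in a real agent, \(\mu(s)\) and \(\log\sigma(s)\) would come from a neural network and the log-probabilities would feed the policy gradient loss.

```python
import numpy as np

def sample_action(mu, log_std, rng):
    # Reparameterized sample: a = mu + sigma * z, with z ~ N(0, 1).
    # Parameterizing the network output as log_std keeps sigma = exp(log_std)
    # strictly positive without a clamp.
    sigma = np.exp(log_std)
    z = rng.standard_normal(mu.shape)
    return mu + sigma * z

def gaussian_log_prob(a, mu, log_std):
    # log pi(a|s) = -1/2 * (log(2*pi) + 2*log_sigma + (a - mu)^2 / sigma^2),
    # summed over action dimensions for a diagonal Gaussian policy.
    sigma = np.exp(log_std)
    per_dim = -0.5 * (np.log(2.0 * np.pi) + 2.0 * log_std + ((a - mu) / sigma) ** 2)
    return per_dim.sum(axis=-1)

# Toy usage with a 1-D action (Pendulum-style torque), assumed values:
rng = np.random.default_rng(0)
mu = np.array([0.5])       # hypothetical network output mu(s)
log_std = np.array([-0.5]) # hypothetical network output log sigma(s)
a = sample_action(mu, log_std, rng)
logp = gaussian_log_prob(a, mu, log_std)
```

Note that the density is maximized at \(a = \mu\), where the squared term vanishes and the log-probability reduces to \(-\frac{1}{2}(\log(2\pi) + 2\log\sigma)\); this is a quick sanity check for an implementation.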

March 10, 2026 · 3 min · 533 words · codefrydev