Chapter 37: Asynchronous Advantage Actor-Critic (A3C)

Learning objectives: Implement A3C with multiple worker processes, each running its own environment and asynchronously updating a global shared network. Understand the trade-off: A3C can be faster on multi-core CPUs (no synchronization wait) but is often less stable than A2C due to asynchronous gradient updates. Compare the training speed (wall clock and/or sample efficiency) of A3C vs A2C on CartPole.

Concept and real-world RL: A3C (Asynchronous Advantage Actor-Critic) runs multiple workers in parallel, each collecting experience and pushing gradient updates to a global network. Workers do not wait for each other, so gradients are asynchronous and potentially stale. In game AI and early deep RL, A3C was popular for leveraging many CPU cores; in practice, A2C (synchronous) or PPO often gives more stable and reproducible results. The idea of parallel environments and shared parameters remains central; the main difference is synchronous (A2C) vs asynchronous (A3C) updates. ...
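The effect of asynchronous, potentially stale gradients can be illustrated without any multiprocessing. A minimal sketch (an illustration assumed here, not code from the chapter): two simulated "workers" each compute a gradient from a stale snapshot of a shared parameter and apply it Hogwild-style while minimizing \(f(\theta) = \theta^2/2\):

```python
def async_toy(n_updates=100, lr=0.1):
    """Toy model of A3C-style updates: each 'worker' computes a gradient
    from a stale snapshot of the shared parameter, applies it to the
    shared value, then re-syncs its snapshot (minimizing f(x) = x^2 / 2)."""
    theta = 5.0                      # shared parameter
    snapshots = [theta, theta]       # each worker's (stale) local copy
    for i in range(n_updates):
        w = i % 2                    # alternate between the two workers
        grad = snapshots[w]          # d/dx (x^2 / 2) at the stale snapshot
        theta -= lr * grad           # asynchronous update of the shared parameter
        snapshots[w] = theta         # worker pulls the latest parameters
    return theta
```

Despite the stale gradients, the shared parameter still converges here; with more workers, larger staleness, or bigger learning rates, the same mechanism produces the instability mentioned above.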

March 10, 2026 · 3 min · 556 words · codefrydev

Chapter 38: Continuous Action Spaces

Learning objectives: Design a policy network for continuous actions that outputs the mean and log-standard deviation of a Gaussian (or similar) distribution. Sample actions from the distribution and compute the log-probability \(\log \pi(a|s)\) for use in policy gradient updates. Apply this to an environment with continuous actions (e.g. Pendulum-v1).

Concept and real-world RL: For continuous actions (e.g. torque, throttle), we cannot use a softmax over a finite set. Instead we use a continuous distribution, often a Gaussian: \(\pi(a|s) = \mathcal{N}(a; \mu(s), \sigma(s)^2)\). The policy network outputs \(\mu(s)\) and \(\log \sigma(s)\) (the log-std, for numerical stability); we sample \(a = \mu + \sigma \cdot z\) with \(z \sim \mathcal{N}(0,1)\). The log-probability is \(\log \pi(a|s) = -\frac{1}{2}\left(\log(2\pi) + 2\log\sigma + \frac{(a-\mu)^2}{\sigma^2}\right)\). In robot control (e.g. Pendulum, MuJoCo), actions are continuous; the same Gaussian policy is used in REINFORCE, actor-critic, and PPO for continuous control. ...
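The sampling step and the log-probability formula above fit in a few lines of NumPy. A minimal sketch (the function name is illustrative, not from the chapter):

```python
import numpy as np

def sample_gaussian_action(mu, log_std, rng):
    """Sample a = mu + sigma * z and return (a, log pi(a|s)) for a
    diagonal Gaussian policy; the log-std parameterization keeps
    sigma = exp(log_std) positive and numerically stable."""
    sigma = np.exp(log_std)
    z = rng.standard_normal(mu.shape)
    a = mu + sigma * z
    # log pi(a|s) = -1/2 * sum( log(2*pi) + 2*log(sigma) + (a - mu)^2 / sigma^2 )
    log_prob = -0.5 * np.sum(np.log(2 * np.pi) + 2 * log_std
                             + ((a - mu) / sigma) ** 2)
    return a, log_prob
```

In a real agent, `mu` and `log_std` come from the policy network's output heads, and `log_prob` is what enters the policy gradient loss.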

March 10, 2026 · 3 min · 533 words · codefrydev

Chapter 39: Deep Deterministic Policy Gradient (DDPG)

Learning objectives: Implement DDPG: a deterministic policy (actor) plus a Q-function (critic), with target networks and a replay buffer. Use Ornstein-Uhlenbeck (OU) noise (or Gaussian noise) on the action for exploration in continuous spaces. Train on Pendulum-v1 (or similar) and plot the learning curve.

Concept and real-world RL: DDPG is an actor-critic method for continuous actions: the actor outputs a single action \(\mu(s)\) (no distribution), and the critic learns \(Q(s,a)\). The policy is updated to maximize \(Q(s, \mu(s))\), with the gradient flowing through Q into the actor. Target networks and a replay buffer stabilize learning (as in DQN). Exploration comes from adding noise to the action (e.g. OU noise for temporally correlated exploration, or simple Gaussian noise). In robot control (Pendulum, MuJoCo), DDPG is a baseline for continuous tasks; TD3 and SAC improve on it with clipped double Q-learning and stochastic policies. ...
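For illustration, a small OU-noise process of the kind used for DDPG exploration (the parameter values are common defaults, assumed here rather than taken from the chapter):

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: mean-reverting, temporally correlated
    noise added to the deterministic action for exploration."""

    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(dim)
        self.rng = np.random.default_rng(seed)

    def sample(self):
        # Euler step: dx = theta * (0 - x) * dt + sigma * sqrt(dt) * N(0, I)
        dx = (-self.theta * self.x * self.dt
              + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape))
        self.x = self.x + dx
        return self.x.copy()
```

The noisy action is then clipped to the action bounds, e.g. `np.clip(action + noise.sample(), low, high)`; because the process is mean-reverting, consecutive samples are correlated rather than independent.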

March 10, 2026 · 3 min · 524 words · codefrydev

Chapter 40: Twin Delayed DDPG (TD3)

Learning objectives: Implement the TD3 improvements over DDPG: two critics (clipped double Q-learning), delayed policy updates (update the actor less often than the critics), and target policy smoothing (add noise to the target action). Compare performance on a continuous control task (e.g. HalfCheetah if feasible, or Pendulum / BipedalWalker) against vanilla DDPG.

Concept and real-world RL: TD3 (Twin Delayed DDPG) addresses DDPG's overestimation and instability: (1) Two Q-networks: take the minimum of the two Q-values for the target (in the spirit of Double DQN), reducing overestimation. (2) Delayed policy updates: update the actor every \(d\) critic updates so the critic is more accurate before the actor is trained. (3) Target policy smoothing: add small clipped Gaussian noise to \(\mu_{target}(s')\) when computing the target, so the target is less sensitive to the exact action. In robot control and simulated benchmarks (HalfCheetah, Hopper), TD3 often achieves better and more stable performance than DDPG. ...
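All three ingredients meet in the critic's target computation. A sketch with placeholder callables for the target networks (names and defaults are assumptions for illustration):

```python
import numpy as np

def td3_target(r, done, s_next, actor_t, q1_t, q2_t, rng,
               gamma=0.99, noise_std=0.2, noise_clip=0.5,
               a_low=-1.0, a_high=1.0):
    """TD3 critic target: smoothed target action + clipped double-Q minimum."""
    a_next = actor_t(s_next)
    # target policy smoothing: clipped Gaussian noise on the target action
    eps = np.clip(rng.normal(0.0, noise_std, np.shape(a_next)),
                  -noise_clip, noise_clip)
    a_next = np.clip(a_next + eps, a_low, a_high)
    # clipped double Q-learning: minimum of the two target critics
    q_min = np.minimum(q1_t(s_next, a_next), q2_t(s_next, a_next))
    return r + gamma * (1.0 - done) * q_min
```

The delayed actor update is the remaining piece: in the training loop the actor (and the target networks) are updated only once every `d` calls to the critic update.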

March 10, 2026 · 3 min · 555 words · codefrydev

Chapter 41: The Problem with Standard Policy Gradients

Learning objectives: Demonstrate how a too-large step size in policy gradient updates can cause policy collapse (e.g. one action reaching probability near 1 too quickly) and loss of exploration. Visualize the policy probabilities over time in a simple bandit problem under different learning rates. Relate this to the motivation for trust-region and clipped methods (e.g. TRPO, PPO).

Concept and real-world RL: The standard policy gradient update \(\theta \leftarrow \theta + \alpha \nabla_\theta J\) can be unstable: a single bad batch or a large step can make the policy assign near-zero probability to previously good actions (policy collapse). In a multi-armed bandit (or a simple MDP), this is easy to see: with a large \(\alpha\), the policy can become deterministic too fast and get stuck. In robot control and game AI, we want to avoid catastrophic updates; PPO (clipped objective) and TRPO (KL constraint) limit how much the policy can change per update. This chapter illustrates the problem in a minimal setting. ...
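A tiny REINFORCE bandit makes the effect concrete. A sketch under assumed arm means and reward noise (not the chapter's exact setup):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bandit_reinforce(alpha, steps=200, seed=0):
    """REINFORCE on a 2-armed Gaussian bandit; returns the final policy.
    With a large alpha, the policy saturates (collapses) almost immediately."""
    rng = np.random.default_rng(seed)
    means = np.array([1.0, 0.9])          # arm 0 is slightly better
    theta = np.zeros(2)                   # softmax policy parameters
    for _ in range(steps):
        p = softmax(theta)
        a = rng.choice(2, p=p)
        r = means[a] + rng.normal(0.0, 0.1)
        grad_log = -p.copy()
        grad_log[a] += 1.0                # grad of log pi(a) for a softmax policy
        theta += alpha * r * grad_log     # vanilla policy gradient step
    return softmax(theta)
```

Comparing `bandit_reinforce(10.0)` against `bandit_reinforce(0.01)` shows the large-step policy ending up nearly deterministic after a handful of updates, whichever arm it happened to sample first, while the small-step policy keeps exploring.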

March 10, 2026 · 3 min · 563 words · codefrydev

Chapter 42: Trust Region Policy Optimization (TRPO)

Learning objectives: Read and summarize the TRPO paper: the constrained optimization problem (maximize expected advantage subject to a KL constraint between the old and new policy). Explain why the natural gradient (using the Fisher information matrix) approximates the KL-constrained step. Relate the KL constraint to preventing too-large policy updates (connecting back to Chapter 41).

Concept and real-world RL: TRPO (Trust Region Policy Optimization) limits each policy update so that the new policy stays close to the old one in KL divergence: maximize \(\mathbb{E}\left[ \frac{\pi(a|s)}{\pi_{old}(a|s)} A^{old}(s,a) \right]\) subject to \(\mathbb{E}[ D_{KL}(\pi_{old} \| \pi) ] \leq \delta\). This prevents the collapse and instability seen in vanilla policy gradients. The natural gradient (preconditioning by the Fisher information matrix) gives an approximate solution to this constrained problem. In robot control and safety-critical settings, TRPO's monotonic improvement guarantee (under assumptions) is appealing; in practice, PPO is often preferred for its simpler implementation (a clipped objective instead of constrained optimization). ...
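The constraint itself is cheap to evaluate for discrete actions. A sketch (function name assumed) of the per-state KL term that a trust-region check compares against \(\delta\):

```python
import numpy as np

def kl_old_new(p_old, p_new, eps=1e-12):
    """D_KL(pi_old || pi_new) for categorical action distributions at one state."""
    p_old = np.clip(p_old, eps, 1.0)
    p_new = np.clip(p_new, eps, 1.0)
    return np.sum(p_old * np.log(p_old / p_new))
```

TRPO bounds the average of this quantity over visited states by \(\delta\); in the full algorithm a backtracking line search shrinks the natural-gradient step until the constraint holds.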

March 10, 2026 · 3 min · 551 words · codefrydev

Chapter 43: Proximal Policy Optimization (PPO): Intuition

Learning objectives: Explain in your own words how the clipped surrogate objective in PPO prevents too-large policy updates without solving a constrained optimization problem (unlike TRPO). Write the clipped loss \(L^{CLIP}(\theta)\) and the unclipped (ratio-based) objective; contrast when they differ. Relate the clip range \(\epsilon\) (e.g. 0.2) to how much the policy can change in one update.

Concept and real-world RL: PPO (Proximal Policy Optimization) keeps the policy update conservative by clipping the probability ratio \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\). The objective is \(L^{CLIP} = \mathbb{E}[ \min( r_t \hat{A}_t, \mathrm{clip}(r_t, 1-\epsilon, 1+\epsilon) \hat{A}_t ) ]\): if the advantage is positive, the objective gains nothing once the ratio exceeds \(1+\epsilon\); if it is negative, nothing once the ratio drops below \(1-\epsilon\). So we never encourage a huge increase in probability for a good action (which could overshoot) or a huge decrease for a bad one. In robot control, game AI, and dialogue, PPO is the default policy gradient choice because it is simple, stable, and effective. ...
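The clipped objective is only a few lines. A NumPy sketch of \(L^{CLIP}\) (to be maximized; the function name is illustrative):

```python
import numpy as np

def ppo_clip_objective(log_prob, old_log_prob, adv, eps=0.2):
    """L^CLIP: mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the batch."""
    ratio = np.exp(log_prob - old_log_prob)            # r_t(theta)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))
```

Note the asymmetry of the `min`: with a positive advantage the objective is capped at \((1+\epsilon)\hat{A}_t\), so pushing the ratio further gives no extra reward (and, in an autodiff framework, no gradient).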

March 10, 2026 · 3 min · 540 words · codefrydev

Chapter 44: PPO: Implementation Details

Learning objectives: Implement Generalized Advantage Estimation (GAE): compute advantage estimates \(\hat{A}_t\) from a trajectory of rewards and value estimates using \(\gamma\) and \(\lambda\). Write the recurrence \(\hat{A}_t = \delta_t + (\gamma\lambda) \delta_{t+1} + (\gamma\lambda)^2 \delta_{t+2} + \cdots\), where \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\). Use GAE in a PPO (or actor-critic) pipeline so the advantages are fed into the policy loss.

Concept and real-world RL: GAE (Generalized Advantage Estimation) provides a bias–variance trade-off for the advantage: \(\hat{A}_t^{GAE} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}\). When \(\lambda=0\), \(\hat{A}_t = \delta_t\) (1-step TD: low variance, high bias). When \(\lambda=1\), \(\hat{A}_t = G_t - V(s_t)\) (Monte Carlo: high variance, low bias). Tuning \(\lambda\) (e.g. 0.95–0.99) balances the two. In robot control and game AI, GAE is the standard way to compute advantages for PPO and actor-critic; it is implemented with a backward loop over the trajectory. ...
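The backward loop might look like the following sketch, which assumes a single trajectory with a bootstrap value for the state after the last step and omits episode-termination masking for brevity:

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Backward pass computing A_t = delta_t + (gamma * lam) * A_{t+1},
    with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    v_next = last_value               # bootstrap value V(s_T)
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * v_next - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
        v_next = values[t]
    return adv
```

Setting `lam=0` recovers the one-step TD errors; `lam=1` with `gamma=1` recovers Monte Carlo returns minus values, matching the two limiting cases above.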

March 10, 2026 · 3 min · 482 words · codefrydev

Chapter 45: Coding PPO from Scratch

Learning objectives: Implement a full PPO agent for LunarLanderContinuous-v2: policy (actor) and value (critic) networks, a rollout buffer, GAE for advantages, and multiple epochs of minibatch updates per rollout. Tune the key hyperparameters (learning rate, clip \(\epsilon\), GAE \(\lambda\), batch size, number of epochs) to achieve successful landings. Relate each component (clip, GAE, value loss, entropy bonus) to stability and sample efficiency.

Concept and real-world RL: PPO in practice: collect a rollout of transitions (e.g. 2048 steps), compute GAE advantages, then perform several epochs of minibatch updates on the same data (policy loss with clipping + value loss + entropy bonus). The rollout buffer stores states, actions, rewards, log-probs, and values; after each rollout we compute the advantages and then iterate over minibatches. LunarLanderContinuous is a 2D landing task with continuous thrust; it is a standard testbed for PPO. In robot control and game AI, this “collect rollout → multiple PPO epochs” loop is the core of most on-policy algorithms. ...
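The "multiple epochs of minibatch updates" step reduces to iterating shuffled index batches over the rollout. A sketch of that inner loop (names assumed):

```python
import numpy as np

def minibatch_indices(n_samples, n_epochs, batch_size, seed=0):
    """Yield shuffled minibatch index arrays: each epoch reshuffles the
    rollout and walks through it in batches (the inner PPO update loop)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        perm = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            yield perm[start:start + batch_size]
```

Each yielded index array selects the matching states, actions, old log-probs, and advantages from the rollout buffer for one gradient step on the combined clip + value + entropy loss.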

March 10, 2026 · 3 min · 532 words · codefrydev

Chapter 46: Maximum Entropy RL

Learning objectives: Derive or state the maximum entropy objective: maximize \(\mathbb{E}\left[ \sum_t r_t + \alpha \mathcal{H}(\pi(\cdot|s_t)) \right]\) (or an equivalent form), where \(\mathcal{H}\) is the entropy. Explain how the entropy term encourages exploration: higher entropy means a more uniform action distribution, so the policy tries more actions. Contrast with standard expected-return maximization (no entropy bonus).

Concept and real-world RL: Maximum entropy RL adds an entropy bonus to the objective so the agent maximizes both return and policy entropy. The optimal policy under this objective is more stochastic (it explores more) and is often easier to learn (multiple modes, robustness). In robot control, SAC (Soft Actor-Critic) uses this idea with automatic temperature tuning; in game AI and recommendation, entropy regularization (e.g. in PPO) prevents the policy from becoming too deterministic too fast. The temperature \(\alpha\) (or an equivalent coefficient) controls the trade-off between return and entropy. ...
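For a discrete policy the entropy bonus is direct to compute; a uniform distribution maximizes it and a deterministic one drives it to zero. A minimal sketch:

```python
import numpy as np

def policy_entropy(probs, eps=1e-12):
    """H(pi) = -sum_a pi(a|s) log pi(a|s); the maximum entropy objective
    adds alpha * H(pi(.|s_t)) to the reward at each step."""
    p = np.clip(probs, eps, 1.0)
    return -np.sum(p * np.log(p))
```

The per-step quantity the agent then maximizes is `r_t + alpha * policy_entropy(pi_s)`, with the temperature `alpha` weighting exploration against return.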

March 10, 2026 · 3 min · 500 words · codefrydev