Chapter 43: Proximal Policy Optimization (PPO): Intuition

Learning objectives

- Explain in your own words how the clipped surrogate objective in PPO prevents overly large policy updates without solving a constrained optimization problem (unlike TRPO).
- Write the clipped loss \(L^{CLIP}(\theta)\) and the unclipped (ratio-based) objective; contrast when they differ.
- Relate the clip range \(\epsilon\) (e.g. 0.2) to how much the policy can change in one update.

Concept and real-world RL

PPO (Proximal Policy Optimization) keeps the policy update conservative by clipping the probability ratio \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\). The objective is \(L^{CLIP}(\theta) = \mathbb{E}_t[\min(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t)]\): if the advantage is positive, the ratio is not allowed to exceed \(1+\epsilon\); if it is negative, the ratio is not allowed to fall below \(1-\epsilon\). So we never encourage a huge increase in probability for a good action (which could overshoot) or a huge decrease for a bad one. In robot control, game AI, and dialogue, PPO is the default policy-gradient choice because it is simple, stable, and effective. ...
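The clipping behavior described above can be sketched in a few lines (a minimal NumPy sketch; the names `ppo_clip_objective`, `ratio`, `adv` are illustrative, not from the chapter):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped)

# Positive advantage: pushing the ratio above 1+eps earns no extra credit.
good = ppo_clip_objective(np.array([1.5]), np.array([2.0]))   # capped at 1.2 * 2.0 = 2.4
# Negative advantage: pushing the ratio below 1-eps earns no extra credit either.
bad = ppo_clip_objective(np.array([0.5]), np.array([-1.0]))   # floored at 0.8 * -1.0 = -0.8
```

Note that the `min` makes the objective pessimistic: once the ratio leaves the clip band in the direction the advantage rewards, the gradient through that sample vanishes, which is exactly how PPO discourages a single update from moving the policy too far.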

March 10, 2026 · 3 min · 540 words · codefrydev

Chapter 44: PPO: Implementation Details

Learning objectives

- Implement Generalized Advantage Estimation (GAE): compute advantage estimates \(\hat{A}_t\) from a trajectory of rewards and value estimates using \(\gamma\) and \(\lambda\).
- Write the recurrence: \(\hat{A}_t = \delta_t + (\gamma\lambda)\,\delta_{t+1} + (\gamma\lambda)^2\,\delta_{t+2} + \cdots\), where \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\).
- Use GAE in a PPO (or actor-critic) pipeline so advantages are fed into the policy loss.

Concept and real-world RL

GAE (Generalized Advantage Estimation) provides a bias–variance trade-off for the advantage: \(\hat{A}_t^{GAE} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}\). When \(\lambda=0\), \(\hat{A}_t = \delta_t\) (1-step TD: low variance, high bias). When \(\lambda=1\), \(\hat{A}_t = G_t - V(s_t)\) (Monte Carlo: high variance, low bias). Tuning \(\lambda\) (e.g. 0.95–0.99) balances the two. In robot control and game AI, GAE is the standard way to compute advantages for PPO and actor-critic; it is implemented with a backward loop over the trajectory. ...
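The backward loop mentioned at the end can be sketched as follows (a minimal sketch assuming a single non-terminating rollout segment, so no done flags; the name `compute_gae` is illustrative):

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Backward recursion A_t = delta_t + gamma * lam * A_{t+1},
    with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    T = len(rewards)
    adv = np.zeros(T)
    next_value = last_value   # bootstrap value for the state after the rollout
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv

adv = compute_gae([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], last_value=0.0)
```

With `lam=0.0` the same function returns the 1-step TD errors \(\delta_t\), matching the low-variance/high-bias end of the trade-off described above.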

March 10, 2026 · 3 min · 482 words · codefrydev

Chapter 45: Coding PPO from Scratch

Learning objectives

- Implement a full PPO agent for LunarLanderContinuous-v2: policy (actor) and value (critic) networks, rollout buffer, GAE for advantages, and multiple epochs of minibatch updates per rollout.
- Tune key hyperparameters (learning rate, clip \(\epsilon\), GAE \(\lambda\), batch size, number of epochs) to achieve successful landings.
- Relate each component (clip, GAE, value loss, entropy bonus) to stability and sample efficiency.

Concept and real-world RL

PPO in practice: collect a rollout of transitions (e.g. 2048 steps), compute GAE advantages, then perform several epochs of minibatch updates on the same data (policy loss with clip + value loss + entropy bonus). The rollout buffer stores states, actions, rewards, log-probs, and values; after each rollout we compute advantages and then iterate over minibatches. LunarLanderContinuous is a 2D landing task with continuous thrust; it is a standard testbed for PPO. In robot control and game AI, this “collect rollout → multiple PPO epochs” loop is the core of most on-policy algorithms. ...
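The “several epochs of minibatch updates on the same rollout” pattern can be sketched as a generator over shuffled minibatches (a sketch only; `ppo_epochs` and the buffer field names are illustrative, and the actual loss computation is omitted):

```python
import numpy as np

def ppo_epochs(data, n_epochs=4, minibatch_size=64, rng=None):
    """Yield shuffled minibatches: n_epochs full passes over one rollout."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(data["obs"])
    for _ in range(n_epochs):
        idx = rng.permutation(n)             # reshuffle each epoch
        for start in range(0, n, minibatch_size):
            mb = idx[start:start + minibatch_size]
            yield {k: v[mb] for k, v in data.items()}

# A 2048-step rollout buffer (8-dim obs, 2-dim action, as in LunarLanderContinuous).
rollout = {"obs": np.zeros((2048, 8)), "actions": np.zeros((2048, 2)),
           "logp": np.zeros(2048), "adv": np.zeros(2048), "returns": np.zeros(2048)}
n_batches = sum(1 for _ in ppo_epochs(rollout))   # 4 epochs * 32 minibatches = 128
```

Each yielded minibatch would feed the clipped policy loss, value loss, and entropy bonus in the full agent.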

March 10, 2026 · 3 min · 532 words · codefrydev

Chapter 48: SAC vs. PPO

Learning objectives

- Run SAC and PPO on the same continuous control tasks (e.g. Hopper, Walker2d).
- Compare final performance, sample efficiency (return vs. env steps), and wall-clock time.
- Discuss when to choose one over the other (sample efficiency, stability, tuning effort, off-policy vs. on-policy).

Concept and real-world RL

SAC is off-policy (replay buffer) and maximizes entropy; PPO is on-policy (rollouts) and uses a clipped objective. SAC often achieves higher sample efficiency (fewer env steps to reach good performance) but can be sensitive to hyperparameters and replay buffer size; PPO is more robust and easier to tune in many settings. In robot control benchmarks (Hopper, Walker2d, HalfCheetah), both are standard; in game AI and RLHF, PPO is more common. The choice depends on data cost (can we afford many env steps?), the need for off-policy learning (e.g. using logged data), and engineering preference. ...

March 10, 2026 · 3 min · 481 words · codefrydev

Chapter 51: Model-Free vs. Model-Based RL

Learning objectives

- Compare model-free (e.g. PPO) and model-based (e.g. Dreamer) RL in terms of sample efficiency on a continuous control task like Walker.
- Explain why model-based methods can achieve more reward per real environment step (use of imagined rollouts).
- Identify trade-offs: model bias, computation, and implementation complexity.

Concept and real-world RL

Model-free methods learn a policy or value function directly from experience; model-based methods learn a dynamics model and use it for planning or imagined rollouts. Model-based RL can be more sample-efficient because each real transition can be reused many times in the model (short rollouts, planning). In robot navigation and trading, where real data is expensive, sample efficiency matters; in game AI, model-based methods (e.g. MuZero) combine learning and planning. The downside is model error (compounding over long rollouts) and extra computation. ...

March 10, 2026 · 3 min · 446 words · codefrydev

Chapter 74: Introduction to Imitation Learning

Learning objectives

- Collect expert demonstrations (state-action pairs or trajectories) from a trained PPO agent on LunarLander.
- Train a behavioral cloning (BC) agent: supervised learning to predict the expert’s action given the state.
- Evaluate the BC policy in the environment and compare its return and behavior to the expert.
- Explain the assumptions of behavioral cloning (i.i.d. states from the expert distribution) and when it works well.
- Relate imitation learning to robot navigation (learning from human demos) and dialogue (learning from human responses).

Concept and real-world RL ...
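The supervised step at the heart of behavioral cloning can be sketched with a linear softmax policy trained by cross-entropy on (state, expert action) pairs (a toy sketch with hypothetical data; `bc_train` and the one-feature "expert" are illustrative, not the LunarLander setup):

```python
import numpy as np

def bc_train(states, expert_actions, n_actions, lr=0.1, epochs=200, seed=0):
    """Behavioral cloning: minimize cross-entropy between a linear softmax
    policy pi(a|s) and the expert's chosen actions."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(states.shape[1], n_actions))
    n = len(states)
    for _ in range(epochs):
        logits = states @ W
        logits -= logits.max(axis=1, keepdims=True)       # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = probs.copy()
        grad[np.arange(n), expert_actions] -= 1.0          # d(CE)/d(logits)
        W -= lr * (states.T @ grad) / n
    return W

# Toy "expert": action 1 when the single feature is positive, else action 0.
states = np.array([[1.0], [2.0], [-1.0], [-2.0]])
actions = np.array([1, 1, 0, 0])
W = bc_train(states, actions, n_actions=2)
pred = np.argmax(states @ W, axis=1)   # matches the expert on these states
```

This only works as well as the i.i.d. assumption in the objectives: the cloned policy is trained on states the expert visits, not the states its own mistakes lead to.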

March 10, 2026 · 3 min · 626 words · codefrydev

Chapter 80: RL from Human Feedback (RLHF) Basics

Learning objectives

- Implement a Bradley-Terry model to learn a reward function from pairwise comparisons of two trajectories (or segments): given (τ^w, τ^l) meaning “τ^w is preferred over τ^l,” fit r so that E[r(τ^w)] > E[r(τ^l)].
- Use the learned reward to train a policy with PPO (or another policy gradient method): maximize expected return under r.
- Explain the RLHF pipeline: collect preferences → train reward model → train policy on reward model.
- Test on a simple environment with simulated preferences (e.g. prefer longer/higher-return trajectories) and verify the policy improves.
- Relate RLHF to dialogue (prefer helpful/harmless responses) and recommendation (prefer engaging content).

Concept and real-world RL ...
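The Bradley-Terry fitting step can be sketched in its simplest form, learning one scalar reward per trajectory by gradient descent on simulated preference pairs (a toy sketch; `bt_loss_and_grad` and the three-trajectory setup are illustrative):

```python
import numpy as np

def bt_loss_and_grad(rw, rl):
    """Bradley-Terry: P(w preferred over l) = sigmoid(r_w - r_l).
    Returns -log sigmoid(r_w - r_l) and its gradient wrt r_w."""
    p = 1.0 / (1.0 + np.exp(-(rw - rl)))
    return -np.log(p), -(1.0 - p)   # gradient wrt r_w (wrt r_l it is +(1-p))

rewards = np.zeros(3)               # one scalar reward per trajectory 0, 1, 2
prefs = [(0, 1), (1, 2), (0, 2)]    # (winner, loser) pairs: 0 > 1 > 2
for _ in range(500):
    for w, l in prefs:
        _, g = bt_loss_and_grad(rewards[w], rewards[l])
        rewards[w] -= 0.1 * g       # g is negative, so the winner's reward rises
        rewards[l] += 0.1 * g       # and the loser's reward falls
# rewards now satisfy rewards[0] > rewards[1] > rewards[2]
```

In the full pipeline the scalar table would be replaced by a parametric reward model, and the learned r would then drive PPO.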

March 10, 2026 · 4 min · 708 words · codefrydev

Chapter 95: Training Large Language Models with PPO

Learning objectives

- Implement a PPO loop to fine-tune a small language model (e.g. GPT-2 small or DistilGPT-2) for text generation with a simple reward (e.g. positive sentiment, or length).
- Include a KL penalty (or KL constraint) so that the updated policy does not deviate too far from the initial (reference) policy, preventing mode collapse and maintaining fluency.
- Generate sequences with the current policy, compute reward for each sequence, and update the policy with PPO (clip + KL).
- Observe that without KL penalty the policy may collapse (e.g. always output the same high-reward token); with KL it stays diverse.
- Relate to dialogue and RLHF: same PPO+KL setup is used for aligning LMs with human preferences.

Concept and real-world RL ...
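One common way to wire in the KL penalty is to fold it into the per-token reward before PPO sees it (a sketch under that assumption; `kl_shaped_reward` and the per-token KL estimate \(\log\pi(a_t) - \log\pi_{ref}(a_t)\) are illustrative choices, not the only ones):

```python
import numpy as np

def kl_shaped_reward(task_reward, logp_policy, logp_ref, beta=0.1):
    """Per-token shaped reward: -beta * KL-penalty per token, plus the
    sequence-level task reward added on the final token."""
    kl_est = logp_policy - logp_ref   # per-token KL estimate for sampled tokens
    shaped = -beta * kl_est
    shaped[-1] += task_reward         # e.g. sentiment score of the full sequence
    return shaped

logp_policy = np.array([-0.5, -0.2, -0.1])  # policy log-probs of sampled tokens
logp_ref    = np.array([-0.6, -0.6, -0.6])  # reference (initial LM) log-probs
r = kl_shaped_reward(task_reward=1.0, logp_policy=logp_policy, logp_ref=logp_ref)
```

Tokens where the policy has drifted far above the reference probability get penalized, which is what keeps the model from collapsing onto a single high-reward output.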

March 10, 2026 · 4 min · 730 words · codefrydev

Chapter 96: Implementing RLHF in NLP

Learning objectives

- Collect (or simulate) human preference data: pairs of model responses to the same prompt, with a label indicating which response is preferred.
- Train a reward model using the Bradley-Terry loss: P(τ^w preferred over τ^l) = σ(r(τ^w) - r(τ^l)), where r is the reward model (e.g. an LM that outputs a scalar, or a separate head).
- Fine-tune the language model with PPO using the learned reward model as the reward (and a KL penalty to the initial LM).
- Evaluate on held-out prompts: generate with the fine-tuned LM and score with the reward model; optionally compare with the initial LM.
- Relate to the dialogue anchor and real RLHF pipelines (InstructGPT, etc.).

Concept and real-world RL ...
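The reward-model training step can be sketched with a linear model over response features standing in for the LM head (a toy sketch; `train_reward_model` and the two-feature responses are hypothetical stand-ins for real embeddings):

```python
import numpy as np

def train_reward_model(feat_w, feat_l, lr=0.5, epochs=300, seed=0):
    """Linear reward model r(x) = w @ phi(x), trained with the Bradley-Terry
    loss -log sigmoid(r(winner) - r(loser)) over preference pairs."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=feat_w.shape[1])
    for _ in range(epochs):
        d = feat_w @ w - feat_l @ w                    # reward margin per pair
        p = 1.0 / (1.0 + np.exp(-d))
        grad = ((p - 1.0)[:, None] * (feat_w - feat_l)).mean(axis=0)
        w -= lr * grad
    return w

# Toy features for (winner, loser) response pairs; feature 0 drives preference.
feat_w = np.array([[1.0, 0.2], [0.8, 0.1]])
feat_l = np.array([[0.1, 0.3], [0.0, 0.2]])
w = train_reward_model(feat_w, feat_l)
# Held-out check: the model assigns higher reward when feature 0 is larger.
assert np.array([1.0, 0.0]) @ w > np.array([0.0, 0.0]) @ w
```

In a real pipeline, phi(x) would be the LM's representation of a full response and w the scalar head; the held-out check above mirrors the evaluation step in the objectives.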

March 10, 2026 · 4 min · 705 words · codefrydev

Chapter 97: Direct Preference Optimization (DPO)

Learning objectives

- Derive the DPO loss from the Bradley-Terry preference model and the optimal policy under a KL constraint to the reference policy (the closed-form mapping from reward to policy in the BT model).
- Implement DPO: train the language model directly on preference data (prefer τ^w over τ^l) using the DPO loss, without training a separate reward model.
- Compare with PPO (reward model + PPO fine-tuning) in terms of preference accuracy, reward model score, and implementation complexity.
- Explain the advantage of DPO: no reward model, no PPO loop; just a supervised loss on preferences.
- Relate DPO to dialogue and RLHF (alternative to reward model + PPO).

Concept and real-world RL ...
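The DPO loss itself is short enough to write out directly, given the policy and reference log-probabilities of the chosen and rejected responses (a minimal sketch; the function name and the example log-prob values are illustrative):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]).
    The log-ratio to the reference policy plays the role of an implicit reward."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# Policy already favors the chosen response relative to the reference -> low loss.
low = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-7.0, ref_logp_l=-7.0)
# Policy favors the rejected response -> higher loss.
high = dpo_loss(logp_w=-9.0, logp_l=-5.0, ref_logp_w=-7.0, ref_logp_l=-7.0)
```

Note there is no reward model and no sampling loop here: the loss is a supervised function of log-probs, which is exactly the implementation-complexity advantage listed above.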

March 10, 2026 · 4 min · 670 words · codefrydev