Chapter 31: Introduction to Policy-Based Methods

Learning objectives

- Explain when a stochastic policy (outputting a distribution over actions) is essential versus when a deterministic policy suffices.
- Give a real-world scenario where a deterministic policy would fail (e.g. games with hidden information, adversarial settings).
- Relate stochastic policies to exploration, and to game AI or recommendation settings where diversity matters.

Concept and real-world RL

Policy-based methods directly parameterize and optimize the policy \(\pi(a|s;\theta)\) instead of learning a value function and deriving actions from it. A stochastic policy outputs a probability distribution over actions; a deterministic policy always picks the same action in a given state. In game AI, when the opponent can observe or anticipate your move (e.g. poker, rock-paper-scissors), a deterministic policy is exploitable: the opponent always knows what you will do. A stochastic policy keeps the opponent uncertain and is essential for mixed strategies. In recommendation, showing a deterministic "best" item every time can create filter bubbles; stochastic policies (or sampling from a distribution) encourage exploration and diversity. For robot navigation in partially observable or noisy settings, randomness can help escape local minima and handle uncertainty. ...
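The exploitability point can be sketched in a toy rock-paper-scissors loop. This is illustrative code, not from the chapter; the `exploit` opponent (which counters our previous move) and both policies are hypothetical:

```python
import random

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def deterministic_policy(state):
    return "rock"  # always the same action in a given state

def stochastic_policy(state):
    return random.choice(ACTIONS)  # uniform mixed strategy

def exploit(opponent_last_action):
    # Opponent plays whatever beats our previous move.
    for a in ACTIONS:
        if BEATS[a] == opponent_last_action:
            return a

def play(policy, rounds=10000):
    wins = 0
    last = policy(None)
    for _ in range(rounds):
        my_action = policy(None)
        opp_action = exploit(last)  # opponent anticipates our last move
        if BEATS[my_action] == opp_action:
            wins += 1
        last = my_action
    return wins / rounds
```

Against this exploiting opponent the deterministic policy never wins a single round, while the uniform mixed strategy still wins about one third of the time.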

March 10, 2026 · 3 min · 547 words · codefrydev

Chapter 32: The Policy Objective Function

Learning objectives

- Write the policy gradient theorem for a simple one-step MDP: the gradient of expected reward with respect to the policy parameters.
- Show that \(\nabla_\theta \mathbb{E}[R] = \mathbb{E}[\nabla_\theta \log \pi(a|s;\theta) \, Q^\pi(s,a)]\) (or the equivalent for one step).
- Recognize why this form is useful: we can estimate the expectation from samples (trajectories) without knowing the transition model.

Concept and real-world RL

In policy gradient methods we maximize the expected return \(J(\theta) = \mathbb{E}_\pi[G]\) by gradient ascent on \(\theta\). The policy gradient theorem says that \(\nabla_\theta J\) can be written as an expectation over states and actions under \(\pi\), involving \(\nabla_\theta \log \pi(a|s;\theta)\) and the return (or Q). For a one-step MDP (one state, one action, one reward), the derivation is simple: \(J = \sum_a \pi(a|s) \, r(s,a)\), so \(\nabla_\theta J = \sum_a \nabla_\theta \pi(a|s) \, r(s,a)\). Using the log-derivative trick \(\nabla \pi = \pi \nabla \log \pi\), we get \(\nabla_\theta J = \mathbb{E}[\nabla_\theta \log \pi(a|s) \, Q(s,a)]\). In robot control or game AI, we rarely have the full model; this identity lets us estimate the gradient from sampled actions and rewards alone. ...
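The one-step identity can be checked numerically. A minimal sketch, assuming a softmax policy over three actions with a made-up reward vector: the Monte Carlo estimate of \(\mathbb{E}[\nabla_\theta \log \pi(a) \, r(a)]\) should match the analytic gradient \(\pi(k)(r(k) - J)\).

```python
import math
import random

rewards = [1.0, 2.0, 0.5]   # illustrative one-step rewards r(s, a)
theta = [0.0, 0.0, 0.0]     # one logit per action (softmax policy)

def probs(theta):
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [p / s for p in z]

def analytic_grad(theta):
    # grad_k J = sum_a r(a) grad_k pi(a) = pi(k) (r(k) - J) for a softmax.
    p = probs(theta)
    J = sum(pi * r for pi, r in zip(p, rewards))
    return [p[k] * (rewards[k] - J) for k in range(3)]

def sampled_grad(theta, n=100000):
    # Monte Carlo estimate of E[ grad log pi(a) * r(a) ],
    # using grad_k log pi(a) = 1{a = k} - pi(k) for a softmax policy.
    p = probs(theta)
    g = [0.0, 0.0, 0.0]
    for _ in range(n):
        a = random.choices(range(3), weights=p)[0]
        for k in range(3):
            g[k] += ((1.0 if a == k else 0.0) - p[k]) * rewards[a]
    return [x / n for x in g]
```

The sampled estimate converges to the analytic gradient as `n` grows, without ever using the reward table inside the estimator beyond the sampled outcomes.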

March 10, 2026 · 3 min · 585 words · codefrydev

Chapter 41: The Problem with Standard Policy Gradients

Learning objectives

- Demonstrate how a too-large step size in policy gradient updates can cause policy collapse (e.g. one action gets probability near 1 too quickly) and loss of exploration.
- Visualize policy probabilities over time in a simple bandit problem under different learning rates.
- Relate this to the motivation for trust-region and clipped methods (e.g. PPO, TRPO).

Concept and real-world RL

The standard policy gradient update \(\theta \leftarrow \theta + \alpha \nabla_\theta J\) can be unstable: a single bad batch or a large step can make the policy assign near-zero probability to previously good actions (policy collapse). In a multi-armed bandit (or a simple MDP) this is easy to see: with a large \(\alpha\), the policy can become deterministic too fast and get stuck. In robot control and game AI we want to avoid catastrophic updates; PPO (clipped objective) and TRPO (KL constraint) limit how much the policy can change per update. This chapter illustrates the problem in a minimal setting. ...
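A minimal sketch of the experiment, under assumed settings (two Gaussian-reward arms, REINFORCE on softmax logits; the reward means and `steps` are illustrative): running `run` with a small versus a large `alpha` and plotting the returned probabilities shows the collapse.

```python
import math
import random

random.seed(0)

def run(alpha, steps=200):
    # Two-armed bandit: arm 1 is slightly better on average, but noisy.
    theta = [0.0, 0.0]  # softmax logits
    for _ in range(steps):
        z = [math.exp(t) for t in theta]
        p = [x / sum(z) for x in z]
        a = random.choices([0, 1], weights=p)[0]
        r = random.gauss(0.0 if a == 0 else 0.2, 1.0)
        # REINFORCE update: theta_k += alpha * grad_k log pi(a) * r,
        # with grad_k log pi(a) = 1{a = k} - p_k for a softmax policy.
        for k in range(2):
            theta[k] += alpha * ((1.0 if a == k else 0.0) - p[k]) * r
    z = [math.exp(t) for t in theta]
    return [x / sum(z) for x in z]
```

With a large `alpha` (say 5.0), a few lucky or unlucky rewards swing the logits so far that one arm's probability saturates near 1 and exploration effectively stops; with a small `alpha` the distribution moves gradually toward the better arm.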

March 10, 2026 · 3 min · 563 words · codefrydev

Chapter 77: Generative Adversarial Imitation Learning (GAIL)

Learning objectives

- Implement GAIL: train a discriminator D(s, a) to distinguish state-action pairs drawn from the expert versus those from the current policy, and use the discriminator output (or log D) as the reward for a policy gradient method.
- Train the policy to maximize the discriminator reward (i.e. to fool the discriminator) while the discriminator tries to tell expert from agent.
- Test on a simple task (e.g. CartPole or MuJoCo) and compare imitation quality with behavioral cloning.
- Explain the connection to GANs: the policy is the generator, and the discriminator provides the learning signal.
- Relate GAIL to robot navigation and game AI, where we have expert demonstrations and want to match the expert distribution without hand-designed rewards.

Concept and real-world RL ...
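The reward-shaping step can be sketched in isolation. This is not the full GAIL training loop, just the surrogate reward: assuming a discriminator whose output D(s, a) is pushed toward 1 on expert pairs and toward 0 on agent pairs, one common choice rewards the policy for pairs the discriminator mistakes for expert behavior.

```python
import math

def sigmoid(x):
    # Squash a discriminator logit into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def gail_reward(d_out, eps=1e-8):
    # Surrogate reward from the discriminator output D(s, a):
    # high when D is near 1 (the pair "looks expert"), low when D is near 0.
    return -math.log(1.0 - d_out + eps)
```

This reward is then fed to an ordinary policy gradient method in place of an environment reward, which is exactly the generator/discriminator split the GAN analogy describes.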

March 10, 2026 · 4 min · 704 words · codefrydev

Function Approximation and Deep RL

This page covers the function approximation and deep RL concepts you need for the preliminary assessment: why we need function approximation, the policy gradient update, exploration in DQN, experience replay, and the advantage of actor-critic.

Why this matters for RL

In large or continuous state spaces we cannot store a value per state; we use a parameterized function (e.g. a neural network) to approximate values or policies. This leads to policy gradient methods (maximize return) and value-based methods with function approximation (e.g. DQN). DQN uses experience replay and exploration (e.g. ε-greedy); actor-critic combines a policy (the actor) and a value function (the critic) for lower-variance policy gradients. You need to understand why function approximation is necessary and how these pieces fit together. ...
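Of the pieces listed above, experience replay is the easiest to make concrete. A minimal sketch of the kind of buffer DQN uses (the class and its capacity are illustrative, not from the page): store transitions as they arrive, evict the oldest when full, and sample decorrelated minibatches for updates.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        # deque with maxlen drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random minibatch breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly from past experience is what lets DQN reuse data and stabilize updates compared with learning from each transition once, in order.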

March 10, 2026 · 7 min · 1400 words · codefrydev

Phase 4 Deep RL Quiz

Use this quiz after completing Volumes 3–5 (or the Phase 4 coding challenges). If you can answer at least 9 of 12 correctly, you are ready for Phase 5 and Volume 6.

1. Function approximation

Q: Why is function approximation necessary in RL for large or continuous state spaces?

A: Tabular methods store one value per state (or state-action pair), and the number of states can be huge or infinite. Function approximation uses a parameterized function (e.g. a neural network) so that a fixed number of parameters represents values for all states and generalizes from seen to unseen states. ...
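The quiz answer on function approximation can be made concrete with a tiny sketch: a linear value function with three parameters covers an unbounded (real-valued) state space. The polynomial features here are hand-picked for illustration only.

```python
def features(state):
    # Fixed-size feature vector phi(s) for any real-valued state.
    x = float(state)
    return [1.0, x, x * x]

def value(theta, state):
    # Linear function approximation: v(s) = theta . phi(s).
    # Three parameters assign a value to every state, seen or unseen.
    return sum(t * f for t, f in zip(theta, features(state)))
```

Any state, including ones never visited during training, gets a value from the same three parameters; that generalization is exactly what a tabular method cannot provide.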

March 10, 2026 · 4 min · 814 words · codefrydev