Chapter 31: Introduction to Policy-Based Methods

Learning objectives

- Explain when a stochastic policy (outputting a distribution over actions) is essential versus when a deterministic policy suffices.
- Give a real-world scenario where a deterministic policy would fail (e.g. games with hidden information, adversarial settings).
- Relate stochastic policies to exploration and to game AI or recommendation settings where diversity matters.

Concept and real-world RL

Policy-based methods directly parameterize and optimize the policy \(\pi(a|s;\theta)\) instead of learning a value function and deriving actions from it. A stochastic policy outputs a probability distribution over actions; a deterministic policy always picks the same action in a given state. In game AI, when the opponent can observe or anticipate your move (e.g. poker, rock-paper-scissors), a deterministic policy is exploitable: the opponent always knows what you will do. A stochastic policy keeps the opponent uncertain and is essential for mixed strategies. In recommendation, showing a deterministic "best" item every time can create filter bubbles; stochastic policies (or sampling from a distribution) encourage exploration and diversity. For robot navigation in partially observable or noisy settings, randomness can help escape local minima or handle uncertainty. ...
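The exploitability point can be sketched with a toy rock-paper-scissors policy. This is a minimal NumPy illustration, not code from the chapter; the logits, seed, and sample count are made-up assumptions:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over action preferences.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)

# Deterministic policy: always plays the argmax action, so an opponent
# who has seen a few rounds can counter it perfectly.
logits = np.array([2.0, 0.5, 0.1])   # hypothetical preferences: rock, paper, scissors
deterministic_action = int(np.argmax(logits))

# Stochastic policy: samples from the softmax distribution. Uniform logits
# give the uniform mixed strategy, which is unexploitable in this game.
probs = softmax(np.zeros(3))
samples = np.array([rng.choice(3, p=probs) for _ in range(1000)])
counts = np.bincount(samples, minlength=3)
```

Here `deterministic_action` is always 0 (rock), while the stochastic policy spreads its 1000 sampled moves roughly evenly across the three actions.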

March 10, 2026 · 3 min · 547 words · codefrydev

Chapter 33: The REINFORCE Algorithm

Learning objectives

- Implement REINFORCE (Monte Carlo policy gradient): estimate \(\nabla_\theta J\) using the return \(G_t\) from full episodes.
- Use a neural-network policy with a softmax output for discrete actions (e.g. CartPole).
- Observe and explain the high variance of gradient estimates when using raw returns \(G_t\) (no baseline).

Concept and real-world RL

REINFORCE is the simplest policy gradient algorithm: run an episode under \(\pi_\theta\), compute the return \(G_t\) at each step, and update \(\theta \leftarrow \theta + \alpha \sum_t G_t \nabla_\theta \log \pi(a_t|s_t)\). It is on-policy and Monte Carlo (it needs full episodes). The variance of \(G_t\) can be large, especially in long episodes, which makes learning slow or unstable. In game AI, REINFORCE is a baseline for more advanced methods (actor-critic, PPO); in robot control, it is rarely used alone because of its sample inefficiency and variance. Adding a baseline (e.g. a state-value function) reduces variance without introducing bias. ...
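The update rule above can be sketched with a linear softmax policy instead of a neural network (the gradient of \(\log \pi\) has a closed form there). A minimal sketch, assuming a made-up one-step "bandit" episode; the class and function names are illustrative, not from the chapter:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class SoftmaxPolicy:
    """Linear softmax policy: pi(a|s) = softmax(theta @ s) over discrete actions."""
    def __init__(self, n_features, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.theta = 0.01 * rng.standard_normal((n_actions, n_features))

    def probs(self, s):
        return softmax(self.theta @ s)

    def grad_log_pi(self, s, a):
        # d/dtheta log pi(a|s) = (one_hot(a) - pi(.|s)) outer s
        g = -np.outer(self.probs(s), s)
        g[a] += s
        return g

def reinforce_update(policy, episode, alpha=0.1, gamma=0.99):
    """One REINFORCE update from a full episode of (state, action, reward) triples."""
    G = 0.0
    grad = np.zeros_like(policy.theta)
    for s, a, r in reversed(episode):
        G = r + gamma * G                       # return G_t, computed backwards
        grad += G * policy.grad_log_pi(s, a)    # G_t * grad log pi(a_t|s_t)
    policy.theta += alpha * grad                # ascend the policy gradient

# Toy usage: one-step episodes where action 0 always earns reward 1,
# so its probability should rise toward 1.
s = np.array([1.0])
policy = SoftmaxPolicy(n_features=1, n_actions=2)
for _ in range(50):
    reinforce_update(policy, [(s, 0, 1.0)], alpha=0.5)
```

Because every return is positive and attached to action 0, the updates steadily push probability mass onto that action; with mixed or noisy rewards the same raw-\(G_t\) updates become much noisier, which is the variance problem the chapter highlights.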

March 10, 2026 · 3 min · 602 words · codefrydev

Chapter 34: Reducing Variance in Policy Gradients

Learning objectives

- Add a state-value baseline \(V(s)\) to REINFORCE and explain why it reduces variance without introducing bias (when the baseline does not depend on the action).
- Train the baseline network (e.g. an MSE fit to the returns \(G_t\)) alongside the policy.
- Compare the variance of gradient estimates (e.g. the magnitude of parameter updates, or the variance of \(G_t - b(s_t)\)) with and without a baseline.

Concept and real-world RL

The policy gradient with a baseline is \(\mathbb{E}[\nabla_\theta \log \pi(a|s) \, (G_t - b(s))]\). If \(b(s)\) does not depend on the action \(a\), this is still an unbiased estimate of \(\nabla_\theta J\); the baseline only changes the variance. A natural choice is \(b(s) = V^\pi(s)\), the expected return from state \(s\). The term \(G_t - V(s_t)\) is then an estimate of the advantage (how much better this trajectory was than average). In game AI or robot control, lower-variance gradients mean faster and more stable learning; baselines are standard in actor-critic methods and PPO. ...
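The variance reduction is easy to see numerically: raw returns \(G_t\) carry the spread *between* states' expected values, while \(G_t - V(s_t)\) keeps only the within-state noise. A minimal NumPy sketch; the two states, their values, and the noise model are made-up illustration data, and the per-state sample mean stands in for a learned \(V(s)\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two states with very different expected returns V(s), plus unit noise.
V = {"s1": 10.0, "s2": 2.0}
returns = {s: v + rng.standard_normal(1000) for s, v in V.items()}

# Without a baseline, the gradient signal scales with the raw returns G_t,
# whose variance includes the gap between the state means (~16 here).
raw = np.concatenate([returns["s1"], returns["s2"]])

# With b(s) = V(s) (estimated by the per-state mean), only the ~unit
# within-state noise remains in the advantage estimates G_t - b(s_t).
baselined = np.concatenate([g - g.mean() for g in returns.values()])

var_raw = raw.var()
var_baselined = baselined.var()
```

The baselined variance comes out roughly an order of magnitude smaller, while the expected gradient direction is unchanged, matching the unbiasedness argument above.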

March 10, 2026 · 3 min · 593 words · codefrydev