Chapter 27: Dueling DQN

Learning objectives

- Implement the dueling architecture: shared backbone, then a value stream \(V(s)\) and an advantage stream \(A(s,a)\), with \(Q(s,a) = V(s) + \big(A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a')\big)\).
- Understand why separating \(V\) and \(A\) can help when the value of the state is similar across actions (e.g. in safe states).
- Compare learning speed and final performance with standard DQN on CartPole.

Concept and real-world RL

In many states, the value of being in that state is similar regardless of the action (e.g. when no danger is nearby). The dueling architecture represents \(Q(s,a) = V(s) + A(s,a)\), but to make the decomposition identifiable we use \(Q(s,a) = V(s) + \big(A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a')\big)\). The network learns \(V(s)\) and \(A(s,a)\) in separate heads after a shared feature layer. This can speed up learning when the advantage (the difference between actions) is small in many states. The architecture is used in Rainbow and other modern DQN variants. ...
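As a concrete illustration, the head arithmetic can be sketched in a few lines of numpy; the linear heads and random weights below are illustrative stand-ins for real network layers, not a full DQN.

```python
import numpy as np

def dueling_q(features, w_v, w_a):
    """Combine value and advantage streams into Q-values.

    features: (d,) shared backbone output for one state
    w_v: (d,) weights of the value head      -> scalar V(s)
    w_a: (d, n_actions) advantage head       -> A(s, a)
    """
    v = features @ w_v                # V(s), scalar
    a = features @ w_a                # A(s, a), shape (n_actions,)
    return v + (a - a.mean())         # mean-centered, identifiable Q(s, a)

rng = np.random.default_rng(0)
phi = rng.normal(size=4)              # pretend backbone output
w_v = rng.normal(size=4)
w_a = rng.normal(size=(4, 3))
q = dueling_q(phi, w_v, w_a)
```

Because the advantages are mean-centered, the mean of the resulting Q-values equals \(V(s)\), which is exactly what pins down the otherwise non-identifiable \(V\)/\(A\) split.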

March 10, 2026 · 3 min · 577 words · codefrydev

Chapter 28: Prioritized Experience Replay (PER)

Learning objectives

- Implement prioritized replay: assign each transition a priority (e.g. the TD error \(|\delta|\)) and sample with probability proportional to \(p_i^\alpha\).
- Use a sum tree (or a simpler alternative) for efficient sampling and priority updates.
- Apply importance-sampling weights \(w_i = (N \cdot P(i))^{-\beta} / \max_j w_j\) to correct the bias introduced by non-uniform sampling.

Concept and real-world RL

Prioritized Experience Replay (PER) samples transitions with probability proportional to their "priority", often the TD error, so that surprising or informative transitions are replayed more often. This can speed up learning but introduces bias (the update distribution is not the same as the uniform replay distribution). Importance-sampling weights correct for this by weighting the gradient update so that in expectation we recover the uniform case. A sum tree allows \(O(\log N)\) sampling and priority updates. PER is used in Rainbow and other sample-efficient DQN variants. ...
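A minimal sum tree can be sketched as follows; this is a pure numpy sketch assuming the capacity is a power of two, and the class and method names are illustrative rather than any particular library's API.

```python
import numpy as np

class SumTree:
    """Binary tree of priority sums; leaves hold per-transition priorities."""

    def __init__(self, capacity):
        self.capacity = capacity              # must be a power of two here
        self.tree = np.zeros(2 * capacity)    # index 1 is the root

    def update(self, idx, priority):
        """Set leaf priority and refresh sums up to the root: O(log N)."""
        i = idx + self.capacity
        self.tree[i] = priority
        i //= 2
        while i >= 1:
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def total(self):
        return self.tree[1]

    def sample(self, s):
        """Find the leaf whose cumulative-priority interval contains s."""
        i = 1
        while i < self.capacity:
            left = 2 * i
            if s <= self.tree[left]:
                i = left
            else:
                s -= self.tree[left]
                i = left + 1
        return i - self.capacity

tree = SumTree(4)
for idx, p in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree.update(idx, p)

# Importance-sampling weights, normalized by the max weight as in the text.
N, beta = 4, 0.4
probs = np.array([tree.tree[tree.capacity + i] / tree.total() for i in range(N)])
w = (N * probs) ** (-beta)
w /= w.max()
```

Drawing `s` uniformly from `[0, tree.total()]` then calling `sample(s)` yields index `i` with probability proportional to its priority.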

March 10, 2026 · 3 min · 633 words · codefrydev

Chapter 29: Noisy Networks for Exploration

Learning objectives

- Implement noisy linear layers: \(y = (W + \sigma_W \odot \epsilon_W) x + (b + \sigma_b \odot \epsilon_b)\), where \(\epsilon\) is random noise (e.g. Gaussian) and the \(\sigma\) are learnable parameters.
- Use factorized Gaussian noise to reduce the number of random samples: e.g. \(\epsilon_{i,j} = f(\epsilon_i) \cdot f(\epsilon_j)\) with \(f\) chosen so that the product has zero mean and unit variance.
- Compare exploration (e.g. unique states visited, or variance of actions over time) with \(\epsilon\)-greedy DQN.

Concept and real-world RL ...
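The factorized-noise idea can be sketched in pure numpy; the layer sizes and \(\sigma\) values below are illustrative assumptions, and \(f(x) = \mathrm{sign}(x)\sqrt{|x|}\) is the usual choice from the NoisyNet paper.

```python
import numpy as np

def f(x):
    # Scaling function for factorized noise: sign(x) * sqrt(|x|).
    return np.sign(x) * np.sqrt(np.abs(x))

def noisy_linear(x, W, b, sigma_W, sigma_b, rng):
    """One forward pass of a noisy linear layer with factorized noise."""
    n_out, n_in = W.shape
    eps_in = f(rng.normal(size=n_in))     # one draw per input unit
    eps_out = f(rng.normal(size=n_out))   # one draw per output unit
    eps_W = np.outer(eps_out, eps_in)     # n_in + n_out draws, not n_in*n_out
    eps_b = eps_out
    return (W + sigma_W * eps_W) @ x + (b + sigma_b * eps_b)

rng = np.random.default_rng(0)
x = np.ones(3)
W = rng.normal(size=(2, 3))
b = np.zeros(2)
sigma_W = np.full((2, 3), 0.5)
sigma_b = np.full(2, 0.5)
y1 = noisy_linear(x, W, b, sigma_W, sigma_b, rng)
y2 = noisy_linear(x, W, b, sigma_W, sigma_b, rng)
# Fresh noise each call gives different outputs: the exploration signal.
```

With all \(\sigma\) set to zero the layer reduces to an ordinary linear layer, which is a handy sanity check when implementing this inside a network.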

March 10, 2026 · 4 min · 642 words · codefrydev

Chapter 30: Rainbow DQN

Learning objectives

- Combine the Rainbow components: Double DQN, dueling architecture, prioritized replay, noisy networks, and optionally multi-step returns (and distributional RL).
- Train on a challenging environment (e.g. Pong or another Atari-style env) and compare with a baseline DQN.
- Understand which components contribute most to sample efficiency and stability.

Concept and real-world RL

Rainbow (Hessel et al.) combines several DQN improvements: Double DQN (reduce overestimation), dueling (value + advantage), PER (replay important transitions), noisy nets (state-dependent exploration), multi-step returns (n-step learning), and optionally C51 (distributional RL). Together they improve sample efficiency and final performance on Atari. In practice, you do not need all components for every task; CartPole may be solved with vanilla DQN, while harder games benefit from the full stack. Implementing Rainbow is a capstone for the value-approximation volume. ...
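One small ingredient worth seeing in isolation is the multi-step return. A minimal sketch, assuming a short made-up reward sequence and a bootstrap value at the horizon:

```python
def n_step_return(rewards, bootstrap_v, gamma):
    """G = r_0 + gamma*r_1 + ... + gamma^{n-1}*r_{n-1} + gamma^n * V(s_n),
    computed backwards from the bootstrap value."""
    g = bootstrap_v
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Illustrative numbers: 3-step return with gamma = 0.5 and V(s_3) = 5.
g = n_step_return([1.0, 0.0, 2.0], bootstrap_v=5.0, gamma=0.5)
# g = 1 + 0.5*(0 + 0.5*(2 + 0.5*5)) = 2.125
```

In a Rainbow-style agent this n-step target replaces the one-step target \(r + \gamma \max_a Q(s', a)\); the replay buffer then stores \(n\)-step transitions.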

March 10, 2026 · 3 min · 586 words · codefrydev

Chapter 31: Introduction to Policy-Based Methods

Learning objectives

- Explain when a stochastic policy (outputting a distribution over actions) is essential versus when a deterministic policy suffices.
- Give a real-world scenario where a deterministic policy would fail (e.g. games with hidden information, adversarial settings).
- Relate stochastic policies to exploration and to game AI or recommendation settings where diversity matters.

Concept and real-world RL

Policy-based methods directly parameterize and optimize the policy \(\pi(a|s;\theta)\) instead of learning a value function and deriving actions from it. A stochastic policy outputs a probability distribution over actions; a deterministic policy always picks the same action in a given state. In game AI, when the opponent can observe or anticipate your move (e.g. poker, rock-paper-scissors), a deterministic policy is exploitable: the opponent will always know what you do. A stochastic policy keeps the opponent uncertain and is essential for mixed strategies. In recommendation, showing a deterministic "best" item every time can create filter bubbles; stochastic policies (or sampling from a distribution) encourage exploration and diversity. For robot navigation in partially observable or noisy settings, randomness can help escape local minima or handle uncertainty. ...

March 10, 2026 · 3 min · 547 words · codefrydev

Chapter 32: The Policy Objective Function

Learning objectives

- Write the policy gradient theorem for a simple one-step MDP: the gradient of expected reward with respect to the policy parameters.
- Show that \(\nabla_\theta \mathbb{E}[R] = \mathbb{E}[ \nabla_\theta \log \pi(a|s;\theta) \, Q^\pi(s,a) ]\) (or the equivalent for one step).
- Recognize why this form is useful: we can estimate the expectation from samples (trajectories) without knowing the transition model.

Concept and real-world RL

In policy gradient methods we maximize the expected return \(J(\theta) = \mathbb{E}_\pi[G]\) by gradient ascent on \(\theta\). The policy gradient theorem says that \(\nabla_\theta J\) can be written as an expectation over states and actions under \(\pi\), involving \(\nabla_\theta \log \pi(a|s;\theta)\) and the return (or Q). For a one-step MDP (one state, one action, one reward), the derivation is simple: \(J = \sum_a \pi(a|s) r(s,a)\), so \(\nabla_\theta J = \sum_a \nabla_\theta \pi(a|s) \, r(s,a)\). Using the log-derivative trick \(\nabla \pi = \pi \nabla \log \pi\), we get \(\mathbb{E}[ \nabla \log \pi(a|s) \, Q(s,a) ]\). In robot control or game AI, we rarely have the full model; this identity lets us estimate the gradient from sampled actions and rewards only. ...
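The one-step identity can be checked numerically. The sketch below uses a hypothetical 3-action softmax policy with made-up rewards and compares the finite-difference gradient of \(J\) with the exact score-function expectation:

```python
import numpy as np

r = np.array([1.0, 0.0, 2.0])        # r(s, a) for the single state
theta = np.array([0.1, -0.2, 0.3])   # softmax logits (policy parameters)

def pi(th):
    e = np.exp(th - th.max())
    return e / e.sum()

def J(th):
    return pi(th) @ r                # expected one-step reward

# Direct gradient of J via central finite differences.
eps = 1e-6
grad_fd = np.array([
    (J(theta + eps * np.eye(3)[k]) - J(theta - eps * np.eye(3)[k])) / (2 * eps)
    for k in range(3)
])

# Score-function form E_pi[ grad log pi(a|s) * r(s,a) ], computed exactly:
# for a softmax, grad_theta log pi(a) = e_a - pi (one-hot minus probabilities).
p = pi(theta)
grad_log = np.eye(3) - p
grad_sf = sum(p[a] * r[a] * grad_log[a] for a in range(3))
```

The two gradients agree, which is the whole point: in a real problem we cannot differentiate \(J\) directly, but we can average \(\nabla_\theta \log \pi(a|s)\, r\) over sampled actions.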

March 10, 2026 · 3 min · 585 words · codefrydev

Chapter 33: The REINFORCE Algorithm

Learning objectives

- Implement REINFORCE (Monte Carlo policy gradient): estimate \(\nabla_\theta J\) using the return \(G_t\) from full episodes.
- Use a neural network policy with a softmax output for discrete actions (e.g. CartPole).
- Observe and explain the high variance of gradient estimates when using raw returns \(G_t\) (no baseline).

Concept and real-world RL

REINFORCE is the simplest policy gradient algorithm: run an episode under \(\pi_\theta\), compute the return \(G_t\) from each step, and update \(\theta\) with \(\theta \leftarrow \theta + \alpha \sum_t G_t \nabla_\theta \log \pi(a_t|s_t)\). It is on-policy and Monte Carlo (it needs full episodes). The variance of \(G_t\) can be large, especially in long episodes, which makes learning slow or unstable. In game AI, REINFORCE is a baseline for more advanced methods (actor-critic, PPO); in robot control, it is rarely used alone because of its poor sample efficiency and high gradient variance. Adding a baseline (e.g. a state-value function) reduces variance without introducing bias. ...
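A minimal sketch of the REINFORCE update on a two-armed bandit, where each episode is a single step; the rewards, learning rate, and iteration count are illustrative assumptions, not a tuned setup.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                  # softmax logits over the two arms
alpha = 0.1
rewards = np.array([0.0, 1.0])       # arm 1 is always better

def pi(th):
    e = np.exp(th - th.max())
    return e / e.sum()

for _ in range(2000):
    p = pi(theta)
    a = rng.choice(2, p=p)           # sample an action from the policy
    g = rewards[a]                   # return of this one-step episode
    grad_log = -p
    grad_log[a] += 1.0               # grad_theta log pi(a) for a softmax
    theta += alpha * g * grad_log    # REINFORCE: theta += alpha * G * grad log pi
```

After training, the policy should put almost all probability on the rewarded arm; on CartPole the same update applies per time step, with \(G_t\) the discounted return from step \(t\).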

March 10, 2026 · 3 min · 602 words · codefrydev

Chapter 34: Reducing Variance in Policy Gradients

Learning objectives

- Add a state-value baseline \(V(s)\) to REINFORCE and explain why it reduces variance without introducing bias (when the baseline does not depend on the action).
- Train the baseline network (e.g. MSE to fit the returns \(G_t\)) alongside the policy.
- Compare the variance of gradient estimates (e.g. the magnitude of parameter updates or the variance of \(G_t - b(s_t)\)) with and without the baseline.

Concept and real-world RL

The policy gradient with a baseline is \(\mathbb{E}[ \nabla \log \pi(a|s) \, (G_t - b(s)) ]\). If \(b(s)\) does not depend on the action \(a\), this is still an unbiased estimate of \(\nabla J\); the baseline only changes the variance. A natural choice is \(b(s) = V^\pi(s)\), the expected return from state \(s\). Then the term \(G_t - V(s_t)\) is an estimate of the advantage (how much better this trajectory was than average). In game AI or robot control, lower-variance gradients mean faster and more stable learning; baselines are standard in actor-critic and PPO. ...
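The variance effect can be seen in a toy example: one state, two equiprobable actions with returns 9 and 11 (so \(V(s) = 10\)). The quantity of interest is the full gradient estimator \(\nabla \log \pi(a)\,(G - b)\); the numbers and policy below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.5])             # fixed softmax policy pi(a|s)
r = np.array([9.0, 11.0])            # one-step returns; V(s) = 10
grad_log = np.eye(2) - p             # row a = grad_theta log pi(a) for a softmax

def estimates(baseline, n=20_000):
    """Sample n single-step gradient estimates grad log pi(a) * (G - b)."""
    a = rng.choice(2, p=p, size=n)
    return grad_log[a] * (r[a] - baseline)[:, None]

no_base = estimates(0.0)             # raw returns, b = 0
with_base = estimates(10.0)          # b(s) = V(s)
# Same mean gradient (unbiased), far smaller variance with the baseline.
```

Because the two returns straddle \(V(s)\) symmetrically, the baselined estimator here is nearly constant per coordinate, while the raw-return version swings between values around \(\pm 5\).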

March 10, 2026 · 3 min · 593 words · codefrydev

Chapter 35: Actor-Critic Architectures

Learning objectives

- Sketch the architecture of a two-network actor-critic: an actor (policy \(\pi(a|s)\)) and a critic (value \(V(s)\) or \(Q(s,a)\)).
- Write pseudocode for the update steps using the TD error \(\delta = r + \gamma V(s') - V(s)\) as the advantage for the policy.
- Explain why the critic reduces variance compared to using Monte Carlo returns \(G_t\).

Concept and real-world RL

Actor-critic methods maintain two networks: the actor selects actions from \(\pi(a|s;\theta)\), and the critic estimates the value function \(V(s;w)\) (or Q). The TD error \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\) is a one-step estimate of the advantage; it is biased (because \(V\) is approximate) but has much lower variance than \(G_t\). The actor is updated with \(\nabla \log \pi(a_t|s_t) \, \delta_t\); the critic is updated to minimize \((r_t + \gamma V(s_{t+1}) - V(s_t))^2\). In robot control and game AI, actor-critic allows online, step-by-step updates instead of waiting for the episode to end, which speeds up learning. ...
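The update steps can be sketched with tabular parameters in place of the two networks; the tiny two-state environment, learning rates, and episode count below are all illustrative assumptions.

```python
import numpy as np

gamma = 0.9
alpha_actor, alpha_critic = 0.1, 0.2
V = np.zeros(2)                       # critic: tabular V(s)
theta = np.zeros((2, 2))              # actor: softmax logits per state

def pi(s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

def step(s, a):
    # Hypothetical dynamics: action 1 pays 1 and ends the episode;
    # action 0 pays 0 and moves to the other state.
    return (1.0, None) if a == 1 else (0.0, 1 - s)

rng = np.random.default_rng(0)
for _ in range(500):
    s = 0
    while s is not None:
        p = pi(s)
        a = rng.choice(2, p=p)
        r, s_next = step(s, a)
        v_next = 0.0 if s_next is None else V[s_next]
        delta = r + gamma * v_next - V[s]          # TD error = advantage estimate
        V[s] += alpha_critic * delta               # critic update
        grad_log = -p
        grad_log[a] += 1.0
        theta[s] += alpha_actor * delta * grad_log # actor update
        s = s_next
```

Each update happens immediately after one transition, with no need to wait for the episode return; the policy should come to prefer the rewarding action in both states.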

March 10, 2026 · 3 min · 577 words · codefrydev

Chapter 36: Advantage Actor-Critic (A2C)

Learning objectives

- Implement A2C (Advantage Actor-Critic): the actor is updated with the TD error as the advantage, and the critic is updated to minimize the TD error.
- Use the TD error \(r + \gamma V(s') - V(s)\) as the advantage (optionally with \(V(s').detach()\)).
- Run multiple environments synchronously to collect a batch of transitions and update on the batch (this reduces variance further).

Concept and real-world RL

A2C is the synchronous version of A3C: the agent runs \(N\) environments in parallel, collects a batch of transitions, and performs one update from the batch. The advantage is the TD error (or the n-step return minus \(V(s)\)). Synchronous batching makes the updates more stable than fully asynchronous A3C. In game AI and robot control, A2C is a simple and effective baseline; it is often used with a shared feature extractor (one backbone with actor and critic heads) to save parameters and improve learning. ...
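The synchronous-batch idea can be sketched with \(N\) copies of a trivial one-step environment; the sizes, rewards, and learning rates below are illustrative assumptions, not a real A2C implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8                                 # parallel environments
gamma = 0.99
theta = np.zeros(2)                   # shared policy logits (single state)
V = np.array([0.0])                   # critic for the single state
r_per_action = np.array([0.0, 1.0])   # action 1 is the good one

def pi(th):
    e = np.exp(th - th.max())
    return e / e.sum()

for _ in range(300):
    p = pi(theta)
    actions = rng.choice(2, p=p, size=N)      # one step in each environment
    rewards = r_per_action[actions]
    adv = rewards + gamma * 0.0 - V[0]        # TD error; episodes end here
    grad_log = np.eye(2)[actions] - p         # per-transition score vectors
    theta += 0.1 * (adv[:, None] * grad_log).mean(axis=0)  # batched actor step
    V[0] += 0.5 * adv.mean()                  # batched critic step
```

Averaging the gradient over the batch of \(N\) transitions before updating is what distinguishes synchronous A2C from the fully asynchronous per-worker updates of A3C.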

March 10, 2026 · 3 min · 566 words · codefrydev