Chapter 28: Prioritized Experience Replay (PER)

Learning objectives

- Implement prioritized replay: assign each transition a priority (e.g. the TD error \(|\delta|\)) and sample it with probability proportional to \(p_i^\alpha\).
- Use a sum tree (or a simpler alternative) for efficient sampling and priority updates.
- Apply importance-sampling weights \(w_i = (N \cdot P(i))^{-\beta} / \max_j w_j\) to correct the bias introduced by non-uniform sampling.

Concept and real-world RL

Prioritized Experience Replay (PER) samples transitions with probability proportional to their "priority", often the absolute TD error, so that surprising or informative transitions are replayed more often. This can speed up learning but introduces bias: the update distribution is no longer the uniform replay distribution. Importance-sampling weights correct for this by scaling each gradient update so that, in expectation, we recover the uniform case. A sum tree allows O(log N) sampling and priority updates. PER is used in Rainbow and other sample-efficient DQN variants. ...
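A minimal sketch of the two ingredients above: a sum tree for O(log N) proportional sampling, and the importance-sampling weight \(w_i = (N \cdot P(i))^{-\beta} / \max_j w_j\). The names `SumTree` and `importance_weights` are illustrative, not from a particular library; the sketch assumes the capacity is a power of two and that stored priorities are already \(p_i^\alpha\). Normalizing by the max weight within the sampled batch (rather than over the whole buffer) is a common simplification, also an assumption here.

```python
class SumTree:
    """Binary sum tree: leaf i stores priority p_i^alpha; each internal
    node stores the sum of its children, so the root is the total mass.
    Assumes `capacity` is a power of two (illustrative simplification)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)  # 1-indexed implicit heap

    def update(self, idx, priority):
        """Set leaf `idx` to `priority` and refresh ancestor sums: O(log N)."""
        i = idx + self.capacity
        self.tree[i] = priority
        i //= 2
        while i >= 1:
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def total(self):
        return self.tree[1]

    def sample(self, value):
        """Descend from the root to find the leaf whose prefix-sum interval
        contains `value`; caller draws `value` uniformly in [0, total())."""
        i = 1
        while i < self.capacity:
            left = 2 * i
            if value < self.tree[left]:
                i = left
            else:
                value -= self.tree[left]
                i = left + 1
        return i - self.capacity


def importance_weights(sampled_priorities, total_priority, buffer_size, beta):
    """w_i = (N * P(i))^{-beta}, normalized by the max weight in the batch."""
    probs = [p / total_priority for p in sampled_priorities]
    weights = [(buffer_size * p) ** (-beta) for p in probs]
    max_w = max(weights)
    return [w / max_w for w in weights]
```

With priorities 1.0 and 3.0 in a buffer of capacity 4, the second transition is sampled three times as often, and its IS weight is correspondingly smaller than 1.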

March 10, 2026 · 3 min · 633 words · codefrydev

Chapter 30: Rainbow DQN

Learning objectives

- Combine the Rainbow components: Double DQN, dueling architecture, prioritized replay, noisy networks, and optionally multi-step returns (and distributional RL).
- Train on a challenging environment (e.g. Pong or another Atari-style env) and compare with a baseline DQN.
- Understand which components contribute most to sample efficiency and stability.

Concept and real-world RL

Rainbow (Hessel et al.) combines several DQN improvements: Double DQN (reduces overestimation), dueling networks (separate value and advantage streams), PER (replays important transitions more often), noisy nets (state-dependent exploration), multi-step returns (n-step learning), and optionally C51 (distributional RL). Together they improve sample efficiency and final performance on Atari. In practice you do not need every component for every task: CartPole can be solved with vanilla DQN, while harder games benefit from the full stack. Implementing Rainbow is a capstone for the value-approximation volume. ...
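Two of the components above compose directly in the bootstrap target: the n-step return supplies the reward sum, and Double DQN decouples action selection (online net) from evaluation (target net). A minimal sketch, with illustrative function names and Q-values passed in as plain lists rather than network outputs:

```python
def n_step_return(rewards, gamma):
    """Discounted n-step reward sum: G = sum_{k=0}^{n-1} gamma^k * r_k."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g


def double_dqn_target(n_step_g, gamma, n, q_online_next, q_target_next, done):
    """Rainbow-style n-step Double DQN target:
    G + gamma^n * Q_target(s_{t+n}, argmax_a Q_online(s_{t+n}, a)).
    The online net picks the action; the target net evaluates it."""
    if done:
        return n_step_g
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return n_step_g + (gamma ** n) * q_target_next[a_star]
```

For example, with rewards [1, 1, 1] and gamma = 0.9 the 3-step return is 1 + 0.9 + 0.81 = 2.71; if the online net prefers action 1 at the bootstrap state, the target evaluates that action even when the target net would itself rank action 0 higher, which is the overestimation fix Double DQN contributes to the stack.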

March 10, 2026 · 3 min · 586 words · codefrydev