Chapter 23: Deep Q-Networks (DQN)

Learning objectives

Implement full DQN: Q-network, target network, replay buffer, \(\epsilon\)-greedy, and the TD loss (MSE to target \(r + \gamma \max_{a'} Q_{target}(s',a')\)). Update the target network periodically (e.g. every 100 steps) by copying the online Q-network. Train on CartPole and plot reward per episode.

Concept and real-world RL

DQN combines a neural network for Q-values with experience replay (store transitions, sample random minibatches to break correlation) and a target network (a separate copy of the network used in the TD target, updated periodically, to stabilize learning). The agent acts \(\epsilon\)-greedy, stores \((s,a,r,s',\text{done})\) in the buffer, and repeatedly samples a batch, computes targets using the target network, and updates the online network by minimizing MSE. DQN was the first major deep RL success (Atari) and is still a standard baseline for discrete-action tasks. ...
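The pieces named above (\(\epsilon\)-greedy action selection, the TD target, and the MSE loss) can be sketched in a few lines. This is a minimal, dependency-free illustration and not the chapter's implementation: Q-values are plain Python lists standing in for network outputs, and the function names `epsilon_greedy`, `td_targets`, and `mse_loss` are hypothetical. It also assumes the common convention that terminal transitions do not bootstrap, so their target is just \(r\).

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_targets(rewards, next_q_target, dones, gamma=0.99):
    """Compute y = r + gamma * max_a' Q_target(s', a') for each transition.

    next_q_target[i] holds the target-network Q-values for the next state of
    transition i. Assumption: terminal transitions (done=True) use y = r.
    """
    targets = []
    for r, q_next, done in zip(rewards, next_q_target, dones):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(r + bootstrap)
    return targets


def mse_loss(q_taken, targets):
    """Mean squared error between Q(s, a) for the taken actions and the targets."""
    return sum((q - y) ** 2 for q, y in zip(q_taken, targets)) / len(targets)


# Two transitions: the second one ends the episode, so it does not bootstrap.
targets = td_targets(
    rewards=[1.0, 1.0],
    next_q_target=[[0.5, 2.0], [3.0, 1.0]],
    dones=[False, True],
    gamma=0.5,
)
loss = mse_loss(q_taken=[1.0, 1.0], targets=targets)
```

In a real agent the targets are treated as constants when differentiating the loss, so gradients flow only through the online network's \(Q(s,a)\).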

March 10, 2026 · 3 min · 545 words · codefrydev

Chapter 24: Experience Replay

Learning objectives

Implement a replay buffer that stores transitions \((s, a, r, s', \text{done})\) with a fixed capacity. Use a circular buffer (overwrite oldest when full) and random sampling for minibatches. Test the buffer with random data and verify shapes and sampling behavior.

Concept and real-world RL

Experience replay stores past transitions and samples random minibatches for training. It breaks the correlation between consecutive samples (which would cause unstable updates if we trained only on the last transition) and reuses data for sample efficiency. DQN and many off-policy algorithms rely on it. The buffer is usually a circular buffer: when full, new transitions overwrite the oldest. Uniform random sampling gives unbiased minibatches; advanced variants sample with prioritization instead. In practice, buffer size is a hyperparameter (e.g. 10k–1M): too small limits diversity, while too large uses more memory and can slow learning because the buffer keeps transitions from policies that differ a lot from the current one. ...
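A circular replay buffer along these lines might look like the sketch below. It is one possible layout, not the chapter's code: the class name `ReplayBuffer` and its method names are assumptions, transitions are stored as plain tuples, and sampling is uniform without replacement.

```python
import random


class ReplayBuffer:
    """Fixed-capacity circular buffer of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.storage = []
        self.position = 0  # index the next write goes to

    def push(self, state, action, reward, next_state, done):
        transition = (state, action, reward, next_state, done)
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            # Buffer is full: overwrite the oldest transition.
            self.storage[self.position] = transition
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size, rng=random):
        """Uniformly sample batch_size distinct transitions, grouped by field."""
        batch = rng.sample(self.storage, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.storage)


# Capacity 3, five pushes: the transitions with states 0 and 1 get overwritten.
buffer = ReplayBuffer(capacity=3)
for i in range(5):
    buffer.push(state=i, action=0, reward=1.0, next_state=i + 1, done=False)
states, actions, rewards, next_states, dones = buffer.sample(batch_size=2)
```

Returning the batch grouped by field (all states, all actions, and so on) makes it easy to convert each group into an array or tensor before computing targets.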

March 10, 2026 · 3 min · 596 words · codefrydev

Chapter 25: Target Networks

Learning objectives

Implement hard target updates: copy online network parameters to the target network every \(N\) steps. Implement soft target updates: \(\theta_{target} \leftarrow \tau \theta_{online} + (1-\tau) \theta_{target}\) each step (or each update). Compare stability of Q-value estimates and learning curves for both update rules.

Concept and real-world RL

The target network in DQN provides a stable TD target: we use \(Q_{target}(s',a')\) instead of \(Q(s',a')\) so that the target does not change every time we update the online network, which would cause moving targets and instability. Hard update: copy full parameters every \(N\) steps (classic DQN). Soft update: slowly track the online network: \(\theta_{target} \leftarrow \tau \theta_{online} + (1-\tau) \theta_{target}\) with small \(\tau\) (e.g. 0.001), so each step moves the target only a fraction \(\tau\) of the way toward the online parameters. Soft updates change the target every step but by a small amount, often yielding smoother learning. Both are used in practice (e.g. DDPG uses soft updates). ...
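Both update rules can be sketched on flat lists of floats standing in for network parameters. This is an illustrative sketch, not the chapter's code: `hard_update` and `soft_update` are hypothetical names, and the soft update follows the convention in which a small \(\tau\) weights the online parameters, so the target moves a fraction \(\tau\) toward the online network on each call.

```python
def hard_update(target_params, online_params):
    """Copy the online parameters into the target in place (classic DQN)."""
    target_params[:] = online_params


def soft_update(target_params, online_params, tau=0.001):
    """Move each target parameter a fraction tau toward its online value."""
    for i, (t, o) in enumerate(zip(target_params, online_params)):
        target_params[i] = tau * o + (1.0 - tau) * t


online = [1.0, 2.0]

# Soft update: tau = 0.5 is used here only so the numbers are easy to follow.
soft_target = [0.0, 0.0]
soft_update(soft_target, online, tau=0.5)

# Hard update every N steps inside a training loop (N = 100 as an example).
hard_target = [0.0, 0.0]
N = 100
for step in range(1, 201):
    if step % N == 0:
        hard_update(hard_target, online)
```

With `tau=1.0` the soft update reduces to a hard update, which is a quick sanity check that the weights are on the right terms.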

March 10, 2026 · 3 min · 596 words · codefrydev

Chapter 26: Double DQN (DDQN)

Learning objectives

Implement Double DQN: use the online network to choose \(a^* = \arg\max_a Q_{online}(s',a)\), then use \(Q_{target}(s', a^*)\) as the TD target (instead of \(\max_a Q_{target}(s',a)\)). Understand why this reduces overestimation of Q-values (max of estimates is biased high). Compare average Q-values and reward curves with standard DQN on CartPole.

Concept and real-world RL

Standard DQN uses \(y = r + \gamma \max_{a'} Q_{target}(s',a')\). The max over noisy estimates is biased upward (overestimation), which can hurt learning. Double DQN decouples action selection from evaluation: the online network selects \(a^*\), the target network evaluates \(Q_{target}(s', a^*)\). This reduces overestimation and often improves stability and final performance. It is a small code change and is commonly used in modern DQN variants (e.g. Rainbow). ...
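The difference between the two targets is small enough to show side by side. The sketch below is illustrative only: Q-values are plain lists standing in for network outputs at the next state, the function names are hypothetical, and terminal transitions are assumed to use \(y = r\).

```python
def dqn_target(reward, next_q_target, done, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates the action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, done, gamma=0.99):
    """Double DQN: the online network selects a*, the target network evaluates it."""
    if done:
        return reward
    a_star = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[a_star]


# The two networks disagree about the best next action.
next_q_online = [1.0, 3.0]   # online network prefers action 1
next_q_target = [4.0, 2.0]   # target network's largest estimate is for action 0

y_dqn = dqn_target(1.0, next_q_target, done=False, gamma=0.5)
y_ddqn = double_dqn_target(1.0, next_q_online, next_q_target, done=False, gamma=0.5)
```

Because \(Q_{target}(s', a^*)\) can never exceed \(\max_a Q_{target}(s',a)\), the Double DQN target is always less than or equal to the standard one for the same inputs.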

March 10, 2026 · 3 min · 523 words · codefrydev

Chapter 30: Rainbow DQN

Learning objectives

Combine Rainbow components: Double DQN, Dueling architecture, Prioritized replay, Noisy networks, and optionally multi-step returns (and distributional RL). Train on a challenging environment (e.g. Pong or another Atari-style env) and compare with a baseline DQN. Understand which components contribute most to sample efficiency and stability.

Concept and real-world RL

Rainbow (Hessel et al.) combines several DQN improvements: Double DQN (reduce overestimation), Dueling (value + advantage), PER (replay important transitions), Noisy nets (state-dependent exploration), multi-step returns (n-step learning), and optionally C51 (distributional RL). Together they improve sample efficiency and final performance on Atari. In practice, you do not need all components for every task; CartPole may be solved with vanilla DQN, while harder games benefit from the full stack. Implementing Rainbow is a capstone for the value-approximation volume. ...
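Two of the listed components reduce to short formulas, sketched here under stated assumptions. The dueling head is shown in its mean-subtracted form, \(Q(s,a) = V(s) + A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a')\), which is one common aggregation and may differ from the chapter's choice. The n-step target accumulates \(n\) discounted rewards and then bootstraps from a value estimate. Function names and inputs are hypothetical.

```python
def dueling_q(value, advantages):
    """Combine a state value and per-action advantages into Q-values.

    Assumption: mean-subtracted aggregation, Q = V + A - mean(A).
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def n_step_target(rewards, bootstrap_value, gamma=0.99, done=False):
    """Return r_0 + gamma*r_1 + ... + gamma^n * bootstrap_value.

    rewards holds the n rewards collected after (s, a). If the episode ended
    within those n steps (done=True), the bootstrap term is dropped.
    """
    g = 0.0 if done else bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


q_values = dueling_q(value=1.0, advantages=[1.0, 3.0])
y = n_step_target(rewards=[1.0, 1.0], bootstrap_value=4.0, gamma=0.5)
```

In a full agent, `bootstrap_value` would itself come from the Double DQN rule applied to the state reached after \(n\) steps.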

March 10, 2026 · 3 min · 586 words · codefrydev

Chapter 61: The Hard Exploration Problem

Learning objectives

Run DQN with ε-greedy on a sparse-reward environment (e.g. Montezuma’s Revenge if available, or a simple maze). Observe that the agent rarely discovers the first key (or goal) when rewards are sparse. Explain why sparse rewards cause failure: no learning signal until the goal is reached; random exploration is unlikely to reach it.

Concept and real-world RL

Hard exploration occurs when the reward is sparse (e.g. only at the goal): the agent gets no signal until it accidentally reaches the goal, which may require a long, specific sequence of actions. In game AI (Montezuma’s Revenge, Pitfall), ε-greedy DQN fails because random exploration almost never finds the key. In robot navigation and recommendation, sparse rewards (e.g. “user clicked” or “reached goal”) similarly make learning slow. This motivates intrinsic motivation, curiosity, and hierarchical methods. ...
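A back-of-the-envelope calculation shows why random exploration struggles. The sketch below makes a deliberately simplified assumption that is not from the chapter: the environment is deterministic and exactly one sequence of \(k\) actions reaches the reward, so a uniformly random policy succeeds in an episode with probability \(|A|^{-k}\). The numbers (4 actions, 20 steps) are hypothetical.

```python
def success_probability(num_actions, sequence_length):
    """Chance that uniform random actions match one specific action sequence."""
    return (1.0 / num_actions) ** sequence_length


def expected_episodes(num_actions, sequence_length):
    """Expected number of episodes until the first success (geometric mean)."""
    return 1.0 / success_probability(num_actions, sequence_length)


# Hypothetical maze: 4 actions, and the goal needs 20 specific moves in a row.
p = success_probability(num_actions=4, sequence_length=20)
episodes = expected_episodes(num_actions=4, sequence_length=20)
print(f"success probability per episode: {p:.3e}")
print(f"expected episodes to first reward: {episodes:.3e}")
```

Real environments usually admit more than one successful path, so this overstates the difficulty, but the exponential dependence on sequence length is the point.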

March 10, 2026 · 3 min · 489 words · codefrydev

Function Approximation and Deep RL

This page covers function approximation and deep RL concepts you need for the preliminary assessment: why we need FA, the policy gradient update, exploration in DQN, experience replay, and the advantage of actor-critic.

Why this matters for RL

In large or continuous state spaces we cannot store a value per state; we use a parameterized function (e.g. neural network) to approximate values or policies. That leads to policy gradient methods (maximize return) and value-based methods with FA (e.g. DQN). DQN uses experience replay and exploration (e.g. ε-greedy); actor-critic combines a policy (actor) and a value function (critic) for lower-variance policy gradients. You need to understand why FA is necessary and how these pieces fit together. ...

March 10, 2026 · 7 min · 1400 words · codefrydev

Phase 4 Deep RL Quiz

Use this quiz after completing Volumes 3–5 (or the Phase 4 coding challenges). If you can answer at least 9 of 12 correctly, you are ready for Phase 5 and Volume 6.

1. Function approximation

Q: Why is function approximation necessary in RL for large or continuous state spaces?

Answer: Tabular methods store one value per state (or state-action); the number of states can be huge or infinite. Function approximation uses a parameterized function (e.g. neural network) so a fixed number of parameters represent values for all states and generalize from seen to unseen states. ...

March 10, 2026 · 4 min · 814 words · codefrydev