Chapter 24: Experience Replay

Learning objectives

Implement a replay buffer that stores transitions \((s, a, r, s’, \text{done})\) with a fixed capacity.
Use a circular buffer (overwrite oldest when full) and random sampling for minibatches.
Test the buffer with random data and verify shapes and sampling behavior.

Concept and real-world RL

Experience replay stores past transitions and samples random minibatches for training. It breaks the correlation between consecutive samples (which would cause unstable updates if we trained only on the last transition) and reuses data for sample efficiency. DQN and many off-policy algorithms rely on it. The buffer is usually a circular buffer: when full, new transitions overwrite the oldest. Sampling uniformly at random (or with prioritization in advanced variants) gives unbiased minibatches. In practice, buffer size is a hyperparameter (e.g. 10k–1M); too small limits diversity, too large uses more memory and can slow learning if the policy has changed a lot.

Illustration (buffer fill): As the agent interacts with the env, the replay buffer fills until it reaches capacity, then the oldest transitions are overwritten. The chart below shows how buffer size grows over the first 15k steps (capacity 10k).

Exercise: Code a replay buffer class with methods push(state, action, reward, next_state, done) and sample(batch_size). Ensure it uses a circular buffer and random sampling. Test it with random data.

Professor’s hints

Storage: use a list of tuples or a NumPy array. For a list, when len(buffer) >= capacity, pop the first (or use a deque with maxlen). For an array, keep an index pos and do buffer[pos % capacity] = (s, a, r, s', done); increment pos.
Sample: use random.sample(range(len(buffer)), min(batch_size, len(buffer))) to get indices, then return a list or batch of (s, a, r, s’, done). For batching with PyTorch, stack states into a tensor (batch, state_dim), etc.
Test: push 100 random transitions (e.g. state = random vector, action = 0 or 1, etc.). Sample 32; check that you get 32 transitions and that states/actions have the expected shapes. Push 1000 more and confirm old ones are dropped (if you track which were added when).

Common pitfalls

Sampling before buffer has batch_size: If len(buffer) < batch_size, sample only len(buffer) or wait until enough. Many implementations sample min(batch_size, len(buffer)) and the training loop checks len(buffer) >= batch_size before updating.
Mutable state: If you store references to state/next_state and the env reuses the same array, you can corrupt the buffer. Copy states when pushing: buffer.push(s.copy(), a, r, s_next.copy(), done) (or torch.clone for tensors).
Order: Circular buffer does not preserve order of insertion in the storage; that is fine. Sampling is random anyway.

Worked solution (warm-up: replay buffer length)

Warm-up: After pushing 50 transitions with capacity 100, what is len(buffer)? After pushing 150 total? Answer: After 50: len(buffer) = 50. After 150: with a circular buffer of capacity 100, we keep only the most recent 100, so len(buffer) = 100. The buffer is full and the oldest 50 have been overwritten. This is why we need a fixed capacity and an index that wraps; sampling is uniform over the current buffer contents.

Extra practice

Warm-up: After pushing 50 transitions with capacity 100, what is len(buffer)? After pushing 150 total, what is len(buffer)? (50, then 100 if circular.)
Coding: Implement a replay buffer class: push(transition), sample(batch_size) returning a batch of (s, a, r, s’, done). Use a circular buffer (deque or list with index). Test with 100 pushes and 32 samples.
Challenge: Add a method sample_sequential(n) that returns n consecutive transitions (e.g. for n-step learning or RNN). Ensure you do not cross an episode boundary (done=True) if you want full n-step returns.
Variant: Reduce capacity to 10 and push 50 transitions. What is the minimum and maximum age of a transition in the buffer? How does a very small buffer affect learning diversity?
Debug: The code below stores a reference to the state array instead of a copy, so all entries end up pointing to the same (most recently mutated) state. Fix it.

Try it — edit and run (Shift+Enter)

Conceptual: Why does random sampling from a replay buffer reduce the correlation between training samples? What happens if we train only on the most recent transitions?
Recall: Explain in one sentence why experience replay improves sample efficiency in off-policy RL.