Chapter 24: Experience Replay
Learning objectives

- Implement a replay buffer that stores transitions \((s, a, r, s', \text{done})\) with a fixed capacity.
- Use a circular buffer (overwrite the oldest transition when full) and random sampling for minibatches.
- Test the buffer with random data and verify shapes and sampling behavior.

Concept and real-world RL

Experience replay stores past transitions and samples random minibatches for training. It breaks the correlation between consecutive samples (which would cause unstable updates if we trained only on the most recent transition) and reuses data for sample efficiency. DQN and many off-policy algorithms rely on it.

The buffer is usually a circular buffer: when full, new transitions overwrite the oldest. Sampling uniformly at random (or with prioritization in advanced variants) gives unbiased minibatches. In practice, buffer size is a hyperparameter (e.g. 10k–1M); too small limits diversity, too large uses more memory and can slow learning if the policy has changed substantially. ...
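The objectives above can be sketched as follows. This is a minimal illustrative implementation (class and method names are my own, not prescribed by the chapter): a fixed-size list acts as the circular buffer, a write index wraps around with modular arithmetic so new transitions overwrite the oldest, and `sample` draws a uniform random minibatch and stacks it into NumPy arrays.

```python
import random
import numpy as np

class ReplayBuffer:
    """Fixed-capacity circular buffer of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = [None] * capacity  # preallocated slots
        self.next_idx = 0                 # slot to write next (wraps around)
        self.size = 0                     # number of valid entries so far

    def add(self, s, a, r, s_next, done):
        # Overwrite the oldest transition once the buffer is full.
        self.storage[self.next_idx] = (s, a, r, s_next, done)
        self.next_idx = (self.next_idx + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        # Uniform random minibatch without replacement from valid entries.
        idxs = random.sample(range(self.size), batch_size)
        s, a, r, s_next, done = zip(*(self.storage[i] for i in idxs))
        return (np.stack(s), np.array(a),
                np.array(r, dtype=np.float32),
                np.stack(s_next), np.array(done, dtype=np.float32))
```

A quick sanity test with random data: fill a capacity-100 buffer with 150 transitions (so the first 50 are overwritten), then sample a batch of 32 and check that the stacked arrays have the expected shapes.

```python
buf = ReplayBuffer(capacity=100)
for i in range(150):
    buf.add(np.random.randn(4), i % 2, float(i), np.random.randn(4), i % 10 == 0)

s, a, r, s_next, done = buf.sample(32)
print(buf.size)       # 100: capped at capacity
print(s.shape)        # (32, 4)
print(r.shape)        # (32,)
```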