Chapter 72: Conservative Q-Learning (CQL)
Learning objectives

- Implement the CQL loss: add a term that penalizes Q-values for actions drawn from the current policy (or a uniform distribution) so that Q is lower for out-of-distribution (OOD) actions.
- Apply CQL to the offline dataset from Chapter 71 and train an offline SAC (or a similar agent) with the CQL regularizer.
- Compare the learned policy's evaluation return and Q-values with those of naive SAC trained on the same dataset.
- Explain why penalizing Q for OOD actions helps avoid overestimation and improves offline performance.
- Relate CQL to recommendation and healthcare settings, where we must learn from fixed logs without overestimating unseen actions.

Concept and real-world RL ...
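The first objective can be sketched numerically. A minimal NumPy sketch of the conservative penalty (the names `cql_penalty`, `q_all_actions`, and `q_data_actions` are illustrative, not from the chapter): the penalty is alpha times the gap between a log-sum-exp over Q-values of broadly sampled candidate actions (a soft maximum, which upper-bounds the Q of any OOD action) and the Q-values of the actions actually logged in the dataset. Minimizing it pushes Q down on OOD actions and up on in-distribution ones.

```python
import numpy as np

def cql_penalty(q_all_actions, q_data_actions, alpha=1.0):
    """Conservative (CQL-style) regularizer, sketched for a batch of states.

    q_all_actions : (batch, num_candidates) Q-values for sampled candidate
                    actions (e.g. from the current policy or uniform sampling).
    q_data_actions: (batch,) Q-values for the actions logged in the dataset.
    alpha         : penalty weight; larger alpha means more conservative Q.
    """
    # Numerically stable log-sum-exp over candidate actions per state:
    # a soft maximum of Q over the sampled action set.
    m = q_all_actions.max(axis=1, keepdims=True)
    lse = m.squeeze(1) + np.log(np.exp(q_all_actions - m).sum(axis=1))
    # Push down the soft max over candidates, push up the dataset actions.
    return alpha * (lse - q_data_actions).mean()

# Toy check: when the dataset action already has the highest Q in each row,
# the penalty is small but positive (log-sum-exp exceeds the true max).
q_all = np.array([[1.0, 2.0], [3.0, 4.0]])
q_data = np.array([2.0, 4.0])
print(cql_penalty(q_all, q_data))  # small positive value
```

In a full training loop this term would be added to the usual Bellman (TD) loss of the critic; the sketch isolates only the conservative part so its effect is easy to inspect.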