Chapter 72: Conservative Q-Learning (CQL)

Learning objectives

Implement the CQL loss: add a term that penalizes Q-values for actions drawn from the current policy (or a uniform distribution) so that Q is lower for out-of-distribution actions.
Apply CQL to the offline dataset from Chapter 71 and train an offline SAC (or similar) with the CQL regularizer.
Compare the learned policy’s evaluation return and Q-values with those of naive SAC trained on the same dataset.
Explain why penalizing Q for OOD actions helps avoid overestimation and improves offline performance.
Relate CQL to recommendation and healthcare where we must learn from fixed logs without overestimating unseen actions.

Concept and real-world RL

Conservative Q-Learning (CQL) modifies the Q-learning objective so that Q-values for out-of-distribution (OOD) actions are penalized (pushed down). The typical formulation adds a term that increases the Q-loss when Q(s, a) is large for actions a sampled from the current policy (or a uniform distribution), while keeping Q(s, a) accurate for (s, a) in the dataset. This reduces overestimation for actions the agent would take but that are under-represented in the data. In recommendation and healthcare, we need to learn from historical data without recommending or prescribing actions that the data does not support; CQL-style conservatism is one approach.
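The description above can be written compactly. One common statement of the objective, corresponding to what the CQL paper calls CQL(H), is sketched here. The symbols \(\hat{\pi}_\beta\) (the empirical behavior policy implied by the dataset \(\mathcal{D}\)) and \(\hat{\mathcal{B}}^{\pi}\) (the empirical Bellman backup under the current policy) are notation introduced for this sketch and are not defined elsewhere in the chapter. For continuous actions the sum over \(a\) becomes an integral, which is approximated with sampled actions in practice.

```latex
\[
\min_{Q}\;
\alpha\, \mathbb{E}_{s \sim \mathcal{D}}
\Big[ \log \sum_{a} \exp Q(s,a)
      \;-\; \mathbb{E}_{a \sim \hat{\pi}_\beta(\cdot \mid s)} \big[ Q(s,a) \big] \Big]
\;+\;
\tfrac{1}{2}\, \mathbb{E}_{(s,a,s') \sim \mathcal{D}}
\Big[ \big( Q(s,a) - \hat{\mathcal{B}}^{\pi} \hat{Q}(s,a) \big)^{2} \Big]
\]
```

The first bracket is the conservative regularizer: the log-sum-exp acts as a soft maximum that pushes down the largest Q-values, and the subtracted expectation pushes Q back up on dataset actions. The second term is the usual squared Bellman error.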

Where you see this in practice: CQL and related offline RL algorithms (e.g. BRAC, TD3+BC); safe policy learning from logs.

Illustration (CQL vs naive): CQL penalizes Q-values for OOD actions, reducing overestimation.

[Chart: mean Q-value after training, CQL vs naive SAC on the same offline data.]

Exercise: Implement the CQL loss by adding a term that penalizes Q-values for out-of-distribution actions. Apply it to the offline dataset from Chapter 71 and compare with naive SAC.

Professor’s hints

CQL term: the form from the paper is E_s[ log sum_a exp(Q(s,a)) - E_{a~data}[ Q(s,a) ] ]. The log-sum-exp is a soft maximum over actions, so minimizing it pushes down whichever actions currently have the highest Q, while the second term pushes Q up on the actions that actually appear in the dataset. For continuous actions the sum is approximated with sampled actions (e.g. from the current policy and a uniform distribution). A common simplification is to add α * (E[Q(s,a_π)] - E[Q(s,a_data)]) to the critic loss, which raises the loss when Q is high for policy samples and lowers it when Q is high for data actions. Check the CQL paper for the exact form; the simplest version adds only α * mean(Q(s, a_π)), where a_π is sampled from the current policy.
α (regularization weight): Tune α. If it is too large, Q becomes too conservative (it underestimates); if it is too small, overestimation remains. Start with α=0.1–1.0 and sweep.
Use the same offline dataset as in Chapter 71 (random policy on Hopper) so you can directly compare CQL vs naive SAC evaluation return.
Keep the rest of SAC (actor loss, target network, etc.) unchanged; only modify the critic loss.
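The penalty terms described in the hints above can be sketched in a few lines. This is a minimal illustration in NumPy, not a full SAC critic: the function names (cql_logsumexp_penalty, cql_gap_penalty, critic_loss) and the toy Q-values are assumptions made for this example, and in a real training loop the same arithmetic would be written with your autodiff framework's tensors so that gradients flow into the critic.

```python
import numpy as np


def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along `axis`."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def cql_logsumexp_penalty(q_sampled, q_data):
    """Sampled version of E_s[ logsumexp_a Q(s,a) - Q(s, a_data) ].

    q_sampled: shape (batch, n_actions), Q at sampled actions for each state
    q_data:    shape (batch,), Q at the dataset action for each state
    """
    return float(np.mean(logsumexp(q_sampled, axis=1) - q_data))


def cql_gap_penalty(q_policy, q_data):
    """Simplified penalty: E[Q(s, a_pi)] - E[Q(s, a_data)]."""
    return float(np.mean(q_policy) - np.mean(q_data))


def critic_loss(td_loss, penalty, alpha):
    """Usual TD loss plus the weighted conservative penalty."""
    return td_loss + alpha * penalty


# Toy numbers (made up for illustration): 2 states, 3 sampled actions each.
q_sampled = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
q_data = np.array([1.0, 0.0])

full = cql_logsumexp_penalty(q_sampled, q_data)
gap = cql_gap_penalty(q_sampled[:, 0], q_data)
total = critic_loss(td_loss=0.5, penalty=full, alpha=1.0)
print(full, gap, total)
```

Both penalties are added to the critic loss with a positive weight α, so a larger Q on sampled (possibly OOD) actions always means a larger loss.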

Common pitfalls

Wrong sign of the penalty: CQL should lower Q for OOD actions; double-check that your added term increases the loss when Q(s, π(s)) is large, so that the gradient step reduces Q for those actions.
Over-regularization: If α is too large, Q becomes too small everywhere and the policy may become too conservative (e.g. only choose actions that appear very often in the data). Monitor both Q and evaluation return.
Sampling OOD actions: You need to sample actions from the current policy (or a uniform distribution over actions) for the penalty; use the actor to generate a for each s in the batch.
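The sign pitfall above is easy to test in isolation. The sketch below assumes a hypothetical linear critic Q(s, a) = w · φ(s, a) with random features standing in for (s, a_π) pairs; none of these names come from the chapter. It takes one gradient step with the penalty added to the loss and one with it mistakenly subtracted, then compares the mean Q on policy actions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear critic: Q(s, a) = w . phi(s, a)
phi_pi = rng.normal(size=(32, 4))  # stand-in features for (s, a_pi) pairs
w = rng.normal(size=4)
alpha, lr = 1.0, 0.1


def mean_q(weights, phi):
    """Mean Q-value over a batch of feature vectors."""
    return float(np.mean(phi @ weights))


# Penalty = alpha * mean(Q(s, a_pi)); its gradient w.r.t. w is alpha * mean(phi).
grad = alpha * phi_pi.mean(axis=0)

q_before = mean_q(w, phi_pi)
q_after_correct = mean_q(w - lr * grad, phi_pi)  # penalty ADDED to the loss
q_after_flipped = mean_q(w + lr * grad, phi_pi)  # penalty SUBTRACTED (the bug)

print(q_before, q_after_correct, q_after_flipped)
```

With the correct sign the mean Q on policy actions goes down after the step; with the flipped sign it goes up. A check like this on a single batch is a cheap way to catch the bug before a long training run.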

Worked solution (warm-up: CQL)

Key idea: CQL (Conservative Q-Learning) adds a regularizer that lowers Q-values for actions not in the dataset (e.g. sample \(a\) from the current policy and minimize \(Q(s,a)\)). So the learned Q is conservative: it is high only for (s,a) pairs that appear in the data. The policy then prefers in-distribution actions and avoids overestimated OOD actions. This stabilizes offline RL.
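To make the key idea concrete, here is a toy sketch on a one-state problem with three actions, where the log only ever contains actions 0 and 1. Everything in it (the rewards, the optimistic initial value for the unseen action 2, the function name train_q) is invented for illustration; it uses plain gradient descent on a tabular Q rather than SAC.

```python
import numpy as np


def train_q(alpha, steps=2000, lr=0.1):
    """Gradient descent on a 1-state, 3-action tabular Q with a CQL-style penalty."""
    actions = np.array([0, 0, 1, 1])          # logged actions; action 2 never appears
    rewards = np.array([1.0, 1.0, 0.5, 0.5])  # logged rewards (made-up numbers)
    q = np.array([0.0, 0.0, 2.0])             # action 2 starts out overestimated
    freq = np.bincount(actions, minlength=3) / len(actions)
    for _ in range(steps):
        # Gradient of 0.5 * mean((Q[a] - r)^2) over the logged transitions.
        grad = np.zeros(3)
        np.add.at(grad, actions, (q[actions] - rewards) / len(actions))
        # Gradient of alpha * (logsumexp(Q) - mean Q[a_data]) is alpha * (softmax(Q) - freq).
        p = np.exp(q - q.max())
        p /= p.sum()
        grad += alpha * (p - freq)
        q -= lr * grad
    return q


naive_q = train_q(alpha=0.0)
cql_q = train_q(alpha=1.0)
print("naive:", naive_q, "greedy action:", int(np.argmax(naive_q)))
print("CQL:  ", cql_q, "greedy action:", int(np.argmax(cql_q)))
```

Without the penalty, nothing in the data ever corrects the inflated value of the unseen action, so the greedy policy picks it. With the penalty, that value is pushed down and the greedy action comes from the logged data.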

Extra practice

Warm-up: In one sentence, why does penalizing Q(s, π(s)) during training help when the data was collected by a different policy?
Coding: Implement CQL with a simple penalty: loss += α * mean(Q(s, a_π)) where a_π ~ π(·|s). Train on the Chapter 71 offline Hopper dataset. Plot evaluation return vs α ∈ {0, 0.01, 0.1, 1.0}. Which α works best?
Challenge: Implement the full CQL loss from the paper (with log-sum-exp and data expectation). Compare sample efficiency and final return with your simplified penalty.
Variant: Use the same CQL code on a mixed dataset: 50% expert, 50% random transitions. Does CQL still outperform naive SAC? How does dataset composition change the optimal α?
Debug: CQL training is stable but evaluation return is worse than behavioral cloning. Logging shows α=10, causing Q-values to be uniformly near zero. The penalty is so strong that the Q-function can no longer distinguish good from bad actions. Describe how to tune α adaptively (e.g. keep Q on in-distribution actions ≥ some threshold).
Conceptual: CQL adds a penalty on Q(s, a) for out-of-distribution actions. Explain intuitively why this tends to produce a pessimistic estimate, i.e. a lower bound on the true value of the policy. When would a lower bound be too conservative and hurt performance?
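For the Debug item above, one possible way to tune α adaptively is a dual-ascent update in the spirit of the Lagrangian variant discussed in the CQL paper: raise α when the gap E[Q(s, a_π)] - E[Q(s, a_data)] exceeds a target, lower it otherwise. The sketch below is an assumption-laden illustration, not the paper's exact procedure: the function name update_alpha, the target gap tau, the learning rate, and the clipping bounds are all hypothetical choices, and it uses a gap-based target rather than the in-distribution threshold suggested in the exercise.

```python
import math


def update_alpha(log_alpha, gap, tau, lr=0.01, lo=-10.0, hi=5.0):
    """One dual-ascent step on alpha * (gap - tau), parameterized by log(alpha).

    gap: batch estimate of E[Q(s, a_pi)] - E[Q(s, a_data)]
    tau: target gap (hypothetical hyperparameter)
    """
    alpha = math.exp(log_alpha)
    # d/d(log_alpha) of alpha * (gap - tau) is alpha * (gap - tau).
    log_alpha += lr * alpha * (gap - tau)
    # Clip so alpha cannot collapse to zero or blow up.
    log_alpha = min(max(log_alpha, lo), hi)
    return log_alpha, math.exp(log_alpha)


log_alpha = math.log(10.0)  # start from the over-regularized setting alpha = 10
_, alpha_down = update_alpha(log_alpha, gap=-1.0, tau=0.5)  # Q already too pessimistic
_, alpha_up = update_alpha(log_alpha, gap=2.0, tau=0.5)     # policy actions still overvalued
print(alpha_down, alpha_up)
```

Working in log(α) keeps α positive without an explicit constraint. When the gap is already below the target, as in the α=10 scenario, the update shrinks α so the critic can separate good from bad actions again.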