Chapter 71: The Offline RL Problem

Learning objectives:

- Collect a dataset of transitions (state, action, reward, next_state, done) from a random policy (or a fixed behavior policy) in the Hopper environment.
- Train a standard SAC agent offline (no environment interaction) on this dataset and observe the overestimation of Q-values for out-of-distribution (OOD) actions.
- Explain why naive off-policy methods fail in offline RL: the policy is trained to maximize Q, but Q is trained only on in-distribution actions, so Q can be overestimated for OOD actions.
- Identify the distributional shift between the behavior policy (which collected the data) and the learned policy.
- Relate the offline RL problem to recommendation and healthcare, where data comes from logs or historical trials.

Concept and real-world RL ...
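The data-collection step can be sketched as follows. This is a minimal, self-contained illustration using a toy chain environment as a stand-in for Hopper (which would require Gymnasium/MuJoCo); `env_reset`, `env_step`, and the chain dynamics are hypothetical names for this sketch, not the chapter's actual code.

```python
import random

def collect_offline_dataset(env_reset, env_step, policy, num_steps):
    """Roll out a behavior policy and log (s, a, r, s', done) tuples."""
    dataset = []
    state = env_reset()
    for _ in range(num_steps):
        action = policy(state)
        next_state, reward, done = env_step(state, action)
        dataset.append((state, action, reward, next_state, done))
        state = env_reset() if done else next_state
    return dataset

# Toy 5-state chain: move left/right, reward 1 for reaching state 4.
def env_reset():
    return 0

def env_step(state, action):
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, next_state == 4

random.seed(0)
random_policy = lambda s: random.choice([-1, 1])  # the behavior policy
data = collect_offline_dataset(env_reset, env_step, random_policy, 100)
```

Once collected, the dataset is frozen: the offline agent in the next step never calls `env_step` again, which is exactly what makes OOD actions dangerous.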

March 10, 2026 · 4 min · 723 words · codefrydev

Chapter 72: Conservative Q-Learning (CQL)

Learning objectives:

- Implement the CQL loss: add a term that penalizes Q-values for actions drawn from the current policy (or a uniform distribution), so that Q is lower for out-of-distribution actions.
- Apply CQL to the offline dataset from Chapter 71 and train an offline SAC (or similar) with the CQL regularizer.
- Compare the learned policy's evaluation return and Q-values with naive SAC on the same dataset.
- Explain why penalizing Q for OOD actions helps avoid overestimation and improves offline performance.
- Relate CQL to recommendation and healthcare, where we must learn from fixed logs without overestimating unseen actions.

Concept and real-world RL ...
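The core of the CQL regularizer can be sketched in a few lines for the tabular, discrete-action case (the chapter's continuous SAC version samples actions instead of enumerating them). The Q-values and action names below are hypothetical numbers for illustration only.

```python
import math

def cql_penalty(q_row, data_action):
    """CQL regularizer for one state: log-sum-exp of Q over all actions
    (pushes every Q-value down) minus Q of the action actually present
    in the dataset (pushes in-distribution Q back up)."""
    logsumexp = math.log(sum(math.exp(q) for q in q_row.values()))
    return logsumexp - q_row[data_action]

# Hypothetical Q-values for one state: the OOD action 'b' was never
# taken in the logs, yet its Q-value is inflated.
q_row = {"a": 1.0, "b": 5.0}
penalty = cql_penalty(q_row, "a")  # large: gradient descent on this
                                   # term would pull Q(s, 'b') down
```

Because log-sum-exp is always at least as large as any single Q-value, the penalty is non-negative, and it is smallest when the highest Q-value sits on the dataset action; the total loss would be the usual Bellman error plus `alpha * penalty`.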

March 10, 2026 · 4 min · 684 words · codefrydev

Chapter 73: Decision Transformers

Learning objectives:

- Implement a Decision Transformer: a transformer (GPT-style) model that takes sequences of (returns-to-go, state, action) tokens and predicts actions conditioned on the desired return and past states/actions.
- Explain the formulation: at each timestep, the input is (R_t, s_t, a_{t-1}, R_{t-1}, s_{t-1}, …), where R_t is the return from t onward; the model predicts a_t. Training is supervised on offline trajectories.
- Train the model on a simple environment's offline dataset and test it by conditioning on different returns-to-go (e.g. a high return for "expert" behavior).
- Compare with offline RL (e.g. CQL) in terms of implementation and how the policy is extracted (conditioning vs. maximization).
- Relate Decision Transformers to recommendation (sequences of user-item-reward) and dialogue (conditioning on a desired outcome).

Concept and real-world RL ...
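The data-preparation side of this formulation (computing returns-to-go and interleaving the token stream) can be sketched without the transformer itself. The function names and tagged-tuple token format below are assumptions for this sketch, not the chapter's actual implementation.

```python
def returns_to_go(rewards):
    """R_t = sum of rewards from timestep t to the end of the trajectory."""
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return rtg[::-1]

def build_dt_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) triples into the flat
    token sequence a Decision Transformer is trained on."""
    seq = []
    for R, s, a in zip(returns_to_go(rewards), states, actions):
        seq.extend([("R", R), ("s", s), ("a", a)])
    return seq

# One short trajectory: 3 timesteps of (state, action, reward).
seq = build_dt_sequence(states=[0, 1, 2],
                        actions=[1, 1, -1],
                        rewards=[1.0, 0.0, 2.0])
```

At test time the same layout is used, except the leading `R` token is set to the return you want rather than the return the data achieved; that conditioning step is what replaces Q-maximization in methods like CQL.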

March 10, 2026 · 4 min · 716 words · codefrydev