Chapter 71: The Offline RL Problem
Learning objectives

- Collect a dataset of transitions (state, action, reward, next_state, done) from a random policy (or another fixed behavior policy) in the Hopper environment; a minimal data-collection sketch appears below.
- Train a standard SAC agent offline, with no further environment interaction, on this dataset and observe the overestimation of Q-values for out-of-distribution (OOD) actions; a probe for this effect is also sketched below.
- Explain why naive off-policy methods fail in offline RL: the policy is trained to maximize Q, but Q is trained only on in-distribution actions, so for OOD actions nothing anchors Q to the true return and it can be badly overestimated.
- Identify the distributional shift between the behavior policy that collected the data and the learned policy.
- Relate the offline RL problem to domains such as recommendation and healthcare, where data comes from logs or historical trials rather than live interaction.

Concept and real-world RL ...
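To make the first objective concrete, here is a minimal sketch of offline-dataset collection with a random behavior policy, assuming `gymnasium` with the MuJoCo extra installed (`pip install "gymnasium[mujoco]"`). The function name `collect_dataset` and its parameters are illustrative, not part of any library API:

```python
import numpy as np
import gymnasium as gym

def collect_dataset(env_name="Hopper-v4", num_steps=100_000, seed=0):
    """Roll out a random behavior policy and record every transition."""
    env = gym.make(env_name)
    env.action_space.seed(seed)
    data = {k: [] for k in ("obs", "action", "reward", "next_obs", "done")}

    obs, _ = env.reset(seed=seed)
    for _ in range(num_steps):
        action = env.action_space.sample()        # random behavior policy
        next_obs, reward, terminated, truncated, _ = env.step(action)
        data["obs"].append(obs)
        data["action"].append(action)
        data["reward"].append(reward)
        data["next_obs"].append(next_obs)
        data["done"].append(terminated)           # keep bootstrapping through time-limit truncations
        obs = next_obs
        if terminated or truncated:
            obs, _ = env.reset()

    return {k: np.asarray(v) for k, v in data.items()}

dataset = collect_dataset(num_steps=10_000)       # small run for illustration
print({k: v.shape for k, v in dataset.items()})
```

Note that `done` records only `terminated`, not `truncated`: a time-limit cutoff is not a real terminal state, so the Q-target should still bootstrap through it.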
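And a sketch of the overestimation probe from the second objective: compare the critic's Q-values on dataset actions against its Q-values on the actions the learned policy prefers. The PyTorch networks below are untrained stand-ins just so the snippet runs; in the actual experiment they would be the critic and actor of the SAC agent trained offline on `dataset`:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = dataset["obs"].shape[1], dataset["action"].shape[1]

# Stand-ins for the offline-trained SAC critic and actor (hypothetical, untrained).
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1))
actor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim), nn.Tanh())

obs = torch.as_tensor(dataset["obs"][:1024], dtype=torch.float32)
a_data = torch.as_tensor(dataset["action"][:1024], dtype=torch.float32)

with torch.no_grad():
    a_policy = actor(obs)                                    # actions the learned policy prefers
    q_data = critic(torch.cat([obs, a_data], dim=-1))        # Q on in-distribution actions
    q_policy = critic(torch.cat([obs, a_policy], dim=-1))    # Q on possibly-OOD actions

# After offline SAC training, mean Q(s, a_policy) typically climbs far above
# mean Q(s, a_data) and above any return achievable from a random-policy
# dataset: that gap is the overestimation signature this chapter studies.
print(f"mean Q(s, a_data)   = {q_data.mean().item():.2f}")
print(f"mean Q(s, a_policy) = {q_policy.mean().item():.2f}")
```

The gap arises because gradient ascent on the actor actively seeks out actions where the critic's extrapolation error is positive, and offline there is no fresh environment data to correct those errors.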