Chapter 71: The Offline RL Problem

Learning objectives:

- Collect a dataset of transitions (state, action, reward, next_state, done) from a random policy (or a fixed behavior policy) in the Hopper environment.
- Train a standard SAC agent offline (no environment interaction) on this dataset and observe the overestimation of Q-values for out-of-distribution (OOD) actions.
- Explain why naive off-policy methods fail in offline RL: the policy is trained to maximize Q, but Q is trained only on in-distribution actions, so Q can be overestimated for OOD actions.
- Identify the distributional shift between the behavior policy (which collected the data) and the learned policy.
- Relate the offline RL problem to recommendation and healthcare, where data comes from logs or historical trials.

Concept and real-world RL ...
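The data-collection step can be sketched as follows. This is a minimal, self-contained illustration using a toy chain environment as a stand-in for Hopper (which would require Gymnasium/MuJoCo); `env_reset`, `env_step`, and the chain dynamics are hypothetical names for this sketch, not the chapter's actual code.

```python
import random

def collect_offline_dataset(env_reset, env_step, policy, num_steps):
    """Roll out a behavior policy and log (s, a, r, s', done) tuples."""
    dataset = []
    state = env_reset()
    for _ in range(num_steps):
        action = policy(state)
        next_state, reward, done = env_step(state, action)
        dataset.append((state, action, reward, next_state, done))
        state = env_reset() if done else next_state
    return dataset

# Toy 5-state chain: move left/right, reward 1 for reaching state 4.
def env_reset():
    return 0

def env_step(state, action):
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, next_state == 4

random.seed(0)
random_policy = lambda s: random.choice([-1, 1])  # the behavior policy
data = collect_offline_dataset(env_reset, env_step, random_policy, 100)
```

Once collected, the dataset is frozen: the offline agent in the next step never calls `env_step` again, which is exactly what makes OOD actions dangerous.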

March 10, 2026 · 4 min · 723 words · codefrydev

Chapter 72: Conservative Q-Learning (CQL)

Learning objectives:

- Implement the CQL loss: add a term that penalizes Q-values for actions drawn from the current policy (or a uniform distribution), so that Q is lower for out-of-distribution actions.
- Apply CQL to the offline dataset from Chapter 71 and train an offline SAC (or similar) with the CQL regularizer.
- Compare the learned policy's evaluation return and Q-values with naive SAC on the same dataset.
- Explain why penalizing Q for OOD actions helps avoid overestimation and improves offline performance.
- Relate CQL to recommendation and healthcare, where we must learn from fixed logs without overestimating unseen actions.

Concept and real-world RL ...
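The core of the CQL regularizer can be sketched in a few lines for the tabular, discrete-action case (the chapter's continuous SAC version samples actions instead of enumerating them). The Q-values and action names below are hypothetical numbers for illustration only.

```python
import math

def cql_penalty(q_row, data_action):
    """CQL regularizer for one state: log-sum-exp of Q over all actions
    (pushes every Q-value down) minus Q of the action actually present
    in the dataset (pushes in-distribution Q back up)."""
    logsumexp = math.log(sum(math.exp(q) for q in q_row.values()))
    return logsumexp - q_row[data_action]

# Hypothetical Q-values for one state: the OOD action 'b' was never
# taken in the logs, yet its Q-value is inflated.
q_row = {"a": 1.0, "b": 5.0}
penalty = cql_penalty(q_row, "a")  # large: gradient descent on this
                                   # term would pull Q(s, 'b') down
```

Because log-sum-exp is always at least as large as any single Q-value, the penalty is non-negative, and it is smallest when the highest Q-value sits on the dataset action; the total loss would be the usual Bellman error plus `alpha * penalty`.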

March 10, 2026 · 4 min · 684 words · codefrydev

Chapter 73: Decision Transformers

Learning objectives:

- Implement a Decision Transformer: a transformer (GPT-style) model that takes sequences of (returns-to-go, state, action) tokens and predicts actions conditioned on the desired return and past states/actions.
- Explain the formulation: at each timestep, the input is (R_t, s_t, a_{t-1}, R_{t-1}, s_{t-1}, …), where R_t is the return from t onward; the model predicts a_t. Training is supervised on offline trajectories.
- Train the model on a simple environment's offline dataset and test it by conditioning on different returns-to-go (e.g. a high return for "expert" behavior).
- Compare with offline RL (e.g. CQL) in terms of implementation and how the policy is extracted (conditioning vs. maximization).
- Relate Decision Transformers to recommendation (sequences of user-item-reward) and dialogue (conditioning on a desired outcome).

Concept and real-world RL ...
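The data-preparation side of this formulation (computing returns-to-go and interleaving the token stream) can be sketched without the transformer itself. The function names and tagged-tuple token format below are assumptions for this sketch, not the chapter's actual implementation.

```python
def returns_to_go(rewards):
    """R_t = sum of rewards from timestep t to the end of the trajectory."""
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return rtg[::-1]

def build_dt_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) triples into the flat
    token sequence a Decision Transformer is trained on."""
    seq = []
    for R, s, a in zip(returns_to_go(rewards), states, actions):
        seq.extend([("R", R), ("s", s), ("a", a)])
    return seq

# One short trajectory: 3 timesteps of (state, action, reward).
seq = build_dt_sequence(states=[0, 1, 2],
                        actions=[1, 1, -1],
                        rewards=[1.0, 0.0, 2.0])
```

At test time the same layout is used, except the leading `R` token is set to the return you want rather than the return the data achieved; that conditioning step is what replaces Q-maximization in methods like CQL.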

March 10, 2026 · 4 min · 716 words · codefrydev