Volume 8: Offline RL & Imitation Learning
Chapters 71–80 — Offline RL problem, CQL, Decision Transformers, behavioral cloning, DAgger, IRL, GAIL, AMP, offline-to-online, RLHF basics.
Chapter 71: The offline RL problem. Collect a dataset from a random policy on Hopper; show how naive SAC overestimates Q-values when trained only on that fixed data.
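A minimal sketch of the data-collection step, assuming gymnasium with the MuJoCo Hopper-v4 environment installed; buffer layout and sizes are illustrative. The naive-SAC experiment would then train only on this fixed buffer, never touching the environment again.

```python
import gymnasium as gym
import numpy as np

env = gym.make("Hopper-v4")
dataset = {"obs": [], "act": [], "rew": [], "next_obs": [], "done": []}

obs, _ = env.reset(seed=0)
for _ in range(10_000):
    act = env.action_space.sample()  # the random behavior policy
    next_obs, rew, terminated, truncated, _ = env.step(act)
    dataset["obs"].append(obs)
    dataset["act"].append(act)
    dataset["rew"].append(rew)
    dataset["next_obs"].append(next_obs)
    dataset["done"].append(terminated)  # truncation is not a real terminal
    obs = next_obs
    if terminated or truncated:
        obs, _ = env.reset()

dataset = {k: np.asarray(v) for k, v in dataset.items()}
```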
Chapter 72: Conservative Q-Learning (CQL). A loss term that penalizes Q-values of out-of-distribution actions; compare against naive offline SAC.
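A sketch of the conservative penalty under simplifying assumptions: uniform action sampling stands in for CQL's full importance-sampled estimate, and `q_net(obs, act) -> (batch, 1)` is a hypothetical critic interface. The full critic loss would be the usual TD loss plus alpha times this term.

```python
import torch

def cql_penalty(q_net, obs, acts_data, n_rand=10):
    """Push Q down on sampled (likely OOD) actions, up on dataset actions.

    Assumes actions live in [-1, 1] and q_net(obs, act) returns (batch, 1).
    """
    B, act_dim = acts_data.shape
    rand = torch.empty(B, n_rand, act_dim).uniform_(-1.0, 1.0)
    obs_rep = obs.unsqueeze(1).expand(B, n_rand, obs.shape[-1]).reshape(B * n_rand, -1)
    q_rand = q_net(obs_rep, rand.reshape(B * n_rand, act_dim)).reshape(B, n_rand)
    q_data = q_net(obs, acts_data)
    # logsumexp over sampled actions soft-maxes the OOD Q-values;
    # subtracting the dataset Q gives the conservative gap to minimize.
    return torch.logsumexp(q_rand, dim=1).mean() - q_data.mean()
```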
Chapter 73: Decision Transformer. Sequences of returns-to-go, states, and actions; a GPT-like model predicts actions autoregressively.
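A sketch of the input pipeline, assuming the standard (return-to-go, state, action) interleaving; module and dimension names are illustrative. The output sequence would feed a causal transformer whose prediction at each state token is trained to match the next action.

```python
import torch
import torch.nn as nn

class DTInput(nn.Module):
    """Embed (return-to-go, state, action) triples and interleave them
    into one token sequence for a GPT-style causal transformer."""

    def __init__(self, state_dim, act_dim, d_model=128, max_len=20):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)
        self.embed_s = nn.Linear(state_dim, d_model)
        self.embed_a = nn.Linear(act_dim, d_model)
        self.embed_t = nn.Embedding(max_len, d_model)

    def forward(self, rtg, states, actions, timesteps):
        # rtg: (B, T, 1), states: (B, T, S), actions: (B, T, A), timesteps: (B, T)
        t = self.embed_t(timesteps)
        tokens = torch.stack(
            [self.embed_rtg(rtg) + t, self.embed_s(states) + t, self.embed_a(actions) + t],
            dim=2,
        )  # (B, T, 3, D): one (rtg, s, a) triple per timestep
        return tokens.reshape(rtg.shape[0], -1, tokens.shape[-1])  # (B, 3T, D)
```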
Chapter 74: Behavioral cloning. Collect expert demonstrations from a trained PPO agent on LunarLander; fit a policy by supervised learning.
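A minimal behavioral-cloning sketch, assuming discrete LunarLander actions stored as integer labels; the network width and epoch count are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def behavior_clone(obs, acts, n_actions, epochs=50, lr=1e-3):
    """Supervised learning on expert (state, action) pairs.

    obs: (N, obs_dim) float tensor; acts: (N,) long tensor of discrete actions.
    """
    policy = nn.Sequential(nn.Linear(obs.shape[-1], 64), nn.Tanh(),
                           nn.Linear(64, n_actions))
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.cross_entropy(policy(obs), acts)  # classify the expert action
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```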
Chapter 75: Covariate shift and DAgger. Mix expert and learner actions when collecting states, relabel every visited state with the expert, aggregate, and retrain.
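A DAgger sketch in pseudocode spirit: `expert(obs)` and `train_bc(obs_list, act_list)` are assumed helpers, and the beta decay schedule is illustrative. The key point is that every visited state gets the expert's label, regardless of who acted.

```python
import random

def dagger(env, expert, train_bc, n_iters=5, horizon=1000, beta0=1.0):
    """DAgger loop: mix expert/learner rollouts, relabel with the expert,
    aggregate all data, and retrain the learner each round."""
    all_obs, all_acts = [], []
    policy = expert  # round 0 effectively rolls out the expert
    for i in range(n_iters):
        beta = beta0 * (0.5 ** i)  # illustrative mixing decay
        obs, _ = env.reset()
        for _ in range(horizon):
            act = expert(obs) if random.random() < beta else policy(obs)
            all_obs.append(obs)
            all_acts.append(expert(obs))  # expert relabels every state
            obs, _, term, trunc, _ = env.step(act)
            if term or trunc:
                obs, _ = env.reset()
        policy = train_bc(all_obs, all_acts)  # retrain on aggregated data
    return policy
```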
Chapter 76: Maximum-entropy IRL. Learn a reward from expert trajectories; a linear reward model with a forward RL solver in the inner loop.
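A sketch of the outer gradient step, assuming per-trajectory feature sums have already been computed; the forward RL solve that produces `policy_feats` is assumed to run between steps.

```python
import numpy as np

def maxent_irl_step(w, expert_feats, policy_feats, lr=0.1):
    """One gradient step on a linear reward r(s) = w @ phi(s).

    expert_feats / policy_feats: (N, k) feature sums per trajectory.
    The MaxEnt log-likelihood gradient is the expert feature expectation
    minus the current policy's feature expectation.
    """
    grad = expert_feats.mean(axis=0) - policy_feats.mean(axis=0)
    return w + lr * grad
```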
Chapter 77: GAIL. A discriminator separates expert from agent transitions; its output serves as the reward for policy-gradient training.
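A sketch of the discriminator and the surrogate reward, with illustrative network sizes; the -log(1 - D) form is one common choice of GAIL reward.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Classifies (state, action) pairs as expert (1) vs agent (0)."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(),
                                 nn.Linear(64, 1))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def gail_reward(disc, obs, act):
    # High when the discriminator mistakes agent samples for expert ones.
    with torch.no_grad():
        d = torch.sigmoid(disc(obs, act))
    return -torch.log(1.0 - d + 1e-8)

def disc_loss(disc, exp_obs, exp_act, agt_obs, agt_act):
    ones = torch.ones(exp_obs.shape[0], 1)
    zeros = torch.zeros(agt_obs.shape[0], 1)
    return (F.binary_cross_entropy_with_logits(disc(exp_obs, exp_act), ones)
            + F.binary_cross_entropy_with_logits(disc(agt_obs, agt_act), zeros))
```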
Chapter 78: Adversarial Motion Priors (AMP). Following the AMP paper, combine a task reward with an adversarial style reward into a single training signal.
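A sketch of the combined reward. The style term follows the least-squares form used in the AMP paper, max(0, 1 - 0.25 (d - 1)^2), while the 0.5/0.5 weights are illustrative hyperparameters.

```python
import torch

def amp_reward(task_rew, disc_out, w_task=0.5, w_style=0.5):
    """Combined AMP-style reward.

    disc_out is the raw output d of a least-squares discriminator trained
    to output 1 on reference motion; the clamp keeps the style term >= 0.
    Weights are illustrative, not the paper's tuned values.
    """
    style = torch.clamp(1.0 - 0.25 * (disc_out - 1.0) ** 2, min=0.0)
    return w_task * task_rew + w_style * style
```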
Chapter 79: Offline-to-online finetuning. Pretrain SAC on an offline dataset, then finetune online; a Q-filter screens out bad dataset actions.
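A sketch of a Q-filtered behavior-cloning term, in the spirit of the Q-filter idea: an offline action only constrains the policy when the critic rates it at least as highly as the policy's own action. Names and shapes here are assumptions.

```python
import torch

def q_filtered_bc_loss(q_net, policy, obs, demo_act):
    """BC term that ignores dataset actions the critic considers worse
    than the current policy's, so stale offline actions stop binding.

    Assumes q_net(obs, act) -> (B, 1) and policy(obs) -> (B, act_dim).
    """
    pi_act = policy(obs)
    with torch.no_grad():
        keep = (q_net(obs, demo_act) >= q_net(obs, pi_act)).float()  # (B, 1) mask
    per_sample = ((pi_act - demo_act) ** 2).mean(dim=-1, keepdim=True)
    return (keep * per_sample).mean()
```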
Chapter 80: RLHF basics. Fit a Bradley-Terry reward model from pairwise comparisons; train the policy against it with PPO.
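A sketch of the reward-model objective, assuming a hypothetical `r_model` that maps a trajectory segment tensor to a scalar score. Under the Bradley-Terry model, P(a preferred over b) = sigmoid(r_a - r_b), which gives a logistic loss on score differences; the PPO stage would then optimize against the learned reward.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_model, traj_a, traj_b, prefer_a):
    """Pairwise preference loss for a reward model.

    traj_a, traj_b: batched trajectory segments; prefer_a: (B, 1) float
    labels, 1.0 where segment a was preferred, 0.0 where b was.
    """
    r_a, r_b = r_model(traj_a), r_model(traj_b)  # (B, 1) scalar scores
    logits = r_a - r_b
    return F.binary_cross_entropy_with_logits(logits, prefer_a)
```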
Review Volume 8 (Offline RL, Imitation Learning, IRL, RLHF) and preview Volume 9 (Multi-Agent RL — cooperation, competition, game theory).