Chapter 79: Offline-to-Online Finetuning
Learning objectives

- Pretrain an SAC (or similar) agent offline on a fixed dataset (e.g. a mix of policies, or the dataset from Chapter 71).
- Finetune the agent online by continuing training with environment interaction.
- Compare the learning curves (return vs. steps) of finetuning from offline pretraining against training from scratch.
- Implement a Q-filter: when updating the policy, skip or downweight updates that use actions whose Q-value is below a threshold, to avoid reinforcing "bad" actions that could destabilize the policy.
- Relate offline-to-online finetuning to recommendation (pretrain on logs, then A/B test) and healthcare (pretrain on historical data, then make cautious online updates).

Concept and real-world RL ...
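The Q-filter objective above can be sketched as a per-sample weighting of the policy loss. A minimal NumPy sketch, assuming we threshold at a batch quantile of the Q-values (the function name, the 10% quantile, and the zero weight for filtered samples are illustrative choices, not fixed by the chapter):

```python
import numpy as np

def q_filter_weights(q_values, quantile=0.1, low_weight=0.0):
    """Down-weight policy-update samples whose Q-value falls below a
    batch-quantile threshold (a simple Q-filter sketch)."""
    threshold = np.quantile(q_values, quantile)
    # Samples at or above the threshold keep full weight; the rest are
    # down-weighted (or dropped entirely when low_weight == 0).
    return np.where(q_values >= threshold, 1.0, low_weight)

# Example: with quantile=0.1 the lowest ~10% of Q-values in the batch
# receive zero weight and are effectively masked out of the policy loss.
q = np.array([0.5, 2.0, -1.0, 3.0, 0.1, 1.5, -0.3, 2.2, 0.9, 4.0])
w = q_filter_weights(q, quantile=0.1, low_weight=0.0)
```

In a full SAC update, `w` would multiply each sample's policy-loss term before averaging, so low-Q actions contribute little or nothing to the gradient while critic training proceeds as usual.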