The offline RL problem, Conservative Q-Learning (CQL), Decision Transformers, imitation learning, limitations of behavioral cloning, DAgger, inverse RL, GAIL, AMP, offline-to-online finetuning, and RLHF basics. Chapters 71–80.
Learning objectives
- Collect a dataset of transitions (state, action, reward, next_state, done) from a random policy (or a fixed behavior policy) in the Hopper environment.
- Train a standard SAC agent offline (no environment interaction) on this dataset and observe the overestimation of Q-values for out-of-distribution (OOD) actions.
- Explain why naive off-policy methods fail in offline RL: the policy is trained to maximize Q, but Q is only trained on in-distribution actions, so Q can be overestimated for OOD actions.
- Identify the distributional shift between the behavior policy (which collected the data) and the learned policy.
- Relate the offline RL problem to recommendation and healthcare, where data comes from logs or historical trials.

Concept and real-world RL
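The overestimation failure mode can be seen without any neural networks. Below is a minimal numpy sketch (a toy one-state problem invented here, not the Hopper setup): actions the behavior policy rarely logged get noisy Q estimates, and the greedy max tends to pick one of them and overestimate its value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one state, 50 discrete actions, true Q = 0 for every action.
n_actions = 50
true_q = np.zeros(n_actions)

# Actions 0-9 are well covered by the behavior policy (many samples);
# actions 10-49 are rare in the log (2 samples each), i.e. effectively OOD.
samples_per_action = np.where(np.arange(n_actions) < 10, 500, 2)

# Monte-Carlo Q estimates: sample mean of noisy returns per action.
q_hat = np.array([rng.normal(0.0, 1.0, n).mean() for n in samples_per_action])

greedy = int(np.argmax(q_hat))
print(q_hat[greedy])   # positive: overestimates the true Q of 0
print(greedy)          # almost surely an under-covered (OOD) action
```

Fewer samples mean higher-variance estimates, and a max over many noisy estimates is biased upward; this is the same mechanism that bites a SAC critic trained offline and queried on policy actions outside the data.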
...
Learning objectives
- Implement the CQL loss: add a term that penalizes Q-values for actions drawn from the current policy (or a uniform distribution), so that Q is lower for out-of-distribution actions.
- Apply CQL to the offline dataset from Chapter 71 and train an offline SAC (or similar) with the CQL regularizer.
- Compare the learned policy’s evaluation return and Q-values with naive SAC on the same dataset.
- Explain why penalizing Q for OOD actions helps avoid overestimation and improves offline performance.
- Relate CQL to recommendation and healthcare, where we must learn from fixed logs without overestimating unseen actions.

Concept and real-world RL
...
Learning objectives
- Implement a Decision Transformer: a transformer (GPT-style) model that takes sequences of (returns-to-go, state, action) tokens and predicts actions conditioned on the desired return and past states/actions.
- Explain the formulation: at each timestep, the input is (R_t, s_t, a_{t-1}, R_{t-1}, s_{t-1}, …), where R_t is the return from t onward, and the model predicts a_t. Training is supervised on offline trajectories.
- Train the model on a simple environment’s offline dataset and test it by conditioning on different returns-to-go (e.g. a high return for “expert” behavior).
- Compare with offline RL (e.g. CQL) in terms of implementation and how the policy is extracted (conditioning vs. maximization).
- Relate Decision Transformers to recommendation (sequences of user-item-reward) and dialogue (conditioning on a desired outcome).

Concept and real-world RL
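Before the transformer itself, the data plumbing is worth sketching: computing returns-to-go and interleaving (R_t, s_t, a_t) tokens. The helper names below are invented for illustration; the model (omitted) would consume these token sequences with a causal mask and be trained to predict each a_t.

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """R_t = sum_{k >= t} gamma^(k - t) * r_k, computed right-to-left."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def make_dt_sequence(states, actions, rewards):
    """Interleave (R_t, s_t, a_t) tokens; a Decision Transformer is trained
    to predict a_t from everything up to and including (R_t, s_t)."""
    rtg = returns_to_go(rewards)
    tokens = []
    for R, s, a in zip(rtg, states, actions):
        tokens += [("rtg", R), ("state", s), ("action", a)]
    return tokens

print(returns_to_go([1.0, 1.0, 1.0]))  # [3. 2. 1.]
```

At test time, the same format is used but the desired return is supplied manually as the first R token and decremented by each observed reward, which is how "conditioning on a high return" is implemented.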
...
Learning objectives
- Collect expert demonstrations (state-action pairs or trajectories) from a trained PPO agent on LunarLander.
- Train a behavioral cloning (BC) agent: supervised learning to predict the expert’s action given the state.
- Evaluate the BC policy in the environment and compare its return and behavior to the expert’s.
- Explain the assumptions of behavioral cloning (i.i.d. states from the expert distribution) and when it works well.
- Relate imitation learning to robot navigation (learning from human demos) and dialogue (learning from human responses).

Concept and real-world RL
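Since BC is just supervised learning, the whole idea fits in a few lines. A minimal sketch on an invented 1-D task (not LunarLander): the "expert" is a threshold rule, and the cloned policy is a logistic regression fit to its state-action pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D task: the expert picks action 1 iff the state is > 0.
states = rng.uniform(-1.0, 1.0, size=500)
actions = (states > 0.0).astype(float)

# Behavioral cloning = supervised learning: logistic regression on
# (state, expert action) pairs, fit by gradient descent.
w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    probs = 1.0 / (1.0 + np.exp(-(w * states + b)))
    grad = probs - actions                 # dLoss/dlogits, cross-entropy
    w -= lr * np.mean(grad * states)
    b -= lr * np.mean(grad)

bc_actions = (1.0 / (1.0 + np.exp(-(w * states + b))) > 0.5).astype(float)
acc = np.mean(bc_actions == actions)
```

Note what the loop never touches: the environment. BC only matches the expert on states the expert visited, which is exactly the assumption (i.i.d. expert states) that the next chapter's covariate-shift demo breaks.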
...
Learning objectives
- Demonstrate the covariate shift problem: run the BC agent, record states it visits that were rare or absent in the expert data, and show that errors compound in those regions.
- Implement DAgger: collect new data by running the current BC policy (or a mix of expert and BC), query the expert for the correct action at those states, add the labeled states to the dataset, and retrain BC.
- Explain why DAgger reduces covariate shift by adding on-policy (or mixed) states to the training set.
- Compare BC (trained only on expert data) with DAgger (iteratively aggregated data) in terms of evaluation return and robustness.
- Relate covariate shift and DAgger to robot navigation and healthcare, where the learner’s distribution can drift from the expert’s.

Concept and real-world RL
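The DAgger loop itself is short. A toy sketch (all dynamics, ranges, and helper names invented here): the initial BC data only covers states below zero, the learner drifts into uncovered territory, and each DAgger round labels the learner's own visited states with the expert and retrains.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(s):
    # Hypothetical expert on a 1-D state: action 0 left of zero, else 1.
    return 0.0 if s < 0.0 else 1.0

def fit_1nn(X, y):
    # Simplest possible learner: 1-nearest-neighbor over the dataset.
    def policy(s):
        return y[np.argmin(np.abs(X - s))]
    return policy

def rollout(policy, s0=-1.0, steps=20):
    # Toy dynamics: the agent drifts right, so it eventually visits
    # states the original expert data never covered.
    s, visited = s0, []
    for _ in range(steps):
        visited.append(s)
        s += 0.1 if policy(s) == 0.0 else 0.2
    return np.array(visited)

# Round 0: plain BC data, drawn only from the expert's state region.
X = rng.uniform(-1.0, 0.0, size=30)
y = np.array([expert_action(s) for s in X])

# DAgger: run the learner, have the expert label the *learner's* states,
# aggregate, retrain.
for _ in range(3):
    policy = fit_1nn(X, y)
    visited = rollout(policy)
    X = np.concatenate([X, visited])
    y = np.concatenate([y, [expert_action(s) for s in visited]])

final_policy = fit_1nn(X, y)
grid = np.linspace(-1.0, 1.0, 101)
acc = np.mean([final_policy(s) == expert_action(s) for s in grid])
```

The key line is who generates the states (the learner) versus who generates the labels (the expert); that split is what moves the training distribution toward the states the learner actually sees.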
...
Learning objectives
- Implement maximum entropy IRL: given expert trajectories, learn a reward function such that the expert’s policy (approximately) maximizes expected return under that reward.
- Use a linear reward model (e.g. r(s, a) = w^T φ(s, a)) and forward RL (e.g. value iteration or policy gradient) to compute the optimal policy for the current reward.
- Iterate between updating the reward to make the expert look better than other policies and solving the forward RL problem.
- Explain why IRL can recover a reward that explains the expert’s behavior and then generalize (e.g. to new states) better than pure BC in some settings.
- Relate IRL to robot navigation (recovering intent from demonstrations) and healthcare (inferring treatment objectives).

Concept and real-world RL
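For a linear reward, the MaxEnt IRL gradient is the expert feature expectation minus the current policy's feature expectation. A minimal sketch on an invented one-state problem, where the forward RL step collapses to a softmax over action rewards (in a real MDP it would be value iteration or policy gradient):

```python
import numpy as np

# Single-state MDP with 4 actions and a linear reward r(a) = w . phi(a).
phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0],
                [0.5, 0.5]])

# Hypothetical expert demonstrations: the expert always picks action 2,
# so the expert feature expectation is phi[2].
mu_expert = phi[2]

w = np.zeros(2)
lr = 0.5
for _ in range(200):
    # Forward problem: in this one-state case the MaxEnt-optimal policy
    # is just a softmax over action rewards.
    r = phi @ w
    pi = np.exp(r - r.max())
    pi /= pi.sum()
    # MaxEnt IRL gradient: expert features minus policy features.
    w += lr * (mu_expert - pi @ phi)

r = phi @ w
pi = np.exp(r - r.max())
pi /= pi.sum()
```

The update makes the expert "look better than other policies" exactly when the policy's feature expectation still differs from the expert's; at convergence the induced policy concentrates on the demonstrated action.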
...
Learning objectives
- Implement GAIL: train a discriminator D(s, a) to distinguish state-action pairs from the expert vs. from the current policy, and use the discriminator output (or log D) as the reward for a policy gradient method.
- Train the policy to maximize the discriminator reward (i.e. to fool the discriminator) while the discriminator tries to tell expert from agent.
- Test on a simple task (e.g. CartPole or MuJoCo) and compare imitation quality with behavioral cloning.
- Explain the connection to GANs: the policy is the generator, and the discriminator provides the learning signal.
- Relate GAIL to robot navigation and game AI, where we have expert demos and want to match the expert distribution without hand-designed rewards.

Concept and real-world RL
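The generator/discriminator alternation can be sketched in a one-state "bandit" setting (everything here is an invented toy, not CartPole or MuJoCo): the discriminator is a per-action logistic model, and the policy is updated by REINFORCE with log D(a) as the surrogate reward.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 4
# Hypothetical expert demos in a one-state task: mostly action 3.
expert_actions = rng.choice(n_actions, size=2000, p=[0.05, 0.05, 0.05, 0.85])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

pol_logits = np.zeros(n_actions)     # the "generator" (policy)
disc_logits = np.zeros(n_actions)    # D(a) = sigmoid(disc_logits[a])
lr_pi, lr_d, batch = 0.5, 0.5, 200

for _ in range(300):
    pi = softmax(pol_logits)
    agent_a = rng.choice(n_actions, size=batch, p=pi)
    expert_a = rng.choice(expert_actions, size=batch)

    # Discriminator step: logistic regression, expert = 1, agent = 0.
    d = 1.0 / (1.0 + np.exp(-disc_logits))
    grad_d = np.zeros(n_actions)
    np.add.at(grad_d, expert_a, 1.0 - d[expert_a])
    np.add.at(grad_d, agent_a, -d[agent_a])
    disc_logits += lr_d * grad_d / batch

    # Policy step: REINFORCE with surrogate reward log D(a).
    reward = np.log(1.0 / (1.0 + np.exp(-disc_logits)))
    baseline = reward @ pi
    grad_pi = np.zeros(n_actions)
    for a in agent_a:
        score = -pi.copy()
        score[a] += 1.0                # grad of log pi(a) wrt logits
        grad_pi += (reward[a] - baseline) * score
    pol_logits += lr_pi * grad_pi / batch

pi = softmax(pol_logits)
```

At the equilibrium the discriminator cannot tell expert from agent (D near 0.5), which happens precisely when the policy's action distribution matches the expert's; that is the GAN correspondence in miniature.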
...
Learning objectives
- Read the AMP paper and explain how it combines a task reward (e.g. velocity tracking, goal reaching) with an adversarial style reward (a discriminator that scores motion similarity to reference data).
- Write the combined reward function: r = r_task + λ r_style, where r_style comes from a discriminator trained to distinguish agent motion from reference (e.g. motion capture) data.
- Identify why adding a style reward helps produce natural-looking and robust locomotion compared to a task-only reward.
- Relate AMP to robot navigation and game AI (character animation), where we want both task success and natural motion.

Concept and real-world RL
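The combined reward is straightforward to write down. A minimal sketch (function names invented here): the style term below uses the common GAN-style choice -log(1 - D); the AMP paper itself uses a least-squares discriminator with its own reward mapping, so treat this as the generic shape of the formula, not the paper's exact variant.

```python
import numpy as np

def style_reward(d_score, eps=1e-6):
    """Style reward from a discriminator score D in (0, 1): higher when
    the discriminator thinks the motion resembles the reference data.
    -log(1 - D) is one common GAN-style mapping; AMP uses a
    least-squares variant."""
    return -np.log(np.clip(1.0 - d_score, eps, 1.0))

def amp_reward(r_task, d_score, lam=0.5):
    # Combined objective from the chapter: r = r_task + lambda * r_style.
    return r_task + lam * style_reward(d_score)
```

The weight λ is the knob the chapter's question is about: λ = 0 recovers task-only training (often effective but unnatural motion), while large λ prioritizes matching the reference style over task success.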
...
Learning objectives
- Pretrain a SAC (or similar) agent offline on a fixed dataset (e.g. from a mix of policies, or from Chapter 71).
- Finetune the agent online by continuing training with environment interaction.
- Compare the learning curve (return vs. steps) of finetuning from offline pretraining against training from scratch.
- Implement a Q-filter: when updating the policy, avoid or downweight updates that use actions whose Q-value is below a threshold, to avoid reinforcing “bad” actions that could destabilize the policy.
- Relate offline-to-online to recommendation (pretrain on logs, then A/B test) and healthcare (pretrain on historical data, then make cautious online updates).

Concept and real-world RL
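The Q-filter is a small piece of machinery: per-sample weights that zero out (or downweight) policy-update terms for low-Q actions. A hedged sketch with an invented batch-percentile threshold; real implementations vary in how the threshold is chosen and whether the masking is hard or soft.

```python
import numpy as np

def q_filter(q_values, percentile=30.0):
    """Binary weights for a policy update: drop (zero-weight) actions
    whose Q-value falls in the bottom `percentile` of the batch. A
    sketch of the Q-filter idea; soft downweighting is also possible."""
    q = np.asarray(q_values, dtype=float)
    threshold = np.percentile(q, percentile)
    return (q >= threshold).astype(float)

# Usage: multiply per-sample policy losses by these weights before
# averaging, so low-Q actions contribute nothing to the update.
weights = q_filter([0.2, 1.5, -0.7, 0.9, 0.1])
```

During the offline-to-online transition the pretrained Q-function is the only guardrail, so filtering updates through it is a cheap way to keep early online exploration from reinforcing actions the critic already considers poor.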
...
Learning objectives
- Implement a Bradley-Terry model to learn a reward function from pairwise comparisons of trajectories (or segments): given (τ^w, τ^l), meaning “τ^w is preferred over τ^l,” fit r so that E[r(τ^w)] > E[r(τ^l)].
- Use the learned reward to train a policy with PPO (or another policy gradient method): maximize expected return under r.
- Explain the RLHF pipeline: collect preferences → train a reward model → train the policy against the reward model.
- Test on a simple environment with simulated preferences (e.g. prefer longer / higher-return trajectories) and verify that the policy improves.
- Relate RLHF to dialogue (prefer helpful/harmless responses) and recommendation (prefer engaging content).

Concept and real-world RL
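The reward-model step can be sketched end to end with a linear reward and simulated preferences (all features, sizes, and the hidden preference vector below are invented for illustration). The Bradley-Terry model says P(τ^w preferred over τ^l) = sigmoid(r(τ^w) - r(τ^l)), fit by gradient ascent on the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each trajectory is summarized by a feature vector phi(tau); the reward
# model is linear, r(tau) = w . phi(tau).
phi = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])      # hidden "annotator" preference

# Simulated preferences: tau_i preferred over tau_j iff its true reward
# is higher (noiseless labels, for simplicity).
idx = rng.integers(0, 100, size=(500, 2))
idx = idx[idx[:, 0] != idx[:, 1]]
winner_first = phi[idx[:, 0]] @ w_true >= phi[idx[:, 1]] @ w_true
winners = np.where(winner_first, idx[:, 0], idx[:, 1])
losers = np.where(winner_first, idx[:, 1], idx[:, 0])
diff = phi[winners] - phi[losers]        # (n_pairs, 3)

# Bradley-Terry log-likelihood ascent: push r(tau_w) above r(tau_l).
w = np.zeros(3)
lr = 0.5
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(diff @ w)))
    w += lr * ((1.0 - p)[:, None] * diff).mean(axis=0)

# The learned reward should reproduce the preference ordering.
agree = np.mean(diff @ w > 0)
```

The output `w` is the reward model; the remaining pipeline stage (not shown) trains a policy with PPO against r(τ) = w·φ(τ), closing the collect-preferences → reward model → policy loop.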
...