Volume 10: Real-World RL, Safety & Large Language Models
Chapters 91–100 — RL in robotics, safe RL, algorithmic trading, recommender systems, PPO for LLMs, RLHF, DPO, evaluating agents, debugging, future of RL.
91. RL in robotics — train in simulation (e.g. arm reaching); domain randomization; sim-to-real transfer.
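Domain randomization can be sketched as sampling a fresh set of physics parameters each episode so the policy cannot overfit to one simulator configuration. The parameter names and ranges below are illustrative, not from any specific simulator.

```python
import random

def randomize_dynamics(rng=random):
    """Sample simulator physics parameters for one episode.

    Ranges are illustrative: roughly +/-20% around nominal mass,
    wider for friction, plus additive sensor noise.
    """
    return {
        "mass": rng.uniform(0.8, 1.2),
        "friction": rng.uniform(0.5, 1.5),
        "motor_gain": rng.uniform(0.9, 1.1),
        "sensor_noise_std": rng.uniform(0.0, 0.02),
    }
```

In practice these samples are fed to the simulator's reset call at the start of every episode, so the trained policy sees a distribution of dynamics rather than a single point.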
92. Safe RL — constrained MDP for self-driving; Lagrangian penalty.
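The Lagrangian approach turns the constraint E[cost] ≤ d into a penalty: the policy maximizes J_r(π) − λ(J_c(π) − d) while λ is updated by dual ascent on the violation. A minimal sketch with made-up cost numbers:

```python
def lagrangian_objective(avg_reward, avg_cost, lam, cost_limit):
    """Penalized objective the policy maximizes: reward minus
    lambda times the constraint violation."""
    return avg_reward - lam * (avg_cost - cost_limit)

def dual_ascent_step(lam, avg_cost, cost_limit, lr=0.1):
    """Lambda grows while the constraint is violated and shrinks
    (clipped at zero) once the policy satisfies it."""
    return max(0.0, lam + lr * (avg_cost - cost_limit))

# Illustrative loop: episode costs fall as the policy adapts.
lam, cost_limit = 0.0, 1.0
for avg_cost in [3.0, 2.5, 1.8, 1.2, 0.9]:
    lam = dual_ascent_step(lam, avg_cost, cost_limit)
```

The alternation (policy step on the penalized objective, then a λ update) is the core of Lagrangian methods such as PPO-Lagrangian.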
93. Algorithmic trading — simple stock MDP with buy/sell/hold actions; profit reward; Sharpe-ratio evaluation.
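The Sharpe ratio mentioned above is the mean excess return divided by its standard deviation; a common convention annualizes by the square root of 252 trading days, which is the assumption made here:

```python
import numpy as np

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of a series of per-step returns."""
    excess = np.asarray(returns) - risk_free
    return np.sqrt(periods_per_year) * excess.mean() / excess.std()
```

Evaluating an agent on risk-adjusted return rather than raw profit penalizes strategies that make money only through large swings.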
94. Recommender systems — toy recommender with 100 items and a non-stationary user; maximize engagement.
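One way to sketch this setup, under the assumption that it is modeled as a non-stationary bandit: 100 items with drifting click probabilities, an ε-greedy policy, and a constant step size so stale value estimates are forgotten. All numbers here are illustrative.

```python
import numpy as np

def run_bandit(n_items=100, steps=2000, eps=0.1, drift=0.01, seed=0):
    """Epsilon-greedy over items whose click probabilities drift,
    tracked with a constant step size; returns the achieved CTR."""
    rng = np.random.default_rng(seed)
    p = rng.uniform(0.0, 0.5, n_items)   # true (hidden) click probs
    q = np.zeros(n_items)                # value estimates
    clicks = 0.0
    for _ in range(steps):
        a = rng.integers(n_items) if rng.random() < eps else int(q.argmax())
        r = float(rng.random() < p[a])   # 1 if the user clicks
        q[a] += 0.1 * (r - q[a])         # constant step size forgets old data
        clicks += r
        p = np.clip(p + rng.normal(0, drift, n_items), 0, 1)  # user drifts
    return clicks / steps
```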
95. PPO for LLMs — fine-tune a small LM (e.g. GPT-2) for sentiment; KL penalty against the reference model.
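A common way to apply the KL penalty is per token: subtract β(log π − log π_ref) from the reward at each position and add the scalar task reward (e.g. a sentiment score) on the final token. The log-probs and β below are illustrative values, not from a real run:

```python
import numpy as np

def shaped_rewards(task_reward, logp_policy, logp_ref, beta=0.1):
    """Per-token reward: -beta * (log pi - log pi_ref), with the
    scalar task reward added on the final token."""
    kl_per_token = np.asarray(logp_policy) - np.asarray(logp_ref)
    rewards = -beta * kl_per_token
    rewards[-1] += task_reward
    return rewards
```

The penalty keeps the fine-tuned model close to the reference model, which discourages reward hacking and degenerate text.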
96. RLHF — simulated preference data; Bradley-Terry reward model; PPO fine-tuning.
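Under the Bradley-Terry model, the probability that the chosen response beats the rejected one is the sigmoid of the reward margin, so the reward model is trained with this negative log-likelihood:

```python
import numpy as np

def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry NLL for one preference pair:
    P(chosen > rejected) = sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    return -np.log(1.0 / (1.0 + np.exp(-margin)))
```

At a zero margin the loss is log 2; it falls as the reward model assigns the chosen response a larger score.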
97. DPO — derive the DPO loss from the Bradley-Terry model and the KL-optimal policy; compare with PPO.
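Substituting the KL-optimal policy into the Bradley-Terry likelihood yields the DPO loss: the negative log-sigmoid of β times the difference of policy-vs-reference log-ratios for the preferred and dispreferred responses. A per-pair sketch:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                         - (log pi(y_l) - log pi_ref(y_l))])."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))
```

Unlike PPO-based RLHF, this needs no separate reward model and no sampling loop: it is a supervised loss over preference pairs.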
98. Evaluating agents — PPO across 10 seeds; mean and standard deviation; rliable confidence intervals.
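The rliable library computes interval estimates via stratified bootstrap; as a stand-in, a plain percentile bootstrap over per-seed scores conveys the same idea (the score values and resample count below are illustrative):

```python
import numpy as np

def bootstrap_ci(scores, n_boot=10000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean score
    across seeds: resample seeds with replacement, take quantiles."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_boot)
    ])
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)
```

Reporting the interval rather than a bare mean makes it visible when two algorithms are statistically indistinguishable on 10 seeds.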
99. Debugging RL — deliberately broken SAC; unit tests; logging Q-values, reward, and entropy; diagnose the failure.
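The logged quantities can feed cheap sanity checks that catch common SAC failure modes; the thresholds below are illustrative, and the entropy check assumes discrete actions (continuous-action differential entropy can legitimately be negative):

```python
import numpy as np

def check_training_stats(q_values, rewards, entropy):
    """Return a list of suspected problems given logged stats."""
    problems = []
    if not np.all(np.isfinite(q_values)):
        problems.append("non-finite Q-values (exploding critic?)")
    elif np.abs(q_values).max() > 1e4:
        problems.append("Q-values diverging (check target network / gamma)")
    if entropy < 0:  # discrete-action entropy should be non-negative
        problems.append("negative entropy (broken temperature update?)")
    if np.std(rewards) == 0:
        problems.append("constant rewards (env or reward-scaling bug?)")
    return problems
```

Running such checks inside unit tests every few thousand steps usually localizes the bug faster than staring at a flat return curve.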
100. The future of RL — essay on foundation models and RL, architectures, and the path toward AGI.