RL in robotics, safe reinforcement learning, algorithmic trading, recommender systems, training LLMs with PPO, implementing RLHF, Direct Preference Optimization (DPO), evaluating RL agents, debugging RL code, and the future of RL. Chapters 91–100.
Chapter 91: RL in Robotics
Learning objectives
- Train a policy in simulation (e.g. robotic arm reaching or locomotion) using a standard RL algorithm (e.g. PPO or SAC).
- Apply domain randomization: vary physics parameters (e.g. mass, friction, motor gains) during training so the policy sees a distribution of sim environments.
- Attempt to deploy the policy in a real-world setting (or a different sim with "real" parameters) and evaluate the sim-to-real gap (drop in performance or need for adaptation).
- Explain why domain randomization can improve transfer: the policy becomes robust to parameter variation and may generalize to the real world.
- Relate sim-to-real and domain randomization to robot navigation and healthcare (safety-critical deployment).
Concept and real-world RL ...
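The domain randomization objective above can be sketched with a minimal, self-contained toy example. The environment class and parameter names here (`ToyReachEnv`, `mass`, `friction`, `motor_gain`) are hypothetical stand-ins, not any particular simulator's API; the point is the wrapper pattern: resample physics parameters from chosen ranges at every episode reset, so the policy trains against a distribution of dynamics rather than a single fixed sim.

```python
import random


class ToyReachEnv:
    """Hypothetical stand-in for a simulated 1-D reaching task."""

    def __init__(self, mass=1.0, friction=0.5, motor_gain=1.0):
        self.mass = mass
        self.friction = friction
        self.motor_gain = motor_gain
        self.pos = 0.0

    def reset(self):
        self.pos = 0.0
        return self.pos

    def step(self, action):
        # Toy dynamics: heavier mass and higher friction dampen the action.
        self.pos += self.motor_gain * action / (self.mass * (1.0 + self.friction))
        reward = -abs(1.0 - self.pos)        # target is pos = 1.0
        done = abs(1.0 - self.pos) < 0.05
        return self.pos, reward, done


class DomainRandomizationWrapper:
    """Resample named physics parameters at every reset."""

    def __init__(self, env, ranges, seed=0):
        self.env = env
        self.ranges = ranges                 # e.g. {"mass": (0.5, 2.0)}
        self.rng = random.Random(seed)

    def reset(self):
        for name, (lo, hi) in self.ranges.items():
            setattr(self.env, name, self.rng.uniform(lo, hi))
        return self.env.reset()

    def step(self, action):
        return self.env.step(action)


env = DomainRandomizationWrapper(
    ToyReachEnv(),
    ranges={"mass": (0.5, 2.0), "friction": (0.1, 1.0), "motor_gain": (0.8, 1.2)},
)
env.reset()
print(f"episode 1 mass: {env.env.mass:.3f}")
env.reset()
print(f"episode 2 mass: {env.env.mass:.3f}")  # a fresh draw each episode
```

Training PPO or SAC against this wrapper (instead of the bare env) is what exposes the policy to the parameter distribution; widening the ranges trades some in-sim performance for robustness at deployment.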