Chapter 58: Model-Based Policy Optimization (MBPO)
Learning objectives

- Implement MBPO: learn an ensemble of dynamics models, generate short rollouts from real states, add the imagined transitions to the replay buffer, and train SAC on the combined buffer.
- Compare sample efficiency with SAC alone, using the same number of real environment steps.
- Explain why short rollouts (e.g. 1-5 steps) help avoid compounding model error.

Concept and real-world RL

MBPO (Model-Based Policy Optimization) uses a learned dynamics model to augment the replay buffer: starting from a real state, roll out the model for a few steps and add the imagined transitions (s, a, r, s') to the buffer. SAC (or another off-policy method) then trains on the mixture of real and imagined data. Keeping rollouts short keeps model error manageable, because one-step prediction errors compound as imagined trajectories grow longer. In domains such as robot control and trading, where real interactions are expensive, MBPO can significantly reduce the number of real steps needed to reach good performance.

...
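The rollout-generation step described above can be sketched as follows. This is a minimal illustration using NumPy only: the `LinearDynamicsModel` class, the toy linear dynamics, and the placeholder policy are all assumptions made for the example, not part of any official MBPO implementation. A real MBPO agent would use an ensemble of probabilistic neural networks and a SAC policy, but the branching logic (short rollouts from real states, a randomly chosen ensemble member per step) is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

class LinearDynamicsModel:
    """One ensemble member: predicts (next_state, reward) by least squares.
    A stand-in for the probabilistic neural nets used in practice."""
    def fit(self, states, actions, next_states, rewards):
        X = np.hstack([states, actions])
        Y = np.hstack([next_states, rewards[:, None]])
        self.W, *_ = np.linalg.lstsq(X, Y, rcond=None)

    def predict(self, state, action):
        out = np.hstack([state, action]) @ self.W
        return out[:-1], out[-1]  # predicted next state, predicted reward

def model_rollouts(ensemble, real_states, policy, horizon, rng):
    """Generate short imagined rollouts branching from real states."""
    imagined = []
    for s in real_states:
        for _ in range(horizon):  # keep horizon small (1-5) to limit compounding error
            a = policy(s)
            model = ensemble[rng.integers(len(ensemble))]  # random member per step
            s_next, r = model.predict(s, a)
            imagined.append((s, a, r, s_next))  # goes into the SAC replay buffer
            s = s_next
    return imagined

# Toy "real" data from linear dynamics s' = 0.9*s + 0.1*a, reward = -s^2.
states = rng.normal(size=(256, 1))
actions = rng.normal(size=(256, 1))
next_states = 0.9 * states + 0.1 * actions
rewards = -(states[:, 0] ** 2)

# Train each ensemble member on a bootstrap resample of the real data.
ensemble = [LinearDynamicsModel() for _ in range(4)]
for m in ensemble:
    idx = rng.integers(0, 256, size=256)
    m.fit(states[idx], actions[idx], next_states[idx], rewards[idx])

policy = lambda s: -0.5 * s  # placeholder for the SAC actor
batch = model_rollouts(ensemble, states[:8], policy, horizon=3, rng=rng)
print(len(batch))  # 8 start states x 3 steps = 24 imagined transitions
```

In full MBPO these imagined transitions are appended to the replay buffer alongside real ones, and SAC updates sample from that mixture; the short `horizon` is what keeps the model's compounding error from polluting the buffer.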