Learning objectives
- Compare model-free (e.g. PPO) and model-based (e.g. Dreamer) RL in terms of sample efficiency on a continuous control task like Walker.
- Explain why model-based methods can achieve more reward per real environment step (use of imagined rollouts).
- Identify trade-offs: model bias, computation, and implementation complexity.
Concept and real-world RL
Model-free methods learn a policy or value function directly from experience; model-based methods learn a dynamics model and use it for planning or imagined rollouts. Model-based RL can be more sample-efficient because each real transition can be reused many times in the model (short rollouts, planning). In robot navigation and trading, where real data is expensive, sample efficiency matters; in game AI, model-based methods (e.g. MuZero) combine learning and planning. The downside is model error (compounding over long rollouts) and extra computation.
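The "reuse each real transition many times" argument can be made concrete with a back-of-the-envelope count. The sketch below (hypothetical numbers; the function name and rollout settings are illustrative, not from any specific library) compares how many training transitions a model-free agent and an MBPO/Dreamer-style agent can draw from the same real-step budget:

```python
# Toy illustration (hypothetical numbers): how short imagined rollouts
# multiply the number of training transitions per real environment step.

def training_transitions(real_steps, rollouts_per_step=0, rollout_length=0):
    """Count transitions available for policy updates.

    Model-free: only the real transitions.
    Model-based: real transitions plus short imagined rollouts branched
    from each real state (MBPO/Dreamer-style training).
    """
    imagined = real_steps * rollouts_per_step * rollout_length
    return real_steps + imagined

real_steps = 100_000
model_free  = training_transitions(real_steps)          # 100_000
model_based = training_transitions(real_steps,
                                   rollouts_per_step=4,
                                   rollout_length=15)   # 6_100_000

print(model_free, model_based, model_based / model_free)  # ratio: 61x
```

The 61x ratio is of course an upper bound on the benefit: imagined transitions are only as useful as the model is accurate.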
Where you see this in practice: Dreamer, MBPO, and MuZero are used in benchmarks; PPO/SAC remain standard when simplicity and robustness matter.
Illustration (sample efficiency): Model-based methods (e.g. Dreamer) often reach a given return in fewer env steps than model-free PPO. The standard way to visualize this is a plot of mean return vs real env steps for both methods.
Exercise: Compare the sample efficiency of a model-based method (e.g., Dreamer) and a model-free method (e.g., PPO) on a task like Walker. Explain why model-based methods can be more sample-efficient.
Professor’s hints
- Run both for the same number of real env steps (e.g. 100k). Plot return vs steps. Dreamer typically uses many model rollouts per real step; PPO uses only real data.
- Sample efficiency: which method reaches a given return (e.g. 500) in fewer real steps? Model-based can do better by learning from imagined data.
- Explain: the model generates synthetic transitions, so the agent effectively gets more “experience” per real sample.
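The sample-efficiency question in the hints ("which method reaches a given return in fewer real steps?") reduces to reading a threshold crossing off each learning curve. A minimal sketch, using made-up learning curves (the numbers are hypothetical, not measured results):

```python
import numpy as np

def steps_to_return(steps, returns, threshold):
    """First real env step at which mean return reaches the threshold (None if never)."""
    for s, r in zip(steps, returns):
        if r >= threshold:
            return s
    return None

# Hypothetical learning curves: mean return logged every 10k real env steps.
steps = np.arange(10_000, 110_000, 10_000)
ppo_returns     = np.array([50,  90, 140, 200, 270, 340, 400, 450, 490, 520])
dreamer_returns = np.array([80, 180, 320, 470, 600, 700, 760, 800, 830, 850])

print(steps_to_return(steps, ppo_returns, 500))      # 100000
print(steps_to_return(steps, dreamer_returns, 500))  # 50000
```

With these (illustrative) curves the model-based agent crosses return 500 in half the real steps, which is exactly the comparison the exercise asks you to make and explain.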
Common pitfalls
- Comparing by wall-clock time: Model-based often does more compute per step; compare by env steps (or report both).
- Different hyperparameters: Use reasonable defaults for each; document so the comparison is fair.
Worked solution (warm-up: why a learned model helps)
A learned dynamics model generates imagined transitions from each real one, so a single real sample can drive many policy updates; as long as model error stays small over short rollouts, the agent gains more experience (and thus more reward) per real environment step.
Extra practice
- Warm-up: In one sentence, why can a learned model improve sample efficiency?
- Coding: Run PPO on Walker2d for 200k steps and Dreamer (or MBPO) for 200k steps. Plot mean return vs steps. Which reaches 1000 first?
- Challenge: Vary the model rollout length in the model-based method. How does very long rollout length affect performance (think compounding error)?
- Variant: Repeat the comparison with a lower-dimensional environment (e.g. Pendulum). Does the model-based advantage shrink or grow? Why might model learning be easier in low-dimensional spaces?
- Debug: A student writes: “I train the world model once at the start and then only update the policy.” Identify the flaw and describe when this static-model approach would fail badly.
- Conceptual: When is model-based RL not preferred over model-free? Name two conditions (e.g. highly stochastic dynamics, non-stationary reward) that make learning a good model expensive or unreliable.
- Recall: Write down the two components that a typical world model must predict, and state what loss function is used for each.
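For the challenge on rollout length, compounding error can be demonstrated in one dimension. The sketch below uses hypothetical scalar dynamics (true decay 0.9, a slightly biased learned model with 0.92) and tracks how the relative prediction error grows with rollout horizon:

```python
# Sketch (hypothetical 1-D dynamics): a small one-step model bias
# compounds over longer imagined rollouts.

def relative_rollout_error(horizon, true_a=0.9, model_a=0.92, x0=1.0):
    """Relative gap between model and true trajectories after `horizon` steps."""
    x_true, x_model = x0, x0
    for _ in range(horizon):
        x_true, x_model = true_a * x_true, model_a * x_model
    return x_model / x_true - 1.0

for h in (1, 5, 15, 50):
    # Error grows monotonically with horizon: ~2% at h=1, ~200% at h=50.
    print(h, relative_rollout_error(h))
```

This is why model-based methods favor short rollouts branched from real states: a 2% one-step error is tolerable, but after 50 imagined steps the model's prediction is off by a factor of ~3, and a policy trained on such rollouts optimizes the wrong dynamics.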