Learning objectives
- Compare model-free (e.g. PPO) and model-based (e.g. Dreamer) RL in terms of sample efficiency on a continuous control task like Walker.
- Explain why model-based methods can achieve more reward per real environment step (use of imagined rollouts).
- Identify trade-offs: model bias, computation, and implementation complexity.
Concept and real-world RL
Model-free methods learn a policy or value function directly from experience; model-based methods learn a dynamics model and use it for planning or imagined rollouts. Model-based RL can be more sample-efficient because each real transition can be reused many times in the model (short rollouts, planning). In robot navigation and trading, where real data is expensive, sample efficiency matters; in game AI, model-based methods (e.g. MuZero) combine learning and planning. The downside is model error (compounding over long rollouts) and extra computation.
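The "reuse each real transition many times" argument can be made concrete with a back-of-the-envelope count. The sketch below (hypothetical numbers; the function name and rollout settings are illustrative, not from any specific library) compares how many training transitions a model-free agent and an MBPO/Dreamer-style agent can draw from the same real-step budget:

```python
# Toy illustration (hypothetical numbers): how short imagined rollouts
# multiply the number of training transitions per real environment step.

def training_transitions(real_steps, rollouts_per_step=0, rollout_length=0):
    """Count transitions available for policy updates.

    Model-free: only the real transitions.
    Model-based: real transitions plus short imagined rollouts branched
    from each real state (MBPO/Dreamer-style training).
    """
    imagined = real_steps * rollouts_per_step * rollout_length
    return real_steps + imagined

real_steps = 100_000
model_free  = training_transitions(real_steps)          # 100_000
model_based = training_transitions(real_steps,
                                   rollouts_per_step=4,
                                   rollout_length=15)   # 6_100_000

print(model_free, model_based, model_based / model_free)  # ratio: 61x
```

The 61x ratio is of course an upper bound on the benefit: imagined transitions are only as useful as the model is accurate.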
Where you see this in practice: Dreamer, MBPO, and MuZero are used in benchmarks; PPO/SAC remain standard when simplicity and robustness matter.
Illustration (sample efficiency): Model-based methods (e.g. Dreamer) often reach a given return in fewer env steps than model-free PPO. The standard way to visualize this is a plot of mean return vs real env steps for both methods.
Exercise: Compare the sample efficiency of a model-based method (e.g., Dreamer) and a model-free method (e.g., PPO) on a task like Walker. Explain why model-based methods can be more sample-efficient.
Professor’s hints
- Run both for the same number of real env steps (e.g. 100k). Plot return vs steps. Dreamer typically uses many model rollouts per real step; PPO uses only real data.
- Sample efficiency: which method reaches a given return (e.g. 500) in fewer real steps? Model-based can do better by learning from imagined data.
- Explain: the model generates synthetic transitions, so the agent effectively gets more “experience” per real sample.
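The sample-efficiency question in the hints ("which method reaches a given return in fewer real steps?") reduces to reading a threshold crossing off each learning curve. A minimal sketch, using made-up learning curves (the numbers are hypothetical, not measured results):

```python
import numpy as np

def steps_to_return(steps, returns, threshold):
    """First real env step at which mean return reaches the threshold (None if never)."""
    for s, r in zip(steps, returns):
        if r >= threshold:
            return s
    return None

# Hypothetical learning curves: mean return logged every 10k real env steps.
steps = np.arange(10_000, 110_000, 10_000)
ppo_returns     = np.array([50,  90, 140, 200, 270, 340, 400, 450, 490, 520])
dreamer_returns = np.array([80, 180, 320, 470, 600, 700, 760, 800, 830, 850])

print(steps_to_return(steps, ppo_returns, 500))      # 100000
print(steps_to_return(steps, dreamer_returns, 500))  # 50000
```

With these (illustrative) curves the model-based agent crosses return 500 in half the real steps, which is exactly the comparison the exercise asks you to make and explain.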
Common pitfalls
- Comparing by wall-clock time: Model-based often does more compute per step; compare by env steps (or report both).
- Different hyperparameters: Use reasonable defaults for each; document so the comparison is fair.
Worked solution (warm-up: why a learned model helps)
A learned dynamics model generates imagined transitions from each real one, so a single real sample can drive many policy updates; as long as model error stays small over short rollouts, the agent gains more experience (and thus more reward) per real environment step.
Extra practice
- Warm-up: In one sentence, why can a learned model improve sample efficiency?
- Coding: Run PPO on Walker2d for 200k steps and Dreamer (or MBPO) for 200k steps. Plot mean return vs steps. Which reaches 1000 first?
- Challenge: Vary the model rollout length in the model-based method. How does very long rollout length affect performance (think compounding error)?
- Variant: Repeat the comparison with a lower-dimensional environment (e.g. Pendulum). Does the model-based advantage shrink or grow? Why might model learning be easier in low-dimensional spaces?
- Debug: A student writes: “I train the world model once at the start and then only update the policy.” Identify the flaw and describe when this static-model approach would fail badly.
- Conceptual: When is model-based RL not preferred over model-free? Name two conditions (e.g. highly stochastic dynamics, non-stationary reward) that make learning a good model expensive or unreliable.
- Recall: Write down the two components that a typical world model must predict, and state what loss function is used for each.
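For the challenge on rollout length, compounding error can be demonstrated in one dimension. The sketch below uses hypothetical scalar dynamics (true decay 0.9, a slightly biased learned model with 0.92) and tracks how the relative prediction error grows with rollout horizon:

```python
# Sketch (hypothetical 1-D dynamics): a small one-step model bias
# compounds over longer imagined rollouts.

def relative_rollout_error(horizon, true_a=0.9, model_a=0.92, x0=1.0):
    """Relative gap between model and true trajectories after `horizon` steps."""
    x_true, x_model = x0, x0
    for _ in range(horizon):
        x_true, x_model = true_a * x_true, model_a * x_model
    return x_model / x_true - 1.0

for h in (1, 5, 15, 50):
    # Error grows monotonically with horizon: ~2% at h=1, ~200% at h=50.
    print(h, relative_rollout_error(h))
```

This is why model-based methods favor short rollouts branched from real states: a 2% one-step error is tolerable, but after 50 imagined steps the model's prediction is off by a factor of ~3, and a policy trained on such rollouts optimizes the wrong dynamics.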