Chapter 51: Model-Free vs. Model-Based RL

Learning objectives

- Compare model-free (e.g. PPO) and model-based (e.g. Dreamer) RL in terms of sample efficiency on a continuous control task like Walker.
- Explain why model-based methods can achieve more reward per real environment step (via imagined rollouts).
- Identify the trade-offs: model bias, computation, and implementation complexity.

Concept and real-world RL

Model-free methods learn a policy or value function directly from experience; model-based methods learn a dynamics model and use it for planning or for imagined rollouts. Model-based RL can be more sample-efficient because each real transition can be reused many times through the model (short rollouts, planning). In robot navigation and trading, where real data is expensive, sample efficiency matters; in game AI, model-based methods (e.g. MuZero) combine learning and planning. The downside is model error (which compounds over long rollouts) and extra computation. ...
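The sample-efficiency argument can be made concrete with a toy sketch. All names and dynamics below are hypothetical, not from the chapter: a single real transition seeds many short imagined rollouts through a (slightly biased) learned model, so the agent obtains extra training data without extra environment steps.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_step(s, a):
    # Toy 1-D environment dynamics, unknown to the agent.
    return 0.9 * s + a

def learned_model(s, a):
    # Slightly biased learned dynamics (illustrating model error).
    return 0.88 * s + a

# One real transition collected from the environment...
s0, a0 = 1.0, 0.1
s1 = true_step(s0, a0)

# ...reused to generate several short imagined rollouts.
imagined = []
for _ in range(10):
    s = s1
    traj = [s]
    for _ in range(5):  # short horizon limits compounding model error
        a = rng.normal(0.0, 0.1)  # exploratory imagined action
        s = learned_model(s, a)
        traj.append(s)
    imagined.append(traj)

print(len(imagined), len(imagined[0]))  # → 10 6
```

One real step has produced 50 imagined transitions; the short horizon is the usual hedge against the model bias mentioned above.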

March 10, 2026 · 3 min · 446 words · codefrydev

Chapter 56: MuZero Intuition

Learning objectives

- Read a MuZero paper summary and explain how MuZero learns a model in latent space without access to the true environment dynamics.
- Explain how MuZero handles reward prediction and value prediction in the latent space.
- Contrast with AlphaZero (which uses the true game rules).

Concept and real-world RL

MuZero learns a latent dynamics model: instead of predicting the raw next state, it predicts the next latent state and (optionally) the reward and value. The “model” is thus learned end-to-end for the purpose of planning; it does not need to reconstruct the true state. This lets MuZero work in video games and other domains where the rules are unknown. In game AI, MuZero achieves strong results on Atari and board games without hand-coded dynamics. ...
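The structure above can be sketched as MuZero's three functions. This is a minimal toy with random weights (all names and shapes are assumptions, and real MuZero trains these networks end-to-end): a representation h maps an observation to a latent state, a dynamics function g predicts the next latent and a reward (never a raw observation), and a prediction function f outputs policy logits and a value from a latent.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT, ACTIONS, OBS = 4, 3, 8

W_h = rng.normal(size=(LATENT, OBS))               # representation weights
W_g = rng.normal(size=(LATENT, LATENT + ACTIONS))  # latent dynamics weights
w_r = rng.normal(size=LATENT + ACTIONS)            # reward head
W_p = rng.normal(size=(ACTIONS, LATENT))           # policy head
w_v = rng.normal(size=LATENT)                      # value head

def h(obs):
    # Representation: observation -> latent state.
    return np.tanh(W_h @ obs)

def g(latent, action):
    # Dynamics: (latent, action) -> (next latent, predicted reward).
    x = np.concatenate([latent, np.eye(ACTIONS)[action]])
    return np.tanh(W_g @ x), float(w_r @ x)

def f(latent):
    # Prediction: latent -> (policy logits, value).
    return W_p @ latent, float(w_v @ latent)

# Unroll the learned model for planning: no true game rules needed,
# which is exactly where MuZero departs from AlphaZero.
latent = h(rng.normal(size=OBS))
for action in [0, 2, 1]:
    latent, reward = g(latent, action)
    logits, value = f(latent)

print(latent.shape)  # → (4,)
```

Note that nothing here decodes the latent back into an observation: the model is judged only on how well its reward, value, and policy predictions support planning.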

March 10, 2026 · 3 min · 468 words · codefrydev

Chapter 60: Visualizing Model-Based Rollouts

Learning objectives

- For a learned dynamics model (e.g. from Chapter 52), sample a starting state and generate a rollout of predicted states for a fixed action sequence.
- Plot the true states (from the environment) and the predicted states (from the model) on the same axes to visualize compounding error.
- Interpret the plot: where does the model diverge from reality?

Concept and real-world RL

Visualizing model rollouts against real rollouts makes compounding error concrete: small one-step errors accumulate and the predicted trajectory drifts. In robot navigation and model-based RL, this motivates short rollouts, ensemble methods, and uncertainty-aware planning. The same idea applies to trading models (predictions diverge over time) and to dialogue (conversation dynamics). ...
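The rollout comparison can be sketched with toy dynamics (all functions and constants here are assumptions for illustration, not the Chapter 52 model): roll the true environment and the learned model forward under the same fixed action sequence, with the model feeding on its own predictions, and watch the gap grow. Plotting `true_traj` and `pred_traj` on the same axes with matplotlib shows the drift directly.

```python
import numpy as np

def true_step(s, a):
    # Toy 1-D environment dynamics.
    return 0.95 * s + 0.1 * a

def model_step(s, a):
    # Learned model with a small one-step bias.
    return 0.97 * s + 0.1 * a

actions = np.ones(30)  # fixed action sequence
s_true = s_pred = 1.0
true_traj, pred_traj = [s_true], [s_pred]
for a in actions:
    s_true = true_step(s_true, a)
    s_pred = model_step(s_pred, a)  # model rolls out on its own predictions
    true_traj.append(s_true)
    pred_traj.append(s_pred)

err = np.abs(np.array(true_traj) - np.array(pred_traj))
print(err[1] < err[10] < err[30])  # → True: error compounds over the horizon
```

The one-step error here is tiny (0.02 after the first step), yet the open-loop rollout drifts steadily, which is why short horizons and uncertainty-aware planning help.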

March 10, 2026 · 3 min · 466 words · codefrydev