Learning objectives
- Implement a simplified Dreamer-style algorithm: train an RSSM-like model on collected trajectories, then roll out in latent space to train an actor-critic.
- Understand the imagination phase: no real env steps; only latent rollouts for policy updates.
- Relate to robot control and sample-efficient RL.
Concept and real-world RL
Dreamer learns a recurrent state-space model (RSSM) in latent space: encode observation to latent, predict next latent given action, predict reward and continue. The actor-critic is trained on imagined rollouts (latent only), so many gradient steps use no real env interaction. In robot navigation and game AI, this yields high sample efficiency. The key is training the model and the policy on the same data so the latent space is useful for control.
Where you see this in practice: Dreamer v1/v2/v3; used in benchmarks and robotics research.
Illustration (Dreamer imagination): In the latent space, the model rolls out imagined trajectories to train the actor-critic. The chart below shows episode return (real env) over training with imagination updates.
Exercise: Implement a simplified version of the Dreamer algorithm’s “imagination” phase: train an RSSM-like model on collected trajectories, then roll out trajectories in the latent space to train an actor-critic.
Professor’s hints
- RSSM: encoder o_t → z_t; recurrent state h_t; transition (h_t, z_t, a_t) → h_{t+1}, z_{t+1}; reward and continue heads from (h,z). Train on (o,a,r) sequences with reconstruction and prediction losses.
- Imagination: from a latent (h,z) sampled from the buffer, rollout K steps using the model (no env). Compute lambda-return or GAE in latent space; update actor-critic on these imagined trajectories.
- Simplified: you can use a simpler dynamics model (e.g. MLP next latent) instead of full RSSM; the idea is the same.
Common pitfalls
- Distribution shift: Imagined states may diverge from real latent distribution. Dreamer uses short rollouts and reconditions on real data periodically.
- Value in latent space: The critic must be trained on imagined returns; use TD or GAE in the latent rollout.
Worked solution (warm-up: Dreamer)
Extra practice
- Warm-up: Why is training the actor on imagined rollouts sample-efficient?
- Coding: Train a simple latent model (encoder + next-latent predictor) on CartPole. Roll out 10 steps in latent space from real encodings. Compare imagined rewards with actual rewards from the env for the same actions.
- Challenge: Implement a full imagination phase: 50 latent steps, actor-critic update. Compare sample efficiency (return vs env steps) with PPO on CartPole.
- Variant: Vary the imagination horizon from 5 to 20 to 50 steps. Plot actor performance (real env return) vs imagination horizon. Is there a sweet spot, and does it shift with world-model quality?
- Debug: The Dreamer-style actor improves in imagination but real-world return stays flat. The world model is frozen after 10k steps and never updated further. Explain why a stale world model breaks the actor and what the correct training loop looks like.
- Conceptual: Why can imagination-based training fail in environments with multimodal or chaotic dynamics (e.g. contact-rich robotics)? What property must the latent world model have for imagined trajectories to be useful for policy improvement?