Chapter 59: Probabilistic Ensembles with Trajectory Sampling (PETS)

Learning objectives

Implement PETS: an ensemble of probabilistic dynamics models (each outputting a mean and variance for the next state), trajectory sampling through the ensemble, and an action-sequence optimizer (e.g. random shooting or CEM) to select actions via model predictive control (MPC).
Use the model to evaluate action sequences and pick the best (no policy network).
Apply to a continuous control task and compare with a policy-based method.

Concept and real-world RL

PETS uses an ensemble of probabilistic models to capture uncertainty. At each step it samples many action sequences, rolls them out in the model, and selects the sequence with the best predicted return. It then executes the first action of that sequence and replans at the next step (MPC). No policy network is trained; action selection is planning at test time. In robot control, MPC with learned models is used when the computation budget at deployment allows it; in trading, short-horizon planning with a learned model may improve decisions.
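The planning loop described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference PETS implementation: `toy_model` is a hypothetical hand-written stand-in for a learned model, the "real" environment is assumed to follow the same dynamics, and the function names, action bounds, and sample counts are illustrative choices.

```python
import numpy as np

def toy_model(states, actions):
    """Hypothetical stand-in for a learned model (NOT a trained network).
    states, actions: arrays of shape (N, 1). Returns next states and rewards."""
    next_states = states + 0.1 * actions
    rewards = -np.square(next_states[:, 0])  # closer to the origin is better
    return next_states, rewards

def random_shooting(model, state, horizon=10, n_samples=500, rng=None):
    """Return the first action of the best of n_samples random action sequences."""
    rng = np.random.default_rng() if rng is None else rng
    # Candidate sequences, shape (n_samples, horizon, action_dim), uniform in [-1, 1].
    actions = rng.uniform(-1.0, 1.0, size=(n_samples, horizon, 1))
    states = np.repeat(state[None, :], n_samples, axis=0)
    returns = np.zeros(n_samples)
    for t in range(horizon):
        states, rewards = model(states, actions[:, t])
        returns += rewards  # predicted sum of rewards per sequence
    best = np.argmax(returns)
    return actions[best, 0]  # MPC: keep only the first action

def run_mpc(n_steps=50, seed=0):
    """Plan, execute the first action, and replan at every step."""
    rng = np.random.default_rng(seed)
    state = np.array([1.0])
    for _ in range(n_steps):
        action = random_shooting(toy_model, state, rng=rng)
        # Assumption: the real environment behaves exactly like toy_model.
        next_states, _ = toy_model(state[None, :], action[None, :])
        state = next_states[0]
    return state

final_state = run_mpc()
print("final position:", final_state[0])
```

In a full implementation the call to `toy_model` inside `random_shooting` would be replaced by the learned ensemble, and the last call by a step in the actual environment.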

Where you see this in practice: PETS (Chua et al.); robotics MPC with learned models.

Illustration (PETS return): PETS uses an ensemble of dynamics models and trajectory sampling for MPC. [Chart: typical return vs. environment steps on a continuous control task.]

Exercise: Implement PETS for a continuous control task. Use an ensemble of probabilistic neural networks, sample trajectories through the ensemble, and optimize action sequences (e.g. with random shooting) to select actions via MPC.

Professor’s hints

Probabilistic model: output (mean, variance) for the next state (and reward). Train by minimizing the Gaussian negative log-likelihood of observed transitions. Ensemble: train K models; for rollout, sample one model per trajectory or sample from the ensemble prediction.
Random shooting: sample N action sequences (e.g. H=10 steps, each action random or from a prior). Roll out each in the model; compute predicted sum of rewards. Pick the sequence with highest sum; execute first action (or first few); replan.
CEM: alternatively, iteratively refine a distribution over action sequences: sample, keep the elite fraction, refit the distribution to the elites, and resample.
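The CEM hint can be sketched as follows. This is a minimal version, assuming a diagonal Gaussian over action sequences and actions clipped to [-1, 1]; `cem_plan`, `return_fn`, and `toy_return` are hypothetical names. In PETS, `return_fn` would roll each sequence out in the learned ensemble and return its predicted sum of rewards; here a simple quadratic stands in for it.

```python
import numpy as np

def cem_plan(return_fn, horizon=10, action_dim=1, n_samples=200,
             elite_frac=0.1, n_iters=5, rng=None):
    """Cross-entropy method over action sequences (diagonal Gaussian)."""
    rng = np.random.default_rng() if rng is None else rng
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    n_elite = max(1, int(elite_frac * n_samples))
    for _ in range(n_iters):
        # Sample candidate sequences from the current Gaussian.
        noise = rng.standard_normal((n_samples, horizon, action_dim))
        samples = np.clip(mean + std * noise, -1.0, 1.0)
        returns = return_fn(samples)  # shape (n_samples,)
        # Keep the sequences with the highest predicted return.
        elite = samples[np.argsort(returns)[-n_elite:]]
        # Refit the Gaussian to the elites; the small constant avoids zero std.
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean  # execute mean[0] under MPC, then replan

def toy_return(sequences):
    """Hypothetical objective: best when every action equals 0.5."""
    return -np.sum((sequences - 0.5) ** 2, axis=(1, 2))

plan = cem_plan(toy_return, rng=np.random.default_rng(0))
print("first planned action:", plan[0])
```

Compared with random shooting, the sampling budget is spent in several smaller rounds, each concentrated around the sequences that scored well in the previous round.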

Common pitfalls

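Curse of dimensionality: Random shooting over long horizons and high-dimensional action spaces is inefficient, because the number of samples needed to cover the sequence space grows quickly with both. Use a short horizon (5–15 steps) or the cross-entropy method (CEM) to focus samples on promising regions.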
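Model uncertainty: Ensemble members give different predictions. Choose one scheme and apply it consistently: either average the members' predictions, or sample one member per trajectory and keep it fixed for the whole rollout.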
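A sketch of the "one model per trajectory" scheme, under loud assumptions: the ensemble here is a hypothetical set of K scalar linear models with slightly different coefficients (not trained networks), and the reward is a toy quadratic. Each candidate action sequence is evaluated with several particles, each particle is tied to one ensemble member for its whole rollout, and the particle returns are averaged.

```python
import numpy as np

def make_toy_ensemble(k=5, seed=0):
    """Hypothetical ensemble: member i predicts s' = a[i] * s + b[i] * u."""
    rng = np.random.default_rng(seed)
    a = 1.0 + 0.05 * rng.standard_normal(k)
    b = 0.1 + 0.01 * rng.standard_normal(k)
    return a, b

def rollout_one_model_per_trajectory(ensemble, state, actions, rng):
    """actions: shape (n_traj, horizon). Returns per-trajectory returns and member ids."""
    a, b = ensemble
    n_traj, horizon = actions.shape
    # Draw one ensemble member per trajectory and keep it fixed for all steps.
    member = rng.integers(0, len(a), size=n_traj)
    s = np.full(n_traj, state, dtype=float)
    returns = np.zeros(n_traj)
    for t in range(horizon):
        s = a[member] * s + b[member] * actions[:, t]
        returns += -s ** 2  # toy reward: stay near the origin
    return returns, member

rng = np.random.default_rng(0)
ensemble = make_toy_ensemble()
n_candidates, n_particles, horizon = 100, 5, 10
candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
# Repeat each candidate so it is evaluated by several particles.
particles = np.repeat(candidates, n_particles, axis=0)
returns, member = rollout_one_model_per_trajectory(ensemble, 1.0, particles, rng)
# Average over the particles that share a candidate sequence.
scores = returns.reshape(n_candidates, n_particles).mean(axis=1)
best_sequence = candidates[np.argmax(scores)]
```

Keeping the member fixed means each imagined trajectory follows one self-consistent set of dynamics, while the average over particles reflects disagreement between members.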

Worked solution (warm-up: ensemble model)

Key idea: An ensemble of models gives a distribution over next state (and reward). We can use the mean prediction, or sample one model per rollout to get diverse imagined trajectories. Uncertainty (variance across models) can be used to avoid long rollouts in uncertain regions (e.g. only plan where models agree). This reduces the risk of compounding error from an overconfident wrong model.
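The quantities in this key idea can be written down directly. The sketch below assumes each ensemble member outputs a diagonal Gaussian (mean and variance per state dimension) and treats the ensemble as a uniform mixture; the function names and the array layout (K, N, D) are illustrative choices, not part of any particular library.

```python
import numpy as np

def gaussian_nll(target, mean, var):
    """Negative log-likelihood of a diagonal Gaussian, summed over the last axis.
    This is the per-sample training loss for one probabilistic model."""
    return 0.5 * np.sum(np.log(2.0 * np.pi * var) + (target - mean) ** 2 / var, axis=-1)

def ensemble_moments(means, variances):
    """means, variances: arrays of shape (K, N, D) from K ensemble members.
    Returns the mixture mean, total variance, and between-model disagreement."""
    mean = means.mean(axis=0)
    disagreement = means.var(axis=0)     # variance across models
    noise = variances.mean(axis=0)       # average predicted noise
    total_var = noise + disagreement     # law of total variance
    return mean, total_var, disagreement

# Two hypothetical members that disagree about a 1-D next state.
means = np.array([[[0.0]], [[2.0]]])     # shape (K=2, N=1, D=1)
variances = np.ones_like(means)
mean, total_var, disagreement = ensemble_moments(means, variances)
print(mean[0, 0], total_var[0, 0], disagreement[0, 0])  # 1.0 2.0 1.0
```

A planner could, for example, truncate a rollout or penalize its return when `disagreement` exceeds a threshold; the threshold itself would need to be tuned per task.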

Extra practice

Warm-up: What is the difference between PETS and MBPO in terms of how the model is used (planning vs training a policy)?
Coding: Implement PETS for Pendulum with horizon H=10 and 500 random action sequences per step. Plot return vs step. How does it compare to SAC with the same number of env steps?
Challenge: Replace random shooting with CEM: maintain a Gaussian over action sequences; sample, evaluate, keep top 10%; update Gaussian; repeat for 5 iterations. Does CEM improve over random shooting?
Variant: Vary the planning horizon H ∈ {5, 10, 20} in PETS. How does planning quality (return per step) change? At what horizon does computation become prohibitive?
Debug: PETS with random shooting performs well on average but has very high variance per step. The action sequences are all sampled independently. What change to the sampling strategy (e.g. time-correlated noise, action smoothing) would reduce this variance?
Conceptual: PETS replans at every timestep from scratch. How does this differ from policy-based methods that amortize planning into a neural network? What is the trade-off in online computation and adaptability to model changes?