Chapter 98: Evaluating RL Agents

Learning objectives:

- Train a PPO agent on 10 different random seeds and collect final returns (or mean return over the last N episodes) for each seed.
- Compute the mean and standard deviation of these returns and report them (e.g. “mean ± std”).
- Compute stratified confidence intervals (e.g. using the rliable library or similar) so that the intervals account for both within-run and across-run variance.
- Interpret the results: what does the interval tell us about the agent’s performance and reliability? Why is reporting only mean ± std over seeds often insufficient?
- Relate evaluation practice to robot navigation, healthcare, and trading, where reliable performance estimates matter.

Concept and real-world RL ...

March 10, 2026 · 4 min · 695 words · codefrydev
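The evaluation recipe above (per-seed returns, mean ± std, then a confidence interval for the mean) can be sketched in a few lines of NumPy. The returns below are hypothetical placeholders for the values you would collect from 10 trained PPO runs, and a plain percentile bootstrap stands in for the stratified intervals that rliable computes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical final returns, one per random seed; in practice these come
# from evaluating the trained PPO agent at the end of each of the 10 runs.
returns = np.array([212.0, 198.5, 240.1, 175.3, 220.8,
                    205.4, 189.9, 231.6, 210.2, 201.7])

mean, std = returns.mean(), returns.std(ddof=1)
print(f"mean ± std over seeds: {mean:.1f} ± {std:.1f}")

# Percentile-bootstrap 95% confidence interval for the mean return:
# resample the 10 seed results with replacement many times and take
# the 2.5th and 97.5th percentiles of the resampled means.
boot_means = np.array([
    rng.choice(returns, size=returns.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: [{lo:.1f}, {hi:.1f}]")
```

With only 10 seeds the interval is wide, which is exactly the point: mean ± std alone hides how uncertain the aggregate estimate is.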

Phase 4 Deep RL Quiz

Use this quiz after completing Volumes 3–5 (or the Phase 4 coding challenges). If you can answer at least 9 of 12 correctly, you are ready for Phase 5 and Volume 6.

1. Function approximation

Q: Why is function approximation necessary in RL for large or continuous state spaces?

Answer: Tabular methods store one value per state (or state–action pair); the number of states can be huge or infinite. Function approximation uses a parameterized function (e.g. a neural network) so that a fixed number of parameters represents values for all states and generalizes from seen to unseen states. ...

March 10, 2026 · 4 min · 814 words · codefrydev
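The quiz answer on function approximation can be made concrete with a minimal sketch: a linear value function over hand-chosen features, updated with TD(0). The feature map, transitions, and step sizes below are illustrative assumptions, not part of the quiz; the point is that three parameters cover a continuum of states:

```python
import numpy as np

def features(state: float) -> np.ndarray:
    # Map a continuous state in [0, 1] to a fixed-size feature vector,
    # so the value function is v(s) = w @ features(s).
    return np.array([1.0, state, state ** 2])

w = np.zeros(3)          # 3 parameters instead of one table entry per state
alpha, gamma = 0.1, 0.9  # step size and discount (illustrative choices)

def td_update(s: float, reward: float, s_next: float) -> None:
    """One TD(0) update: move w toward the bootstrapped target."""
    global w
    td_error = reward + gamma * (w @ features(s_next)) - (w @ features(s))
    w += alpha * td_error * features(s)

# Sweep some transitions along a continuous state space.
for s in np.linspace(0.0, 1.0, 50):
    td_update(s, reward=1.0, s_next=min(s + 0.02, 1.0))

# The learned function generalizes to states never seen during training.
print("v(0.137) =", w @ features(0.137))
```

A table could never be queried at the unseen state 0.137; the parameterized function answers immediately, which is the generalization property the quiz answer describes.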