Chapter 98: Evaluating RL Agents

Learning objectives

Train a PPO agent on 10 different random seeds and collect final returns (or mean return over the last N episodes) for each seed.
Compute the mean and standard deviation of these returns and report them (e.g. “mean ± std”).
Compute stratified bootstrap confidence intervals (e.g. using the rliable library or similar), which resample runs independently within each task so that the interval reflects run-to-run variability in the reported aggregate.
Interpret the results: what does the interval tell us about the agent’s performance and reliability? Why is reporting only mean ± std over seeds often insufficient?
Relate evaluation practice to robot navigation, healthcare, and trading where reliable performance estimates matter.

Concept and real-world RL

Evaluating RL agents requires multiple runs (seeds) because training is stochastic; a single run can be lucky or unlucky. Reporting mean ± std over seeds is common but can be misleading: the std describes how much individual seeds differ from each other, not how uncertain the estimated mean is, and it ignores the noise in each seed’s own score (e.g. from a finite number of evaluation episodes). Confidence intervals quantify the uncertainty in the reported aggregate directly. The stratified bootstrap used by rliable resamples runs with replacement independently within each task and recomputes the aggregate, giving intervals that aim for nominal coverage (e.g. a 95% CI built by a procedure that contains the true mean performance in about 95% of repeated experiments). In robot navigation, healthcare, and trading, we need reliable estimates before deployment.
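To make the difference between a std and a confidence interval concrete, here is a sketch of the standard textbook quantities. It assumes the \(N\) per-seed returns \(R_1, \dots, R_N\) are independent and roughly normally distributed, which is an assumption added here for illustration; bootstrap intervals avoid the normality assumption.

```latex
\bar{R} = \frac{1}{N}\sum_{i=1}^{N} R_i, \qquad
s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\bigl(R_i - \bar{R}\bigr)^2}, \qquad
\mathrm{SE} = \frac{s}{\sqrt{N}}, \qquad
\mathrm{CI}_{95\%} = \bar{R} \pm t_{0.975,\,N-1}\,\mathrm{SE}
```

Here \(s\) describes the spread of individual seeds, while the interval describes uncertainty about the mean. For \(N = 10\), \(t_{0.975,\,9} \approx 2.262\), so the half-width is about \(0.72\,s\); for \(N = 3\), \(t_{0.975,\,2} \approx 4.303\) and the half-width grows to about \(2.48\,s\).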

Where you see this in practice: rliable library; evaluation in RL papers (multiple seeds, confidence intervals); reporting standards for RL benchmarks.

Illustration (confidence intervals): Over 10 seeds, the final return varies from seed to seed, so the standard practice is to report mean ± std or a stratified bootstrap CI. The chart below shows mean and std of final return over seeds.

Exercise: Train a PPO agent on 10 different random seeds. Compute the mean and standard deviation of final returns. Then compute stratified bootstrap confidence intervals using the rliable library. Interpret the results.

Professor’s hints

Setup: Same env (e.g. CartPole or MuJoCo), same PPO hyperparameters; only the random seed changes (for env, policy init, and any sampling). Train every seed for the same fixed budget (e.g. 500k or 1M steps) rather than stopping each run at a different point.
Final return: For each seed, take the mean return over the last 50 (or 100) evaluation episodes (no exploration). So you have 10 numbers: one per seed.
Mean and std: mean(R), std(R) over the 10 seeds. Report “mean ± std” and optionally “median ± MAD.”
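rliable: Install rliable and use rliable.library.get_interval_estimates, which computes stratified bootstrap confidence intervals for an aggregate function you supply (e.g. aggregate_mean or aggregate_iqm from rliable.metrics). Input: a dictionary mapping each algorithm name to a matrix of scores of shape (runs × tasks), or (runs × 1) for a single environment with one number per run. Output: point estimates and interval estimates per algorithm. Interpret: “We are 95% confident that the true mean return lies in [L, U].” Check the library docs for the exact arguments.
Interpretation: Briefly explain why we need multiple seeds and why a confidence interval is more informative than mean ± std (e.g. the std describes the spread across seeds rather than the uncertainty in the mean, and with few seeds the std itself is poorly estimated).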
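The summary statistics in the hints above can be sketched with the Python standard library alone. This is a minimal illustration, not the rliable implementation: the function name summarize_seeds is hypothetical, the ten returns are made-up numbers, and the interval is a plain percentile bootstrap that resamples seeds with replacement.

```python
import random
import statistics


def summarize_seeds(final_returns, n_boot=10_000, alpha=0.05, rng_seed=0):
    """Summarize one final return per seed.

    Returns the mean, sample std, median, MAD, and a percentile-bootstrap
    CI for the mean obtained by resampling seeds with replacement.
    """
    rng = random.Random(rng_seed)
    n = len(final_returns)
    mean = statistics.fmean(final_returns)
    std = statistics.stdev(final_returns)  # sample std (N - 1 denominator)
    median = statistics.median(final_returns)
    mad = statistics.median([abs(r - median) for r in final_returns])

    # Bootstrap: recompute the mean on many resampled sets of seeds.
    boot_means = sorted(
        statistics.fmean(rng.choices(final_returns, k=n)) for _ in range(n_boot)
    )
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return {"mean": mean, "std": std, "median": median, "mad": mad, "ci": (lo, hi)}


# Hypothetical final returns for 10 seeds (illustrative, not real results).
returns = [480.0, 500.0, 455.0, 500.0, 310.0, 495.0, 500.0, 470.0, 500.0, 440.0]
summary = summarize_seeds(returns)
print(f"mean +/- std:   {summary['mean']:.1f} +/- {summary['std']:.1f}")
print(f"median +/- MAD: {summary['median']:.1f} +/- {summary['mad']:.1f}")
print(f"95% bootstrap CI for the mean: [{summary['ci'][0]:.1f}, {summary['ci'][1]:.1f}]")
```

Note how the single low seed (310) inflates the std while the median and MAD barely move; this is one reason to report a robust statistic alongside the mean.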

Common pitfalls

Too few seeds: Treat 10 as the minimum for this exercise. Papers commonly report 5–10 seeds, but more is better. With 3 seeds, intervals are very wide.
Evaluation episodes: Use a fixed number of eval episodes per seed (e.g. 50) and no exploration (deterministic policy or ε=0) so returns are comparable.
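rliable API: Check the library docs for the exact function and arguments (e.g. get_interval_estimates in rliable.library and the aggregate_* functions in rliable.metrics); the expected input is a dictionary of score matrices of shape (runs × tasks), not (runs × episodes).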
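To see how quickly the "too few seeds" pitfall bites, the following sketch computes the half-width of a 95% Student-t interval for the mean, expressed in units of the sample std. It assumes roughly normal per-seed returns, and the critical values are hardcoded from a standard t-table because the Python standard library has no t distribution; the names T_CRIT_95 and ci_half_width_in_stds are hypothetical.

```python
import math

# Two-sided 95% critical values t_{0.975, N-1} from a standard t-table,
# keyed by the number of seeds N.
T_CRIT_95 = {3: 4.303, 5: 2.776, 10: 2.262, 20: 2.093}


def ci_half_width_in_stds(n_seeds):
    """Half-width of the 95% t-interval for the mean, in units of the sample std."""
    return T_CRIT_95[n_seeds] / math.sqrt(n_seeds)


for n in sorted(T_CRIT_95):
    print(f"{n:2d} seeds: CI half-width = {ci_half_width_in_stds(n):.2f} x std")
# Prints 2.48, 1.24, 0.72, and 0.47 for 3, 5, 10, and 20 seeds respectively.
```

Under these assumptions, the interval is wider than ± std for 3 or 5 seeds and narrower than ± std from 10 seeds onward.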

Worked solution (warm-up: evaluation and CIs)

Key idea: We run \(N\) seeds and get returns per episode. We report mean and a confidence interval (e.g. 95% CI via bootstrap or standard error). The rliable library provides stratified bootstrap CIs that are more robust for RL (skewed, heavy-tailed returns). So we can say “algorithm A: 100 ± 10 (95% CI [82, 118])” and compare algorithms with statistical rigor. Always report multiple seeds; one run is not enough.
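The idea behind a stratified bootstrap can be sketched by hand. The code below is an illustration of the technique under stated assumptions, not rliable's own code (in rliable the corresponding entry point is get_interval_estimates): each task is treated as a stratum, runs are resampled with replacement independently within each task, and a percentile interval is taken over the recomputed aggregate. The function name stratified_bootstrap_ci and the score matrix are hypothetical.

```python
import random
import statistics


def stratified_bootstrap_ci(scores, aggregate, n_boot=5_000, alpha=0.05, rng_seed=0):
    """Percentile CI for aggregate(scores) via a stratified bootstrap.

    scores: list of runs, each a list with one score per task (runs x tasks).
    Runs are resampled with replacement independently within each task,
    then the aggregate is recomputed on the pooled resampled scores.
    """
    rng = random.Random(rng_seed)
    n_runs, n_tasks = len(scores), len(scores[0])
    # One column (stratum) per task, holding that task's score for every run.
    columns = [[run[t] for run in scores] for t in range(n_tasks)]

    stats = []
    for _ in range(n_boot):
        resampled = [rng.choices(col, k=n_runs) for col in columns]
        stats.append(aggregate([x for col in resampled for x in col]))
    stats.sort()

    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    point = aggregate([x for col in columns for x in col])
    return point, (lo, hi)


# Hypothetical normalized scores: 5 runs x 2 tasks (illustrative only).
scores = [[0.9, 0.4], [1.0, 0.5], [0.8, 0.3], [1.1, 0.6], [0.7, 0.2]]
point, (lo, hi) = stratified_bootstrap_ci(scores, statistics.fmean)
print(f"mean = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

For a single environment, pass a (runs × 1) matrix such as [[r] for r in returns]; the procedure then reduces to an ordinary bootstrap over seeds.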

Extra practice

Warm-up: Why might “mean ± std over 5 seeds” be misleading when each seed’s return is the mean of 50 evaluation episodes?
Coding: Train PPO on CartPole for 5 seeds, 200k steps each. For each seed, record mean return over last 50 eval episodes. Compute mean ± std. Then use rliable (or bootstrap by seed) to get a 95% CI. Report both. How does the width of the CI compare with the ± std range, and why are the two not interchangeable?
Challenge: Compare interval estimation across algorithms: train PPO and SAC (or DQN) each on 10 seeds. Compute 95% CIs for both. Do the intervals overlap? What can you conclude about which algorithm is better?
Variant: Reduce the number of seeds from 10 to 3. How much do the 95% CIs widen? At what seed count do the CIs become so wide that the comparison between PPO and the second algorithm (SAC or DQN) is inconclusive?
Debug: A paper reports “PPO achieves 95% of human-level performance on 10 Atari games.” The metric is mean normalized score across games, but 3 games have very high variance (scores range from 0 to 10,000). These high-variance games dominate the mean. Describe how to use median normalized score and interquartile mean (IQM) to produce a more robust comparison, and why rliable recommends IQM in particular (the median ignores most of the data and is itself a high-variance estimate).
Conceptual: Reproducibility in RL is challenging because results depend on hyperparameters, code implementations, and random seeds. Describe the minimum reporting standard you would apply to an RL paper: what statistics, how many seeds, and which evaluation protocol (e.g. best checkpoint vs final checkpoint). Why does the choice of evaluation protocol matter as much as the algorithm itself?
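For the Debug exercise above, the following sketch shows how mean, median, and IQM react to a few extreme scores. It is a simplified illustration: the function interquartile_mean is hypothetical, it only handles inputs whose length is divisible by 4 so that the 25% trim on each side is exact, and the scores are made-up numbers.

```python
import statistics


def interquartile_mean(scores):
    """Mean of the middle 50% of scores (drop the lowest and highest 25%).

    Simplified sketch: requires len(scores) divisible by 4 so the trim is exact.
    """
    n = len(scores)
    if n % 4 != 0:
        raise ValueError("this sketch needs len(scores) divisible by 4")
    k = n // 4
    middle = sorted(scores)[k : n - k]
    return statistics.fmean(middle)


# Hypothetical normalized scores pooled over runs and games; two are outliers.
scores = [0.0, 0.1, 0.2, 0.5, 0.6, 0.6, 0.7, 0.9, 1.0, 1.1, 9.0, 14.0]
print(f"mean   = {statistics.fmean(scores):.3f}")   # dominated by the outliers
print(f"median = {statistics.median(scores):.3f}")  # uses only the two middle scores
print(f"IQM    = {interquartile_mean(scores):.3f}")  # averages the middle half
```

The mean is pulled far above typical performance by two scores, while the median and IQM stay near the bulk of the data; the IQM does so while still averaging half of the scores.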