Chapter 99: Debugging RL Code
Learning objectives
- Take a broken RL implementation (e.g. a SAC agent that does not learn, or converges to a poor return) and diagnose the issue systematically.
- Write unit tests for the environment (step returns correct shapes, reset works, the reward is bounded), the replay buffer (sample returns the correct batch shape; storage and sampling are consistent), and gradient shapes (the backward pass of the critic loss produces gradients of the right shape).
- Add logging for Q-values (min, max, mean), rewards (per step and per episode), and entropy (or log_prob) so you can spot numerical issues, policy collapse, or scale problems.
- Identify the root cause (e.g. a wrong sign, a wrong target, the learning rate, or the reward scale) and fix it.
- Relate debugging practice to domains such as robot navigation and healthcare, where bugs can be costly.

Concept and real-world RL ...
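The environment and replay-buffer checks above can be sketched as small unit tests. The `ToyEnv` and `ReplayBuffer` below are hypothetical minimal stand-ins, written only to show the shape of the tests; in practice you would run the same assertions against your real environment and buffer.

```python
import numpy as np

# Hypothetical 1-D toy environment; stands in for your real env.
class ToyEnv:
    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0
        self.state = np.zeros(2, dtype=np.float32)

    def reset(self):
        self.t = 0
        self.state = np.zeros(2, dtype=np.float32)
        return self.state.copy()

    def step(self, action):
        self.t += 1
        self.state = self.state + np.asarray(action, dtype=np.float32)
        # reward is bounded in [-10, 0] by construction
        reward = float(np.clip(-np.abs(self.state).sum(), -10.0, 0.0))
        done = self.t >= self.horizon
        return self.state.copy(), reward, done, {}

# Hypothetical ring-buffer replay memory.
class ReplayBuffer:
    def __init__(self, capacity, obs_dim, act_dim):
        self.capacity, self.size, self.ptr = capacity, 0, 0
        self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.act = np.zeros((capacity, act_dim), dtype=np.float32)
        self.rew = np.zeros(capacity, dtype=np.float32)

    def add(self, obs, act, rew):
        self.obs[self.ptr], self.act[self.ptr], self.rew[self.ptr] = obs, act, rew
        self.ptr = (self.ptr + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, rng):
        idx = rng.integers(0, self.size, size=batch_size)
        return self.obs[idx], self.act[idx], self.rew[idx]

def test_env_step_shapes_and_reward_bounds():
    env = ToyEnv()
    obs = env.reset()
    assert obs.shape == (2,)
    next_obs, reward, done, info = env.step(np.array([0.1, -0.1]))
    assert next_obs.shape == obs.shape
    assert -10.0 <= reward <= 0.0          # reward bounded by design
    assert isinstance(done, bool)

def test_reset_restores_initial_state():
    env = ToyEnv()
    env.reset()
    env.step(np.array([1.0, 1.0]))
    obs = env.reset()
    assert np.allclose(obs, 0.0) and env.t == 0

def test_buffer_sample_shapes_and_consistency():
    buf = ReplayBuffer(capacity=8, obs_dim=2, act_dim=2)
    # store transitions where obs == i, act == -i, rew == i
    for i in range(5):
        buf.add(np.full(2, i), np.full(2, -i), float(i))
    obs, act, rew = buf.sample(4, np.random.default_rng(0))
    assert obs.shape == (4, 2) and act.shape == (4, 2) and rew.shape == (4,)
    # each sampled transition must match what was stored together
    for o, a, r in zip(obs, act, rew):
        assert np.allclose(o, r) and np.allclose(a, -r)

test_env_step_shapes_and_reward_bounds()
test_reset_restores_initial_state()
test_buffer_sample_shapes_and_consistency()
print("all env/buffer tests passed")
```

The storage/sampling consistency test is the one that most often catches real bugs: by encoding a known relationship between `obs`, `act`, and `rew` at insertion time, it detects off-by-one pointer errors that scramble which fields belong to which transition.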
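The gradient-shape check can look like the following, assuming PyTorch. The tiny critic here is a hypothetical stand-in for the Q-network in your SAC code; the point is the assertions, which catch two common bugs at once: a silently broadcast loss (e.g. a `(B,)` prediction against a `(B, 1)` target producing a `(B, B)` error matrix) and parameters that never receive gradients because they are detached from the graph.

```python
import torch
import torch.nn as nn

# Hypothetical tiny critic; stands in for the Q-network in your SAC code.
critic = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))

def test_critic_backward_gradient_shapes():
    obs_act = torch.randn(8, 4)     # batch of concatenated (state, action) inputs
    target = torch.randn(8, 1)      # fake TD targets; shape must match q exactly
    q = critic(obs_act)
    assert q.shape == (8, 1), "critic output has unexpected shape"
    loss = nn.functional.mse_loss(q, target)   # shapes match, so no broadcasting
    loss.backward()
    for name, p in critic.named_parameters():
        assert p.grad is not None, f"no gradient reached parameter {name}"
        assert p.grad.shape == p.shape, f"gradient shape mismatch for {name}"

test_critic_backward_gradient_shapes()
print("gradient shape test passed")
```

A useful variant is to run the same test with a deliberately mismatched target shape and confirm that your training code raises or warns rather than silently broadcasting.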
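The logging objective can be sketched as a small diagnostics helper. This is a minimal sketch, not any particular library's API: `Diagnostics`, `log_step`, and `log_episode_end` are illustrative names, and the warning thresholds are placeholder values you would tune for your own reward and Q-value scales.

```python
from collections import deque

import numpy as np

# Minimal sketch of a diagnostics logger; all names and thresholds are
# illustrative, not taken from any particular RL library.
class Diagnostics:
    def __init__(self, window=100):
        self.episode_returns = deque(maxlen=window)
        self._current_return = 0.0

    def log_step(self, step, reward, q_values, log_probs):
        self._current_return += reward
        q = np.asarray(q_values, dtype=np.float64)
        entropy = -float(np.mean(log_probs))   # sample-based entropy estimate
        stats = {
            "step": step,
            "reward": float(reward),
            "q_min": float(q.min()),
            "q_max": float(q.max()),
            "q_mean": float(q.mean()),
            "entropy": entropy,
        }
        # red flags worth checking on every step while debugging:
        assert np.isfinite(q).all(), "NaN/inf in Q-values"
        if abs(stats["q_mean"]) > 1e4:   # placeholder threshold
            print(f"WARNING step {step}: Q-value scale exploding "
                  f"(mean {stats['q_mean']:.1f})")
        if entropy < 1e-3:               # placeholder threshold
            print(f"WARNING step {step}: entropy near zero "
                  f"(possible policy collapse)")
        return stats

    def log_episode_end(self):
        self.episode_returns.append(self._current_return)
        ret, self._current_return = self._current_return, 0.0
        return {"episode_return": ret,
                "mean_return_100": float(np.mean(self.episode_returns))}

# Usage: inside the training loop you would call something like
#   stats = diag.log_step(step, reward, q_values=q_batch, log_probs=logp_batch)
# on every environment step, and diag.log_episode_end() when done is True.
diag = Diagnostics()
print(diag.log_step(0, -0.5, q_values=[0.1, 0.2], log_probs=[-1.0, -1.2]))
```

Tracking min, max, and mean of the Q-values together is what makes scale problems visible: a healthy run has all three drifting slowly in the same range, while a wrong-sign or wrong-target bug typically sends the max diverging long before the mean does.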