Chapter 92: Safe Reinforcement Learning

Learning objectives

- Formulate a constrained MDP for a self-driving car (or a similar domain): maximize progress (or reward) while keeping collisions (or another cost) below a threshold.
- Implement a Lagrangian method: add a penalty term λ · (constraint violation) to the objective and update the penalty coefficient λ (e.g. increase λ when the constraint is violated) so that the policy satisfies the constraint.
- Explain the trade-off: a higher λ pushes the policy to satisfy the constraint but may reduce task reward; tune λ or use dual ascent.
- Evaluate the policy: report task return and constraint cost (e.g. number of collisions per episode), and verify the constraint is met in evaluation.
- Relate safe RL and constrained MDPs to healthcare (safety constraints) and trading (risk limits).

Concept and real-world RL ...
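The Lagrangian objective in the second bullet can be written as L(π, λ) = E[R] − λ · (E[C] − d), where d is the cost threshold: the policy ascends L while λ rises whenever the constraint E[C] ≤ d is violated. A minimal sketch of this dual-ascent loop, on a hypothetical one-step problem (a "risky" action with high reward but a collision cost, versus a "safe" action), might look like the following; the numbers, the entropy temperature `tau`, and the closed-form inner update are illustrative assumptions, not a full RL implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical one-step constrained MDP: a "risky" action earns reward
# 1.0 but incurs expected cost 1.0 (a collision); a "safe" action earns
# reward 0.5 at zero cost. Constraint: expected cost <= d.
R_RISKY, C_RISKY = 1.0, 1.0
R_SAFE, C_SAFE = 0.5, 0.0
d = 0.2        # cost threshold
tau = 0.1      # entropy temperature (assumed): keeps the inner update smooth
lam = 0.0      # Lagrange multiplier
lr_lam = 0.1   # dual step size

for _ in range(500):
    # Inner step: the entropy-regularized best response to the penalized
    # reward r - lam * c has a closed form (softmax over action values).
    p_risky = sigmoid(((R_RISKY - lam * C_RISKY)
                       - (R_SAFE - lam * C_SAFE)) / tau)
    # Dual ascent: raise lam while the constraint is violated,
    # and clip it at zero so the penalty never becomes a bonus.
    lam = max(0.0, lam + lr_lam * (p_risky * C_RISKY - d))

exp_reward = p_risky * R_RISKY + (1 - p_risky) * R_SAFE
exp_cost = p_risky * C_RISKY
print(f"P(risky)={p_risky:.3f}  reward={exp_reward:.3f}  "
      f"cost={exp_cost:.3f}  lambda={lam:.3f}")
```

The loop converges with the policy taking the risky action just often enough that expected cost sits at the threshold d, which illustrates the trade-off in the third bullet: λ settles at exactly the penalty level that makes further risk unprofitable, sacrificing some task reward to keep the constraint satisfied.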

March 10, 2026 · 4 min · 707 words · codefrydev