Chapter 6: The Bellman Equations
Learning objectives

- Derive the Bellman optimality equation for \(Q^*(s,a)\) from the definition of the optimal action value.
- Contrast the optimality equation (max over actions) with the expectation equation (average over actions under \(\pi\)).
- Explain why the optimality equations are nonlinear and how algorithms (e.g., value iteration) handle them.

Concept and real-world RL

The optimal action-value function \(Q^*(s,a)\) is the expected return from state \(s\), taking action \(a\), then acting optimally thereafter. The Bellman optimality equation for \(Q^*\) states that

\[ Q^*(s,a) = \mathbb{E}\big[\, r + \gamma \max_{a'} Q^*(s',a') \,\big|\, s, a \,\big], \]

i.e., \(Q^*(s,a)\) equals the expected immediate reward plus \(\gamma\) times the maximum over next-state action values (not an average under a policy). This "max" makes the system nonlinear: the optimal policy is greedy with respect to \(Q^*\), and \(Q^*\) is the fixed point of this equation. Value iteration and Q-learning are built on this; in practice, we approximate \(Q^*\) with tables or function approximators. ...
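To make the fixed-point idea concrete, here is a minimal sketch of Q-value iteration: repeatedly applying the Bellman optimality backup to a table until it converges to \(Q^*\). The tiny two-state, two-action MDP (`P`, `R`, `gamma`) is an assumption invented for illustration, not from the text.

```python
import numpy as np

# Hypothetical toy MDP (2 states, 2 actions), made up for illustration.
# P[s, a, s'] = transition probability; R[s, a] = expected immediate reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
    [[0.0, 1.0], [0.5, 0.5]],   # transitions from state 1
])
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])
gamma = 0.9

def q_value_iteration(P, R, gamma, tol=1e-8):
    """Iterate the Bellman optimality backup until Q converges to Q*."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        # Bellman optimality backup: R(s,a) + gamma * E_{s'}[max_{a'} Q(s',a')]
        V = Q.max(axis=1)            # max over next actions (the nonlinearity)
        Q_new = R + gamma * P @ V    # expectation over next states s'
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

Q_star = q_value_iteration(P, R, gamma)
greedy_policy = Q_star.argmax(axis=1)  # optimal policy is greedy w.r.t. Q*
```

Note how the `max` inside the backup is exactly what distinguishes the optimality equation from the expectation equation; replacing it with an average under a fixed \(\pi\) would make the system linear and solvable in closed form.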