Chapter 3: Markov Decision Processes (MDPs)
Learning objectives
- Define an MDP: states, actions, transition probabilities, and rewards.
- Write transition probability matrices \(P(s' \mid s, a)\) for a small MDP.
- Recognize the Markov property: the next state and reward depend only on the current state and action.

Concept and real-world RL
A Markov Decision Process (MDP) is the standard mathematical model for RL: a set of states, a set of actions, transition dynamics \(P(s', r \mid s, a)\) (which encode both the next-state probabilities and the rewards), and a discount factor. The Markov property says that the future (next state and reward) depends only on the current state and action, not on earlier history. That allows us to plan using the current state alone. Real-world examples include board games (state = board position), robot navigation (state = position and velocity), and queue control (state = queue lengths). Writing out \(P\) and reward tables for a tiny MDP is the first step toward value iteration and policy iteration. ...
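To make this concrete, here is a minimal sketch of a tiny MDP written out as an explicit table. The two states, two actions, and all numbers are illustrative assumptions, not taken from the chapter; the point is only to show the shape of \(P(s', r \mid s, a)\) and how to sanity-check it.

```python
# A tiny two-state, two-action MDP written out explicitly.
# P[s][a] is a list of (next_state, reward, probability) outcomes,
# i.e. a tabular form of P(s', r | s, a). All values are made up.
P = {
    "s0": {
        "stay": [("s0", 0.0, 0.9), ("s1", 1.0, 0.1)],
        "go":   [("s0", 0.0, 0.2), ("s1", 1.0, 0.8)],
    },
    "s1": {
        "stay": [("s1", 0.0, 1.0)],
        "go":   [("s0", 5.0, 0.7), ("s1", 0.0, 0.3)],
    },
}

def check_normalized(P):
    """For every (state, action) pair, outcome probabilities must sum to 1."""
    for s, actions in P.items():
        for a, outcomes in actions.items():
            total = sum(p for _, _, p in outcomes)
            assert abs(total - 1.0) < 1e-9, (s, a, total)

def expected_reward(P, s, a):
    """One-step expected reward r(s, a) = sum over outcomes of p * r."""
    return sum(p * r for _, r, p in P[s][a])

check_normalized(P)
print(expected_reward(P, "s1", "go"))  # 0.7 * 5.0 + 0.3 * 0.0
```

Tables like this are exactly the input that value iteration and policy iteration consume: both algorithms repeatedly sweep over `P[s][a]` to back up expected values.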