This guide covers the reinforcement learning framework and multi-armed bandits through MDPs, value functions, Bellman equations, and dynamic programming (policy evaluation, policy iteration, value iteration). Chapters 1–10.
Chapter 1: The Reinforcement Learning Framework
Learning objectives

- Identify the main components of an RL system: agent, environment, state, action, reward.
- Compute the discounted return for a sequence of rewards.
- Relate the gridworld to real tasks (e.g., navigation, games) where an agent receives delayed reward.

Concept and real-world RL

In reinforcement learning, an agent interacts with an environment: at each step the agent is in a state, chooses an action, and receives a reward and a new state. The return is the sum of (discounted) rewards along a trajectory; the agent's goal is to maximize this return. A gridworld is a simple environment in which states are cells and actions move the agent; it models robot navigation (e.g., a robot moving to a goal in a warehouse) and game AI (e.g., a character moving on a map). In robot navigation, the state might be (row, col); the actions up/down/left/right; and the reward +1 at the goal, often with 0 or a small penalty per step. Discounting (\(\gamma < 1\)) makes future rewards worth less than immediate ones and keeps the return finite over long or infinite horizons. ...
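The agent–environment loop and the discounted return described above can be sketched in a few lines of Python. This is an illustrative example, not from the text: the 3×3 grid, the goal at (2, 2), the -0.04 step penalty, and the fixed trajectory are all assumptions chosen to mirror the description (state = (row, col), actions up/down/left/right, +1 at the goal, small penalty per step).

```python
def discounted_return(rewards, gamma):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ... (computed back-to-front)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical 3x3 gridworld: states are cells (row, col), goal at (2, 2).
GOAL = (2, 2)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply an action (clipped at the walls); reward is +1 at the goal,
    -0.04 per step otherwise (an assumed small step penalty)."""
    dr, dc = MOVES[action]
    row = min(max(state[0] + dr, 0), 2)
    col = min(max(state[1] + dc, 0), 2)
    next_state = (row, col)
    reward = 1.0 if next_state == GOAL else -0.04
    return next_state, reward

# One trajectory from the start cell to the goal.
state = (0, 0)
rewards = []
for action in ["down", "down", "right", "right"]:
    state, reward = step(state, action)
    rewards.append(reward)

print(state)                                           # (2, 2): the goal
print(round(discounted_return(rewards, gamma=0.9), 4)) # 0.6206
```

With \(\gamma = 0.9\), the +1 goal reward arriving on the fourth step is worth only \(0.9^3 = 0.729\) at the start, which is why delayed reward shrinks the return even though the trajectory succeeds.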