Throughout this curriculum we refer to anchor scenarios: concrete real-world settings where reinforcement learning is used. These help you see how each concept (MDPs, value functions, policy gradients, etc.) appears in practice. Whenever you meet a new concept, ask: “How does this show up in robot navigation? In game AI? In recommendation?”


Anchor scenarios

| Scenario | What it is | State / action / reward (typical) | Where it appears in the curriculum |
| --- | --- | --- | --- |
| Robot navigation | A robot or agent moves in a physical or simulated space to reach a goal. | State: position, velocity, sensor readings. Action: move, turn. Reward: +1 at goal, small cost per step or collision. | Vol 1–2 (gridworld, MDPs, value iteration); Vol 3–5 (DQN, policy gradients for continuous control). |
| Game AI | An agent plays a game (board game, video game, card game) with rules and opponents. | State: board position or game screen; action: move, play card; reward: win/loss or score. | Vol 1 (return, discounting); Vol 2 (blackjack MC, Q-learning); Vol 3–4 (DQN, policy gradients); Vol 7 (exploration). |
| Recommendation | A system suggests items (videos, products, articles) to users; goal is long-term engagement or satisfaction. | State: user history, context; action: which item to show; reward: click, watch time, or purchase. | Vol 1–2 (bandits, MDP for sequential decisions); Vol 8 (offline RL from logs); Vol 10 (real-world RL). |
| Trading / finance | An agent makes buy/sell/hold decisions in markets with uncertain outcomes. | State: prices, portfolio, indicators; action: trade or hold; reward: profit, risk-adjusted return. | Vol 1 (delayed reward, discounting); Vol 6 (model-based); Vol 10 (safety, real-world). |
| Healthcare / dosing | Decisions about treatment, dosage, or interventions over time with safety constraints. | State: patient history, vitals; action: dose or intervention; reward: outcome minus harm. | Vol 1 (MDP, reward design); Vol 8 (offline RL from historical data); Vol 10 (safety, constraints). |
| Dialogue / assistants | An agent (chatbot, voice assistant) chooses responses to maximize user satisfaction or task completion. | State: conversation history, user intent; action: response or API call; reward: user feedback, task success. | Vol 4–5 (policy gradients, PPO); Vol 10 (RLHF, LLMs). |
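
To make the state / action / reward column concrete, here is a minimal sketch of the robot navigation anchor as a tiny deterministic gridworld. Everything in it is an assumption made for illustration: the class name GridNav, the action names, the 3x3 grid, the step cost of -0.01, and the simplified reset/step interface (which is not any particular library's API). The reward follows the table: +1 at the goal, a small cost per step otherwise.

```python
from dataclasses import dataclass

# Hypothetical action names mapped to (row, col) offsets; illustration only.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


@dataclass
class GridNav:
    """Tiny deterministic gridworld: the agent walks toward a goal cell."""
    rows: int = 3
    cols: int = 3
    goal: tuple = (2, 2)
    step_cost: float = -0.01  # small cost per step (assumed value)

    def reset(self):
        # State: the agent's (row, col) position.
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        # Action: one of the four moves; walls clamp the position.
        dr, dc = ACTIONS[action]
        r = min(max(self.pos[0] + dr, 0), self.rows - 1)
        c = min(max(self.pos[1] + dc, 0), self.cols - 1)
        self.pos = (r, c)
        # Reward: +1 at the goal, otherwise the per-step cost.
        done = self.pos == self.goal
        reward = 1.0 if done else self.step_cost
        return self.pos, reward, done


env = GridNav()
state = env.reset()
trajectory = []
for action in ["right", "right", "down", "down"]:
    state, reward, done = env.step(action)
    trajectory.append((action, state, reward, done))
```

Each entry of trajectory records one (action, next state, reward, done) step. The other anchors fit the same shape: only what counts as a state, an action, and a reward changes.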

How we use these in chapters

  • Concept and real-world RL: Each chapter ties the concept to at least one anchor (e.g. “In robot navigation, the state is (position, velocity); in recommendation, the state can be user history.”).
  • Where you see this in practice: Some chapters add a short list, e.g. “Used in: AlphaGo (MDP), industrial control (value iteration).”
  • Exercises: When an exercise is generic (e.g. “implement Q-learning”), you can reuse the same code on a gridworld (robot-like) or a simple game (game AI). The math is the same; only the scenario, and therefore the meaning of state, action, and reward, changes.
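
As an illustration of the exercises point, the sketch below runs one tabular Q-learning function, unchanged, on two toy environments: a corridor (robot-like) and a small take-away game against a random opponent (game-like). The environments Corridor and Nim, the reset/step interface, and all hyperparameter values are hypothetical choices made for this example, not part of the curriculum's exercises.

```python
import random
from collections import defaultdict


class Corridor:
    """Robot-like task: walk right along a 1-D corridor to reach the last cell."""
    actions = [0, 1]  # 0 = left, 1 = right

    def __init__(self, length=5):
        self.length = length

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        move = 1 if action == 1 else -1
        self.pos = max(0, min(self.length - 1, self.pos + move))
        done = self.pos == self.length - 1
        return self.pos, (1.0 if done else -0.01), done


class Nim:
    """Game-like task: take 1 or 2 stones; whoever takes the last stone wins."""
    actions = [1, 2]  # number of stones to take

    def __init__(self, stones=7, seed=0):
        self.start = stones
        self.rng = random.Random(seed)  # drives the random opponent

    def reset(self):
        self.stones = self.start
        return self.stones

    def step(self, action):
        self.stones -= min(action, self.stones)
        if self.stones == 0:
            return 0, 1.0, True  # agent took the last stone: win
        self.stones -= min(self.rng.choice(self.actions), self.stones)
        if self.stones == 0:
            return 0, -1.0, True  # opponent took the last stone: loss
        return self.stones, 0.0, False


def q_learning(env, episodes=2000, alpha=0.5, gamma=0.9,
               epsilon=0.2, max_steps=200, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0
    for _ in range(episodes):
        s = env.reset()
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = rng.choice(env.actions)  # explore
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])  # exploit
            s2, r, done = env.step(a)
            # Bootstrapped target: no future value after a terminal step.
            target = r if done else r + gamma * max(Q[(s2, b)] for b in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
            if done:
                break
    return Q


def greedy(Q, actions, state):
    """Action with the highest learned value in the given state."""
    return max(actions, key=lambda act: Q[(state, act)])


corridor_q = q_learning(Corridor())
nim_q = q_learning(Nim())
corridor_policy = [greedy(corridor_q, Corridor.actions, s) for s in range(4)]
nim_move_at_2 = greedy(nim_q, Nim.actions, 2)
```

The update rule never inspects what a state or action means. In the corridor the learned greedy policy should move right in every cell, and in the game it should take both stones when two remain, yet q_learning itself is identical in both runs.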

Quick reference by volume

  • Vol 1 (Mathematical foundations): Gridworld and bandits → robot navigation, game AI, recommendation (multi-armed bandits).
  • Vol 2 (Tabular methods): Blackjack, gridworld → game AI, robot navigation.
  • Vol 3 (Value approximation, DQN): CartPole, Atari-style → robot control, game AI.
  • Vol 4–5 (Policy gradients, PPO): CartPole, MuJoCo → robot control, dialogue, game AI.
  • Vol 6 (Model-based): Planning, Dreamer → robot navigation, trading.
  • Vol 7 (Exploration): Bandits, curiosity → recommendation, game AI.
  • Vol 8 (Offline / imitation): Batch data → recommendation, healthcare.
  • Vol 9 (Multi-agent): Multiple agents → game AI, trading, dialogue.
  • Vol 10 (Real-world, safety, LLMs): All anchors; safety for healthcare and finance; RLHF for dialogue.

Use these anchors to make the curriculum concrete. When in doubt, map the current chapter to one of the six anchors and ask: “What is the state? The action? The reward?”