Throughout this curriculum we refer to anchor scenarios—concrete real-world settings where reinforcement learning is used. These help you see how each concept (MDPs, value functions, policy gradients, etc.) appears in practice. When you see a concept, ask: “How does this show up in robot navigation? In game AI? In recommendation?”
Anchor scenarios
| Scenario | What it is | State / action / reward (typical) | Where it appears in the curriculum |
|---|---|---|---|
| Robot navigation | A robot or agent moves in a physical or simulated space to reach a goal. | State: position, velocity, sensor readings. Action: move, turn. Reward: +1 at goal, small cost per step or collision. | Vol 1–2 (gridworld, MDPs, value iteration); Vol 3–5 (DQN, policy gradients for continuous control). |
| Game AI | An agent plays a game (board game, video game, card game) with rules and opponents. | State: board position or game screen; action: move, play card; reward: win/loss or score. | Vol 1 (return, discounting); Vol 2 (blackjack MC, Q-learning); Vol 3–4 (DQN, policy gradients); Vol 7 (exploration). |
| Recommendation | A system suggests items (videos, products, articles) to users; goal is long-term engagement or satisfaction. | State: user history, context; action: which item to show; reward: click, watch time, or purchase. | Vol 1–2 (bandits, MDP for sequential decisions); Vol 8 (offline RL from logs); Vol 10 (real-world RL). |
| Trading / finance | An agent makes buy/sell/hold decisions in markets with uncertain outcomes. | State: prices, portfolio, indicators; action: trade or hold; reward: profit, risk-adjusted return. | Vol 1 (delayed reward, discounting); Vol 6 (model-based); Vol 10 (safety, real-world). |
| Healthcare / dosing | Decisions about treatment, dosage, or interventions over time with safety constraints. | State: patient history, vitals; action: dose or intervention; reward: outcome minus harm. | Vol 1 (MDP, reward design); Vol 8 (offline RL from historical data); Vol 10 (safety, constraints). |
| Dialogue / assistants | An agent (chatbot, voice assistant) chooses responses to maximize user satisfaction or task completion. | State: conversation history, user intent; action: response or API call; reward: user feedback, task success. | Vol 4–5 (policy gradients, PPO); Vol 10 (RLHF, LLMs). |
How we use these in chapters
- Concept and real-world RL: Each chapter ties the concept to at least one anchor (e.g. “In robot navigation, the state is (position, velocity); in recommendation, the state can be user history.”).
- Where you see this in practice: Some chapters add a short list, e.g. “Used in: AlphaGo (MDP), industrial control (value iteration).”
- Exercises: When an exercise is generic (e.g. “implement Q-learning”), you can re-use the same code on a gridworld (robot-like) or a simple game (game AI). The math is the same; the scenario gives context.
Quick reference by volume
- Vol 1 (Mathematical foundations): Gridworld and bandits → robot navigation, game AI, recommendation (multi-armed bandits).
- Vol 2 (Tabular methods): Blackjack, gridworld → game AI, robot navigation.
- Vol 3 (Value approximation, DQN): CartPole, Atari-style → robot control, game AI.
- Vol 4–5 (Policy gradients, PPO): CartPole, MuJoCo → robot control, dialogue, game AI.
- Vol 6 (Model-based): Planning, Dreamer → robot navigation, trading.
- Vol 7 (Exploration): Bandits, curiosity → recommendation, game AI.
- Vol 8 (Offline / imitation): Batch data → recommendation, healthcare.
- Vol 9 (Multi-agent): Multiple agents → game AI, trading, dialogue.
- Vol 10 (Real-world, safety, LLMs): All anchors; safety for healthcare and finance; RLHF for dialogue.
Use these anchors to make the curriculum concrete. When in doubt, map the current chapter to one of the six and ask: “What is the state? The action? The reward?”