This page lists every topic in the intended order, from the welcome material and bandits through MDPs, dynamic programming, Monte Carlo, temporal difference learning, approximation methods, the projects, and the appendix. Follow this outline for a clear beginner-to-advanced path. Each item points to the relevant curriculum chapter, prerequisite, or dedicated page.
Welcome
| Topic | Where to find it |
|---|---|
| Introduction | Home |
| Course Outline and Big Picture | This page |
| Where to get the Code | Dedicated page |
| How to Succeed in this Course | Dedicated page |
Warmup — Multi-Armed Bandit
| Topic | Where to find it |
|---|---|
| Section Introduction: The Explore-Exploit Dilemma | Chapter 2: Multi-Armed Bandits |
| Applications of the Explore-Exploit Dilemma | Chapter 2 |
| Epsilon-Greedy Theory | Chapter 2 |
| Calculating a Sample Mean (pt 1) | Math for RL: Probability |
| Epsilon-Greedy Beginner’s Exercise Prompt | Chapter 2 |
| Designing Your Bandit Program | Chapter 2 |
| Epsilon-Greedy in Code | Chapter 2 |
| Comparing Different Epsilons | Chapter 2 |
| Optimistic Initial Values Theory | Chapter 2 (hints); Bandits: Optimistic Initial Values |
| Optimistic Initial Values Beginner’s Exercise Prompt | Bandits: Optimistic Initial Values |
| Optimistic Initial Values Code | Bandits: Optimistic Initial Values |
| UCB1 Theory | Dedicated page |
| UCB1 Beginner’s Exercise Prompt | Bandits: UCB1 |
| UCB1 Code | Bandits: UCB1 |
| Bayesian Bandits / Thompson Sampling Theory (pt 1) | Dedicated page |
| Bayesian Bandits / Thompson Sampling Theory (pt 2) | Bandits: Thompson Sampling |
| Thompson Sampling Beginner’s Exercise Prompt | Bandits: Thompson Sampling |
| Thompson Sampling Code | Bandits: Thompson Sampling |
| Thompson Sampling With Gaussian Reward Theory | Bandits: Thompson Sampling |
| Thompson Sampling With Gaussian Reward Code | Bandits: Thompson Sampling |
| Exercise on Gaussian Rewards | Bandits: Thompson Sampling |
| Why don’t we just use a library? | Dedicated page |
| Nonstationary Bandits | Dedicated page |
| Bandit Summary, Real Data, and Online Learning | Chapter 2; Bandits: Nonstationary |
| (Optional) Alternative Bandit Designs | Chapter 2 |
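The bandit table above lists several algorithms by name only; as a minimal taste of the first one, here is a sketch of epsilon-greedy with an incrementally updated sample mean. The Gaussian arms, arm means, and function name are illustrative assumptions, not the course's code:

```python
import random

def run_epsilon_greedy(true_means, epsilon, n_steps, seed=0):
    """Minimal epsilon-greedy bandit: estimate each arm's mean reward
    with an incrementally updated sample mean."""
    rng = random.Random(seed)
    k = len(true_means)
    estimates = [0.0] * k   # current sample-mean estimate per arm
    counts = [0] * k        # number of pulls per arm
    total_reward = 0.0
    for _ in range(n_steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                            # explore
        else:
            arm = max(range(k), key=lambda a: estimates[a])   # exploit
        reward = rng.gauss(true_means[arm], 1.0)  # hypothetical Gaussian arm
        counts[arm] += 1
        # incremental sample mean: Q <- Q + (r - Q) / n
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward
```

With a small epsilon, the agent mostly exploits its current best estimate while still sampling every arm often enough for the sample means to converge.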
High-Level Overview of Reinforcement Learning
| Topic | Where to find it |
|---|---|
| What is Reinforcement Learning? | Chapter 1 |
| From Bandits to Full Reinforcement Learning | Chapter 1, Chapter 2 |
| Markov Decision Processes | Chapter 3 |
MDP Section
| Topic | Where to find it |
|---|---|
| MDP Section Introduction | Chapter 3: MDPs |
| Gridworld | Dedicated page |
| Choosing Rewards | Dedicated page |
| The Markov Property | Chapter 3 |
| Markov Decision Processes (MDPs) | Chapter 3 |
| Future Rewards | Chapter 4: Reward Hypothesis, Chapter 5: Value Functions |
| Value Functions | Chapter 5 |
| The Bellman Equation (pt 1–3) | Chapter 6: The Bellman Equations |
| Bellman Examples | Chapter 6 |
| Optimal Policy and Optimal Value Function (pt 1–2) | Chapter 6 |
| MDP Summary | Chapter 3 – Chapter 6 |
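The Bellman expectation equation covered above can be made concrete with one synchronous backup, V(s) <- sum_a pi(a|s) sum_s' p(s'|s,a) [r + gamma V(s')]. The two-state MDP, transitions, and policy below are invented purely for illustration:

```python
gamma = 0.9

# p[(s, a)] = list of (probability, next_state, reward) for a toy MDP
p = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 2.0)],
    ("s1", "go"):   [(1.0, "s0", 0.0)],
}
# pi[s][a] = probability of taking action a in state s
pi = {"s0": {"stay": 0.5, "go": 0.5}, "s1": {"stay": 1.0, "go": 0.0}}

def bellman_backup(V):
    """Return the value function after one synchronous Bellman sweep."""
    new_V = {}
    for s in V:
        total = 0.0
        for a, pa in pi[s].items():
            for prob, s2, r in p[(s, a)]:
                total += pa * prob * (r + gamma * V[s2])
        new_V[s] = total
    return new_V

V = bellman_backup({"s0": 0.0, "s1": 0.0})
# after one sweep from zero: V(s0) = 0.5*0.8*1.0 = 0.4, V(s1) = 2.0
```

Repeating the sweep until the values stop changing is exactly the iterative policy evaluation algorithm in the dynamic programming section below.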
Dynamic Programming
| Topic | Where to find it |
|---|---|
| Dynamic Programming Section Introduction | Volume 1 |
| Iterative Policy Evaluation | Chapter 7 |
| Designing Your RL Program | Chapter 7 |
| Gridworld in Code | Dedicated page |
| Iterative Policy Evaluation in Code | Dedicated page |
| Windy Gridworld | Dedicated page |
| Iterative Policy Evaluation for Windy Gridworld | Windy Gridworld |
| Policy Improvement | Chapter 8: Policy Iteration |
| Policy Iteration | Chapter 8 |
| Policy Iteration in Code | Chapter 8; DP code walkthrough |
| Policy Iteration in Windy Gridworld | Windy Gridworld |
| Value Iteration | Chapter 9 |
| Value Iteration in Code | Chapter 9; DP code walkthrough |
| Dynamic Programming Summary | Chapter 10: Limitations of DP |
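As a sketch of the value iteration topic listed above, here is a minimal implementation on a hypothetical 1-D gridworld; the five-state layout and reward scheme are assumptions for illustration, not the course's Gridworld:

```python
def value_iteration(n_states=5, gamma=0.9, tol=1e-8):
    """Value iteration on a hypothetical 1-D gridworld: states 0..n-1,
    actions left/right (walls clamp movement), reward +1 for entering
    the terminal rightmost state."""
    V = [0.0] * n_states
    terminal = n_states - 1
    while True:
        delta = 0.0
        for s in range(n_states):
            if s == terminal:
                continue  # terminal state keeps value 0
            candidates = []
            for step in (-1, +1):  # left, right
                s2 = min(max(s + step, 0), n_states - 1)
                r = 1.0 if s2 == terminal else 0.0
                candidates.append(r + gamma * V[s2])
            best = max(candidates)  # Bellman optimality backup
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```

The converged values fall off geometrically with distance from the goal: V(s) = gamma^(distance - 1), which is what discounting predicts.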
Monte Carlo
| Topic | Where to find it |
|---|---|
| Monte Carlo Intro | Chapter 11 |
| Monte Carlo Policy Evaluation | Chapter 11 |
| Monte Carlo Policy Evaluation in Code | Dedicated page |
| Monte Carlo Control | Chapter 11 |
| Monte Carlo Control in Code | Monte Carlo in Code |
| Monte Carlo Control without Exploring Starts | Chapter 11; Monte Carlo in Code |
| Monte Carlo Control without Exploring Starts in Code | Monte Carlo in Code |
| Monte Carlo Summary | Chapter 11 |
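To illustrate the Monte Carlo policy evaluation topic above, here is a first-visit MC prediction sketch; the 5-state random walk and equiprobable policy are illustrative assumptions, not the course's environment:

```python
import random

def mc_first_visit(n_episodes=2000, gamma=1.0, seed=0):
    """First-visit Monte Carlo prediction on a hypothetical 5-state random
    walk: states 0..4, start at 2, terminate at 0 (reward 0) or 4 (reward 1),
    policy = move left/right with equal probability. True V(s) = s/4."""
    rng = random.Random(seed)
    returns_sum = {s: 0.0 for s in (1, 2, 3)}
    returns_cnt = {s: 0 for s in (1, 2, 3)}
    for _ in range(n_episodes):
        s = 2
        trajectory = []  # (state, reward received on leaving that state)
        while s not in (0, 4):
            s2 = s + rng.choice((-1, 1))
            trajectory.append((s, 1.0 if s2 == 4 else 0.0))
            s = s2
        # index of each state's first visit in this episode
        first = {}
        for i, (st, _) in enumerate(trajectory):
            first.setdefault(st, i)
        # accumulate returns backwards; record G only at first visits
        G = 0.0
        for i in range(len(trajectory) - 1, -1, -1):
            st, r = trajectory[i]
            G = gamma * G + r
            if first[st] == i:
                returns_sum[st] += G
                returns_cnt[st] += 1
    # sample-mean estimate of V for each non-terminal state
    return {s: returns_sum[s] / max(returns_cnt[s], 1) for s in returns_sum}
```

Unlike dynamic programming, nothing here needs the transition probabilities: the value estimates come entirely from averaging sampled returns.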
Temporal Difference Learning
| Topic | Where to find it |
|---|---|
| Temporal Difference Introduction | Chapter 12 |
| TD(0) Prediction | Chapter 12 |
| TD(0) Prediction in Code | Dedicated page |
| SARSA | Chapter 13 |
| SARSA in Code | TD, SARSA, Q-Learning in Code |
| Q-Learning | Chapter 14 |
| Q-Learning in Code | TD, SARSA, Q-Learning in Code |
| TD Learning Section Summary | Chapter 12 – Chapter 14 |
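The Q-learning topic above can be sketched in a few lines of tabular code; the 1-D gridworld, action encoding, and hyperparameters are illustrative assumptions, not the course's implementation:

```python
import random

def q_learning(n_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 1-D gridworld: states 0..4,
    start at 0, actions -1 (left) and +1 (right), walls clamp movement,
    and entering terminal state 4 gives reward +1 (0 otherwise)."""
    rng = random.Random(seed)
    actions = (-1, 1)
    Q = {(s, a): 0.0 for s in range(5) for a in actions}

    def greedy(s):
        # argmax with random tie-breaking
        best = max(Q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if Q[(s, a)] == best])

    for _ in range(n_episodes):
        s = 0
        while s != 4:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            s2 = min(max(s + a, 0), 4)
            r = 1.0 if s2 == 4 else 0.0
            # off-policy target: bootstrap from the max, 0 at the terminal
            bootstrap = 0.0 if s2 == 4 else max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * bootstrap - Q[(s, a)])
            s = s2
    return Q
```

Swapping the `max` in the bootstrap for the Q-value of the action actually taken next would turn this into SARSA, the on-policy counterpart listed above.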
Approximation Methods
| Topic | Where to find it |
|---|---|
| Approximation Methods Section Introduction | Volume 3 |
| Linear Models for Reinforcement Learning | Chapter 21 |
| Feature Engineering | Dedicated page |
| Approximation Methods for Prediction | Chapter 21 |
| Approximation Methods for Prediction Code | Chapter 21 |
| Approximation Methods for Control | Chapter 22 – Chapter 30 |
| Approximation Methods for Control Code | Volume 3 |
| CartPole | Dedicated page |
| CartPole Code | CartPole |
| Approximation Methods Exercise | Volume 3 chapters |
| Approximation Methods Section Summary | Volume 3 |
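The linear-models and prediction topics above replace the value table with a parameterized function. Here is a semi-gradient TD(0) sketch with a linear value function; the random-walk environment and two-dimensional feature vector are illustrative assumptions chosen so the true values are exactly representable:

```python
import random

def semi_gradient_td0(n_episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """Semi-gradient TD(0) with a linear value function v(s) = w . x(s)
    on a hypothetical 5-state random walk (terminals 0 and 4, reward +1
    on entering 4). Features x(s) = (s/4, 1), so the true values
    V(s) = s/4 are representable with w = (1, 0)."""
    rng = random.Random(seed)
    w = [0.0, 0.0]

    def x(s):
        return (s / 4.0, 1.0)

    def v(s):
        if s in (0, 4):
            return 0.0  # terminal states are worth 0 by definition
        f = x(s)
        return w[0] * f[0] + w[1] * f[1]

    for _ in range(n_episodes):
        s = 2
        while s not in (0, 4):
            s2 = s + rng.choice((-1, 1))
            r = 1.0 if s2 == 4 else 0.0
            # semi-gradient update: w += alpha * delta * grad_w v(s),
            # where the gradient of a linear v is just the feature vector
            delta = r + gamma * v(s2) - v(s)
            f = x(s)
            w[0] += alpha * delta * f[0]
            w[1] += alpha * delta * f[1]
            s = s2
    return w, v
```

The same update shape carries over to the control setting (and to CartPole): only the feature function and the policy change, not the semi-gradient rule itself.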
Interlude: Common Beginner Questions
| Topic | Where to find it |
|---|---|
| This Course vs. RL Book: What’s the Difference? | Dedicated page |
Stock Trading Project with Reinforcement Learning
| Topic | Where to find it |
|---|---|
| Beginners, halt! Stop here if you skipped ahead | Stock Trading intro |
| Stock Trading Project Section Introduction | Stock Trading |
| Data and Environment | Stock Trading: Data and Environment |
| How to Model Q for Q-Learning | Stock Trading: How to Model Q |
| Design of the Program | Stock Trading: Design |
| Code pt 1–4 | Stock Trading |
| Stock Trading Project Discussion | Stock Trading |
Appendix / FAQ
| Topic | Where to find it |
|---|---|
| What is the Appendix? | Appendix index |
| Setting Up Your Environment | Dedicated page |
| Pre-Installation Check | Setting Up Your Environment |
| Anaconda Environment Setup | Dedicated page |
| How to install Numpy, Scipy, Matplotlib, Pandas, IPython, Theano, TensorFlow | Installing Libraries |
| How to Code by Yourself (part 1) | Dedicated page |
| How to Code by Yourself (part 2) | Dedicated page |
| Proof that using Jupyter Notebook is the same as not using it | Appendix |
| Python 2 vs Python 3 | Prerequisites: Python |
| Effective Learning Strategies | Dedicated page |
| How to Succeed in this Course (Long Version) | Dedicated page |
| Is this for Beginners or Experts? Academic or Practical? Pace | Dedicated page |
| Machine Learning and AI Prerequisite Roadmap (pt 1–2) | Dedicated page |
Part 2 — Advanced (Volumes 4–10)
After the topics above, the curriculum continues with 70 more chapters in order:
| Volume | Topics |
|---|---|
| Volume 4: Policy Gradients | Policy-based methods, REINFORCE, actor-critic, A2C, A3C, DDPG, TD3 (Ch 31–40) |
| Volume 5: Advanced Policy Optimization | TRPO, PPO, SAC, hyperparameter tuning (Ch 41–50) |
| Volume 6: Model-Based RL & Planning | World models, MCTS, AlphaZero, Dreamer, MBPO, PETS (Ch 51–60) |
| Volume 7: Exploration and Meta-Learning | Hard exploration, intrinsic motivation, RND, Go-Explore, MAML, RL² (Ch 61–70) |
| Volume 8: Offline RL & Imitation Learning | CQL, Decision Transformers, behavioral cloning, IRL, GAIL, RLHF (Ch 71–80) |
| Volume 9: Multi-Agent RL (MARL) | Game theory, IQL, CTDE, MADDPG, VDN, QMIX, MAPPO (Ch 81–90) |
| Volume 10: Real-World RL, Safety & LLMs | Robotics, safe RL, trading, recommenders, RLHF for LLMs, evaluation (Ch 91–100) |
See the full Curriculum for all 100 chapters.