This learning path takes you from zero programming experience to understanding and building reinforcement learning systems. Follow the phases in order; each phase builds on the previous one. The order matches the Course outline (basic to advanced).
- Real-world scenarios – Six anchor scenarios (robot navigation, game AI, recommendation, trading, healthcare, dialogue) used throughout the curriculum so every concept is tied to practice.
Not ready for the Preliminary assessment? If you have never programmed, start with Phase 0. If the assessment feels hard, follow this learning path in order and return to it when you are ready.
Phase 0 – Programming from zero
For: Anyone with no prior coding experience.
Duration: About 2–4 weeks (at a few hours per week).
What you will do: Install Python, run your first script, and learn variables, types, conditionals, loops, and functions.
Outcomes:
- You can run a Python script and write a small program.
- You understand variables, conditionals, loops, and functions well enough to read simple code.
- You are ready for the full Python prerequisite.
In RL, this leads to: Every RL implementation is code. You will write loops over episodes, conditionals for exploration vs. exploitation, and functions for environments and agents.
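To make that concrete, here is a minimal sketch of those three patterns: a loop over steps, a conditional for exploration vs. exploitation, and a function wrapping the interaction. The `env_step` function and `q_values` table are hypothetical stand-ins for illustration, not part of any curriculum code.

```python
import random

def run_episode(env_step, n_actions, q_values, epsilon=0.1, max_steps=100):
    """Run one episode with an epsilon-greedy policy and return total reward."""
    total_reward = 0.0
    state = 0
    for _ in range(max_steps):          # loop over the steps of one episode
        if random.random() < epsilon:   # conditional: explore at random...
            action = random.randrange(n_actions)
        else:                           # ...or exploit the current estimates
            action = max(range(n_actions),
                         key=lambda a: q_values.get((state, a), 0.0))
        state, reward, done = env_step(state, action)
        total_reward += reward
        if done:
            break
    return total_reward
```

Every agent you build later follows this same skeleton, just with a real environment and learned values in place of the stand-ins.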
Start here: Phase 0: Programming from zero
New in this curriculum:
- Python Confidence Builder – 25 mini-challenges before Phase 1 (Level 0.5)
- RL in Plain English – no math, no code intuition builder (Level 1.5)
- Bridge Exercises – 15 problems combining Python + math + toy RL (Level 2.5)
Phase 1 – Math foundations for RL
For: Learners who can write basic Python (or have finished Phase 0 and the Python prerequisite) and want to solidify the math used in RL.
Duration: About 2–4 weeks.
What you will do: Study probability & statistics, linear algebra, and calculus with RL-motivated examples and practice. Work through the sub-phases in order, then take the self-check.
Sub-phases:
- 1a – Probability: Probability & Statistics. Expectations, variance, sample mean, law of large numbers. In RL: bandit rewards, Monte Carlo returns, value functions as expectations.
- 1b – Statistics: Statistics for RL. Mean, variance, standard deviation, standard error, histograms, correlation. In RL: analyzing episode returns, reporting results with error bars.
- 1c – Linear algebra: Linear algebra. Vectors, dot product, matrices, gradients. In RL: state vectors, linear value approximation \(V(s) = w^T \phi(s)\), gradient updates.
- 1d – Calculus: Calculus. Derivatives, chain rule, partial derivatives. In RL: policy gradients, loss minimization, backprop.
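A short NumPy sketch that touches all four sub-phases; the numbers are invented purely for illustration:

```python
import numpy as np

# 1a/1b: sample statistics of simulated episode returns.
returns = np.array([1.0, 3.0, 2.0, 4.0])
mean = returns.mean()                    # sample mean, estimates E[G]
var = returns.var(ddof=1)                # unbiased sample variance

# 1c: linear value approximation V(s) = w^T phi(s) is just a dot product.
w = np.array([0.5, -1.0, 2.0])           # weight vector
phi_s = np.array([1.0, 0.0, 0.5])        # feature vector for state s
v_s = w @ phi_s                          # scalar value estimate

# 1d: numerical derivative of f(x) = x^2 at x = 3 (exact answer: 6).
f = lambda x: x ** 2
h = 1e-6
dfdx = (f(3 + h) - f(3 - h)) / (2 * h)   # central difference, approx. 6.0
```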
Outcomes:
- You can read RL notation (expectations, distributions, vectors, gradients).
- You can compute sample means, variances, dot products, and simple derivatives.
- You feel comfortable with the math that appears in the Preliminary assessment and in Volume 1.
In RL, this leads to: Value functions are expectations; states and observations are vectors; policy gradients use calculus. Solid math makes every chapter easier.
Start here: Math for RL → Phase 1 self-check
Phase 2 – Prerequisites (tools and libraries)
For: Learners who have basic programming (and ideally some math) and are ready to use the stack the curriculum assumes.
Duration: About 3–6 weeks, depending on how much you already know.
What you will do: Work through Python (full), NumPy, Pandas, Matplotlib, PyTorch, TensorFlow, and Gym as needed. Each prerequisite page explains why RL needs it; complete the one small task per topic listed on the Prerequisites index, then take the Phase 2 readiness quiz.
Outcomes:
- You can use the data structures, classes, and patterns used in RL code (trajectories, configs, buffers).
- You can create arrays, do batch operations, and plot results with NumPy and Matplotlib.
- You can define and train small neural networks with PyTorch (or TensorFlow) and run Gym environments.
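As a taste of the "buffers" pattern this phase prepares you for, here is a minimal replay-buffer sketch. The class name and sizes are illustrative, not a specific library's API:

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # old transitions drop off

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniformly sample a batch and stack each field into a NumPy array,
        # the shape a PyTorch or TensorFlow network expects as input.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones
```

The same stack-into-arrays idiom shows up throughout the curriculum's exercises.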
In RL, this leads to: The curriculum's exercises assume this stack. Each prerequisite page also includes Professor's hints and lists of common pitfalls so you can avoid typical mistakes.
Start here: Prerequisites → Phase 2 readiness quiz
Phase 3 – Math for RL (deep dive)
For: Learners who have completed Phases 0–2 and want a deeper foundation before RL.
Duration: About 2–3 weeks.
What you will do: Complete Math for RL – probability, statistics, linear algebra, and calculus with RL-motivated examples. Work through the drills at the end of each math page.
Outcomes:
- You can interpret RL notation fluently.
- You understand why value functions are expectations, why gradients are used for optimization, and how to compute sample statistics from RL evaluation runs.
Start here: Math for RL index
Phase 4 – ML Foundations
For: Learners who want to understand supervised learning before tackling deep RL.
Duration: About 3–5 weeks.
What you will do: Complete ML Foundations – supervised learning, linear/logistic regression, gradient descent, model evaluation, decision trees, and KNN. Take the checkpoint at the midpoint and the Phase 4 assessment at the end.
Outcomes:
- You understand how supervised learning trains models from labeled data using gradient descent.
- You can implement and evaluate linear regression, logistic regression, and simple classifiers.
- You know how to split data, evaluate with metrics (accuracy, precision, recall, F1), and avoid data leakage.
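Those metrics are simple enough to compute by hand. A from-scratch sketch for binary labels follows; in practice you would usually reach for scikit-learn's implementations instead:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```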
In RL, this leads to: Deep RL builds on these foundations. DQN uses supervised-style regression on Q-values; policy networks are trained with gradient descent. Understanding supervised learning makes deep RL much clearer.
Start here: ML Foundations → ML Mid-Point Checkpoint → Phase 4 assessment
Phase 5 – DL Foundations
For: Learners who have completed ML Foundations and are ready for neural networks.
Duration: About 4–6 weeks.
What you will do: Complete DL Foundations – biological inspiration, perceptrons, MLP, backpropagation, loss functions, activations, optimizers, training loops, regularization, CNNs, PyTorch, and the mini-project. Take the DL mid-point checkpoint and Phase 5 assessment.
Outcomes:
- You can implement a full neural network from scratch in NumPy (forward pass, loss, backprop, gradient update).
- You understand how Adam, SGD, and Momentum work and when to use each.
- You can build a QNetwork and PolicyNetwork in PyTorch using nn.Module.
- You are ready to implement DQN and policy gradient methods.
In RL, this leads to: Everything in deep RL. DQN, REINFORCE, PPO, and actor-critic all use neural networks with the exact patterns you built here.
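In miniature, the from-scratch NumPy outcome above looks roughly like this; the architecture, data, and hyperparameters are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression: a one-hidden-layer network learns y = sum of features.
X = rng.normal(size=(32, 3))             # batch of 32 inputs, 3 features
y = X.sum(axis=1, keepdims=True)         # target values

W1 = rng.normal(scale=0.5, size=(3, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)
lr = 0.05

losses = []
for _ in range(200):
    # Forward pass: linear -> ReLU -> linear.
    h = np.maximum(X @ W1 + b1, 0.0)
    pred = h @ W2 + b2
    losses.append(float(((pred - y) ** 2).mean()))

    # Backward pass (chain rule), then a plain SGD update.
    d_pred = 2 * (pred - y) / len(X)
    dW2 = h.T @ d_pred;  db2 = d_pred.sum(axis=0)
    d_h = (d_pred @ W2.T) * (h > 0)      # ReLU gradient masks dead units
    dW1 = X.T @ d_h;     db1 = d_h.sum(axis=0)
    W1 -= lr * dW1;  b1 -= lr * db1
    W2 -= lr * dW2;  b2 -= lr * db2
```

The mean squared error falls steadily over the 200 steps; PyTorch automates exactly this forward/backward/update cycle.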
Start here: DL Foundations → DL Mid-Point Checkpoint → Phase 5 assessment
Phase 6 – RL foundations
For: Learners who have completed (or tested out of) Phases 0–5 and are ready for the core RL curriculum.
Duration: About 4–8 weeks.
What you will do: Complete Volume 1: Mathematical Foundations and Volume 2: Tabular Methods & Classic Algorithms (chapters 1–20). Use the milestone checkpoints and mini-project (tabular Q-learning on a 5×5 Gridworld) on the Phase 6 page, then take the Phase 6 foundations quiz.
Outcomes:
- You understand the RL framework (agent, environment, state, action, reward), MDPs, and the Markov property.
- You can explain value functions, Bellman equations, and discounting.
- You understand and can implement Monte Carlo, TD, SARSA, and Q-learning in tabular settings.
In RL, this leads to: Everything that follows (DQN, policy gradients, etc.) builds on these ideas. Do not skip this phase.
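The Q-learning update at the heart of the tabular mini-project is a single line of math. A minimal sketch, with an illustrative function signature:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step:
    Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)   # unseen (state, action) pairs default to 0.0
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
```

SARSA differs only in using the action actually taken next instead of the max.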
Start here: Volume 1: Mathematical Foundations → Phase 6 milestones & mini-project → Phase 6 foundations quiz
Phase 7 – Deep RL
For: Learners who have finished Volumes 1–2 and want to scale to large or continuous state spaces.
Duration: About 6–12 weeks.
What you will do: Complete Volume 3: Value Function Approximation & Deep Q-Learning, Volume 4: Policy Gradients, and Volume 5: Advanced Policy Optimization (chapters 21–50).
Outcomes:
- You can implement and tune DQN-style methods (replay, target networks, etc.) and policy gradient methods (REINFORCE, actor-critic, PPO).
- You understand why function approximation is needed and how gradient-based updates work in RL.
In RL, this leads to: Most practical applications use deep RL. This phase is where you go from “understanding the theory” to “building agents that work in complex environments.”
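The "target networks" mentioned in the outcomes above exist to compute stable Bellman regression targets for a batch of transitions. A sketch with illustrative names and shapes:

```python
import numpy as np

def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """Batch Bellman targets: y = r + gamma * max_a' Q_target(s', a'),
    with the bootstrap term zeroed out on terminal transitions."""
    max_next = next_q_values.max(axis=1)         # greedy value of each s'
    return rewards + gamma * max_next * (1.0 - dones)
```

The online network is then trained by ordinary regression of Q(s, a) toward these targets, which is why the supervised-learning phases matter so much here.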
Start here: Volume 3: Value Function Approximation & Deep Q-Learning → Phase 7 milestones & coding challenges → Phase 7 Deep RL quiz
Phase 8 – Advanced topics
For: Learners who have completed Phases 6โ7 and want to go deeper.
Duration: Ongoing (pick topics as needed).
What you will do: Work through Volumes 6–10 (chapters 51–100): model-based RL, exploration, offline RL, multi-agent RL, real-world applications, safety, and RL with large language models. Each volume has a topic roadmap (what you will learn per chapter); use it to pick a path. An optional Phase 8 project (e.g. offline RL on a fixed dataset, or a simple multi-agent scenario) ties concepts together.
Topic roadmaps (after this you will…):
- Vol 6 (Model-based): Compare model-free vs model-based; learn world models and compounding error; implement planning (BFS, MCTS), Dreamer-style imagination, MBPO, PETS.
- Vol 7 (Exploration & meta): Tackle hard exploration (sparse rewards); intrinsic motivation, curiosity (ICM), RND; Go-Explore; meta-learning (MAML, RL²).
- Vol 8 (Offline & imitation): Understand offline RL (distribution shift, CQL); Decision Transformers; behavioral cloning, DAgger, IRL, GAIL; RLHF basics.
- Vol 9 (Multi-agent): Game theory basics; IQL, CTDE, MADDPG; VDN, QMIX; MAPPO; self-play; communication.
- Vol 10 (Real-world, safety, LLMs): Robotics and sim-to-real; safe RL; trading, recommenders; PPO/RLHF for LLMs; evaluation and debugging.
Optional project: Implement offline RL on a fixed dataset (e.g. collected by a random or expert policy) using conservative Q-learning (CQL), or implement a simple two-agent cooperative task with parameter sharing (MAPPO or IQL).
Outcomes:
- You can read RL papers and extend the project.
- You understand model-based methods, exploration, offline and imitation learning, MARL, and how RL is used in practice (robotics, trading, recommenders, RLHF).
In RL, this leads to: Research and industry applications. Use the curriculum as a map and dive into the areas that interest you most.
Start here: Volume 6: Model-Based RL & Planning
Quick reference
| Phase | Content | Duration (approx.) |
|---|---|---|
| 0 | Programming from zero | 2–4 weeks |
| 1 | Math for RL | 2–4 weeks |
| 2 | Prerequisites | 3–6 weeks |
| 3 | Math for RL (deep dive) | 2–3 weeks |
| 4 | ML Foundations | 3–5 weeks |
| 5 | DL Foundations | 4–6 weeks |
| 6 | Volume 1 + Volume 2 | 4–8 weeks |
| 7 | Volumes 3–5 | 6–12 weeks |
| 8 | Volumes 6–10 | Ongoing |
Good luck on your journey from zero to mastery.
Additional resources
- Interactive phase modules (0–8) – Module-style hubs with progress tracking and links into each phase (same UI as the Deep RL demo).
- Glossary – 75 RL terms with definitions, chapter references, and examples.
- Assessments – Phase 0 through Phase 8, plus mid-point checkpoints.
- Appendix: Debugging RL Code – Common bugs and 5 find-the-bug exercises.
- Appendix: Reading RL Papers – How to read the DQN, PPO, and SAC papers.
- Interactive Lab – Run Python in your browser (JupyterLite).