Reinforcement learning curriculum: 10 volumes and 100 chapters

100 chapters from mathematical foundations to advanced RL, with one exercise per chapter.

Chapters 1–10 — RL framework, bandits, MDPs, reward hypothesis, value functions, Bellman equations, dynamic programming.

Key Topics to Learn

RL Framework: Agent, Environment, State, Action, Reward
Multi-Armed Bandits and Exploration vs. Exploitation
Markov Decision Processes (MDPs)
Value Functions V(s) and Q(s,a)
Bellman Equations and Dynamic Programming
Policy Evaluation, Policy Iteration, Value Iteration

Go to Volume 1: Mathematical Foundations →

Chapters 11–20 — Monte Carlo, TD, SARSA, Q-learning, Expected SARSA, n-step bootstrapping, Dyna-Q, custom Gymnasium environments, hyperparameter tuning.

Key Topics to Learn

Monte Carlo Prediction and Control
Temporal Difference Learning (TD(0), TD(λ))
SARSA (on-policy) and Q-Learning (off-policy)
n-Step Bootstrapping
Planning with Dyna-Q
Custom Gymnasium Environments

Go to Volume 2: Tabular Methods & Classic Algorithms →

Chapters 21–30 — Linear function approximation, neural networks for RL, DQN, experience replay, target networks, Double DQN, Dueling DQN, PER, NoisyNet, Rainbow.

Key Topics to Learn

Linear Function Approximation
Neural Networks as Value Function Approximators
Deep Q-Networks (DQN): Replay Buffer and Target Network
Double DQN and Dueling DQN
Prioritized Experience Replay (PER)
Rainbow: Combining DQN Improvements

Go to Volume 3: Value Function Approximation & Deep Q-Learning →

Chapters 31–40 — Policy-based methods, REINFORCE, variance reduction, actor-critic, A2C, A3C, continuous actions, DDPG, TD3.

Key Topics to Learn

Policy Objective and REINFORCE
Variance Reduction with Baselines
Actor-Critic Architecture
Advantage Actor-Critic (A2C) and Asynchronous Advantage Actor-Critic (A3C)
Continuous Action Spaces
DDPG and TD3

Go to Volume 4: Policy Gradients →

Chapters 41–50 — TRPO, PPO, GAE, PPO implementation, maximum entropy RL, SAC, SAC vs. PPO, custom environments, hyperparameter tuning.

Key Topics to Learn

Trust Region Policy Optimization (TRPO)
Proximal Policy Optimization (PPO)
Generalized Advantage Estimation (GAE)
Maximum Entropy RL
Soft Actor-Critic (SAC)
Advanced Hyperparameter Tuning

Go to Volume 5: Advanced Policy Optimization →

Chapters 51–60 — Model-free vs. model-based, world models, planning with known models, MCTS, AlphaZero, MuZero, Dreamer, MBPO, PETS.

Key Topics to Learn

Model-Free vs. Model-Based RL
Learning World Models
Planning: BFS, Monte Carlo Tree Search (MCTS), AlphaZero
MuZero: Planning Without a Known Model
Dreamer: Latent Imagination
MBPO and PETS

Go to Volume 6: Model-Based RL & Planning →

Chapters 61–70 — Hard exploration, intrinsic motivation, ICM, RND, count-based exploration, Go-Explore, meta-learning, MAML, RL², unsupervised environment design (UED).

Key Topics to Learn

Hard Exploration and Sparse Rewards
Intrinsic Motivation and Curiosity (ICM)
Random Network Distillation (RND)
Count-Based Exploration
Go-Explore Algorithm
Meta-Learning: MAML and RL²

Go to Volume 7: Exploration and Meta-Learning →

Chapters 71–80 — Offline RL problem, CQL, Decision Transformers, behavioral cloning, DAgger, IRL, GAIL, AMP, offline-to-online, RLHF basics.

Key Topics to Learn

Offline RL Problem and Distribution Shift
Conservative Q-Learning (CQL)
Decision Transformers
Behavioral Cloning and DAgger
Inverse RL and GAIL
RLHF Basics

Go to Volume 8: Offline RL & Imitation Learning →

Chapters 81–90 — Multi-agent fundamentals, game theory, IQL, CTDE, MADDPG, VDN, QMIX, MAPPO, self-play, communication.

Key Topics to Learn

Multi-Agent Fundamentals and Game Theory
Independent Q-Learning (IQL)
Centralized Training, Decentralized Execution (CTDE)
MADDPG, VDN, and QMIX
Multi-Agent PPO (MAPPO)
Self-Play, League Training, and Communication

Go to Volume 9: Multi-Agent RL (MARL) →

Chapters 91–100 — RL in robotics, safe RL, algorithmic trading, recommender systems, PPO for LLMs, RLHF, DPO, evaluating agents, debugging, future of RL.

Key Topics to Learn

RL in Robotics and Sim-to-Real Transfer
Safe Reinforcement Learning
Algorithmic Trading and Recommender Systems
Training LLMs with PPO and RLHF
Direct Preference Optimization (DPO)
Evaluating and Debugging RL Agents

Go to Volume 10: Real-World RL, Safety & Large Language Models →

Ten volumes and 100 chapters, each with an exercise to reinforce the material. Start with Volume 1: Mathematical Foundations and work through the volumes in order, or jump to the volume that matches your level.

Volume 1: Mathematical Foundations — Chapters 1–10
Volume 2: Tabular Methods & Classic Algorithms — Chapters 11–20
Volume 3: Value Function Approximation & Deep Q-Learning — Chapters 21–30
Volume 4: Policy Gradients — Chapters 31–40
Volume 5: Advanced Policy Optimization — Chapters 41–50
Volume 6: Model-Based RL & Planning — Chapters 51–60
Volume 7: Exploration and Meta-Learning — Chapters 61–70
Volume 8: Offline RL & Imitation Learning — Chapters 71–80
Volume 9: Multi-Agent RL (MARL) — Chapters 81–90
Volume 10: Real-World RL, Safety & Large Language Models — Chapters 91–100