Chapter 1: The Reinforcement Learning Framework

Learning objectives
- Identify the main components of an RL system: agent, environment, state, action, reward.
- Compute the discounted return for a sequence of rewards.
- Relate the gridworld to real tasks (e.g. navigation, games) where an agent gets delayed reward.

Concept and real-world RL
In reinforcement learning, an agent interacts with an environment: at each step the agent is in a state, chooses an action, and receives a reward and a new state. The return is the sum of (discounted) rewards along a trajectory; the agent’s goal is to maximize this return. A gridworld is a simple environment where states are cells and actions move the agent; it models robot navigation (e.g. a robot moving to a goal in a warehouse) and game AI (e.g. a character moving on a map). In robot navigation, the state might be (row, col); the action is up/down/left/right; the reward is +1 at the goal and often 0 or a small penalty per step. Discounting (\(\gamma < 1\)) makes future rewards worth less than immediate ones and keeps the return finite over long or infinite horizons. ...
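The discounted-return computation above can be sketched in a few lines of Python (a minimal illustration; the function name and the example trajectory are ours, not the chapter's):

```python
def discounted_return(rewards, gamma=0.9):
    """Return sum_t gamma**t * r_t for a finite reward sequence."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# A +1 reward three steps away, zero elsewhere: worth gamma**2 = 0.81 today.
g = discounted_return([0.0, 0.0, 1.0], gamma=0.9)
```

With \(\gamma = 1\) the return is the plain sum of rewards; smaller \(\gamma\) weights near-term rewards more heavily.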

March 10, 2026 · 4 min · 748 words · codefrydev

Course Outline

This page lists every topic in the intended order: from welcome and bandits through MDPs, dynamic programming, Monte Carlo, temporal difference, approximation methods, projects, and appendix. Follow this outline for a clear basic-to-advanced path. Each item links to the relevant curriculum chapter, prerequisite, or dedicated page.

Welcome
- Introduction · Home
- Course Outline and Big Picture · This page
- Where to get the Code · Dedicated page
- How to Succeed in this Course · Dedicated page

Warmup — Multi-Armed Bandit
- Section Introduction: The Explore-Exploit Dilemma · Chapter 2: Multi-Armed Bandits
- Applications of the Explore-Exploit Dilemma · Chapter 2
- Epsilon-Greedy Theory · Chapter 2
- Calculating a Sample Mean (pt 1) · Math for RL: Probability
- Epsilon-Greedy Beginner’s Exercise Prompt · Chapter 2
- Designing Your Bandit Program · Chapter 2
- Epsilon-Greedy in Code · Chapter 2
- Comparing Different Epsilons · Chapter 2
- Optimistic Initial Values Theory · Chapter 2 (hints); Bandits: Optimistic Initial Values
- Optimistic Initial Values Beginner’s Exercise Prompt · Bandits: Optimistic Initial Values
- Optimistic Initial Values Code · Bandits: Optimistic Initial Values
- UCB1 Theory · Dedicated page
- UCB1 Beginner’s Exercise Prompt · Bandits: UCB1
- UCB1 Code · Bandits: UCB1
- Bayesian Bandits / Thompson Sampling Theory (pt 1) · Dedicated page
- Bayesian Bandits / Thompson Sampling Theory (pt 2) · Bandits: Thompson Sampling
- Thompson Sampling Beginner’s Exercise Prompt · Bandits: Thompson Sampling
- Thompson Sampling Code · Bandits: Thompson Sampling
- Thompson Sampling With Gaussian Reward Theory · Bandits: Thompson Sampling
- Thompson Sampling With Gaussian Reward Code · Bandits: Thompson Sampling
- Exercise on Gaussian Rewards · Bandits: Thompson Sampling
- Why don’t we just use a library? · Dedicated page
- Nonstationary Bandits · Dedicated page
- Bandit Summary, Real Data, and Online Learning · Chapter 2; Bandits: Nonstationary
- (Optional) Alternative Bandit Designs · Chapter 2

High-Level Overview of Reinforcement Learning
- What is Reinforcement Learning? · Chapter 1
- From Bandits to Full Reinforcement Learning · Chapter 1, Chapter 2
- Markov Decision Processes · Chapter 3

MDP Section
- MDP Section Introduction · Chapter 3: MDPs
- Gridworld · Dedicated page
- Choosing Rewards · Dedicated page
- The Markov Property · Chapter 3
- Markov Decision Processes (MDPs) · Chapter 3
- Future Rewards · Chapter 4: Reward Hypothesis, Chapter 5: Value Functions
- Value Functions · Chapter 5
- The Bellman Equation (pt 1–3) · Chapter 6: The Bellman Equations
- Bellman Examples · Chapter 6
- Optimal Policy and Optimal Value Function (pt 1–2) · Chapter 6
- MDP Summary · Chapter 3 – Chapter 6

Dynamic Programming
- Dynamic Programming Section Introduction · Volume 1
- Iterative Policy Evaluation · Chapter 7
- Designing Your RL Program · Chapter 7
- Gridworld in Code · Dedicated page
- Iterative Policy Evaluation in Code · Dedicated page
- Windy Gridworld · Dedicated page
- Iterative Policy Evaluation for Windy Gridworld · Windy Gridworld
- Policy Improvement · Chapter 8: Policy Iteration
- Policy Iteration · Chapter 8
- Policy Iteration in Code · Chapter 8; DP code walkthrough
- Policy Iteration in Windy Gridworld · Windy Gridworld
- Value Iteration · Chapter 9
- Value Iteration in Code · Chapter 9; DP code walkthrough
- Dynamic Programming Summary · Chapter 10: Limitations of DP

Monte Carlo
- Monte Carlo Intro · Chapter 11
- Monte Carlo Policy Evaluation · Chapter 11
- Monte Carlo Policy Evaluation in Code · Dedicated page
- Monte Carlo Control · Chapter 11
- Monte Carlo Control in Code · Monte Carlo in Code
- Monte Carlo Control without Exploring Starts · Chapter 11; Monte Carlo in Code
- Monte Carlo Control without Exploring Starts in Code · Monte Carlo in Code
- Monte Carlo Summary · Chapter 11

Temporal Difference Learning
- Temporal Difference Introduction · Chapter 12
- TD(0) Prediction · Chapter 12
- TD(0) Prediction in Code · Dedicated page
- SARSA · Chapter 13
- SARSA in Code · TD, SARSA, Q-Learning in Code
- Q-Learning · Chapter 14
- Q-Learning in Code · TD, SARSA, Q-Learning in Code
- TD Learning Section Summary · Chapter 12 – Chapter 14

Approximation Methods
- Approximation Methods Section Introduction · Volume 3
- Linear Models for Reinforcement Learning · Chapter 21
- Feature Engineering · Dedicated page
- Approximation Methods for Prediction · Chapter 21
- Approximation Methods for Prediction Code · Chapter 21
- Approximation Methods for Control · Chapter 22 – Chapter 30
- Approximation Methods for Control Code · Volume 3
- CartPole · Dedicated page
- CartPole Code · CartPole
- Approximation Methods Exercise · Volume 3 chapters
- Approximation Methods Section Summary · Volume 3

Interlude: Common Beginner Questions
- This Course vs. RL Book: What’s the Difference? · Dedicated page

Stock Trading Project with Reinforcement Learning (Dedicated section)
- Beginners, halt! Stop here if you skipped ahead · Stock Trading intro
- Stock Trading Project Section Introduction · Stock Trading
- Data and Environment · Stock Trading: Data and Environment
- How to Model Q for Q-Learning · Stock Trading: How to Model Q
- Design of the Program · Stock Trading: Design
- Code pt 1–4 · Stock Trading
- Stock Trading Project Discussion · Stock Trading

Appendix / FAQ
- What is the Appendix? · Appendix index
- Setting Up Your Environment · Dedicated page
- Pre-Installation Check · Setting Up Your Environment
- Anaconda Environment Setup · Dedicated page
- How to install Numpy, Scipy, Matplotlib, Pandas, IPython, Theano, TensorFlow · Installing Libraries
- How to Code by Yourself (part 1) · Dedicated page
- How to Code by Yourself (part 2) · Dedicated page
- Proof that using Jupyter Notebook is the same as not using it · Appendix
- Python 2 vs Python 3 · Prerequisites: Python
- Effective Learning Strategies · Dedicated page
- How to Succeed in this Course (Long Version) · Dedicated page
- Is this for Beginners or Experts? Academic or Practical? Pace · Dedicated page
- Machine Learning and AI Prerequisite Roadmap (pt 1–2) · Dedicated page

Part 2 — Advanced (Volumes 4–10)
After the topics above, the curriculum continues with 70 more chapters in order: ...

March 10, 2026 · 5 min · 1002 words · codefrydev

Chapter 2: Multi-Armed Bandits

Learning objectives
- Implement a multi-armed bandit environment with Gaussian rewards.
- Compare epsilon-greedy and greedy policies in terms of average reward and regret.
- Recognize the exploration–exploitation trade-off in a simple setting.

Concept and real-world RL
A multi-armed bandit is an RL problem with a single state: the agent repeatedly chooses an “arm” (action) and receives a reward drawn from a distribution associated with that arm. The goal is to maximize cumulative reward. Exploration (trying different arms) is needed to discover which arm has the highest mean; exploitation (choosing the best arm so far) maximizes immediate reward. In practice, bandits model A/B testing, clinical trials, and recommender systems (which ad or item to show). The 10-armed testbed is a standard benchmark: 10 arms with different unknown means; the agent learns from experience. ...
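A minimal sketch of the epsilon-greedy loop on a Gaussian bandit, assuming unit-variance rewards and an incremental sample-mean update (the arm means, seed, and function name are illustrative, not from the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)

def run_epsilon_greedy(true_means, eps=0.1, steps=1000):
    """Epsilon-greedy on a Gaussian bandit (unit-variance reward per arm)."""
    k = len(true_means)
    q = np.zeros(k)                      # sample-mean estimates
    counts = np.zeros(k)                 # pulls per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(k))     # explore: random arm
        else:
            a = int(np.argmax(q))        # exploit: current best estimate
        r = rng.normal(true_means[a], 1.0)
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]   # incremental sample mean
        total += r
    return q, total / steps

q, avg = run_epsilon_greedy([0.1, 0.5, 1.0])
```

Sweeping `eps` over a few values and plotting average reward per step is the standard way to see the explore-exploit trade-off on this testbed.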

March 10, 2026 · 4 min · 679 words · codefrydev

Bandits: Optimistic Initial Values

Learning objectives
- Understand why initializing action values optimistically can encourage exploration.
- Implement optimistic initial values and compare with epsilon-greedy on the 10-armed testbed.
- Recognize when optimistic initialization helps (stationary, deterministic-ish) and when it does not (nonstationary).

Theory
Optimistic initial values mean we set \(Q(a)\) to a value higher than the typical reward at the start (e.g. \(Q(a) = 5\) when rewards are usually in \([-2, 2]\)). The agent then chooses the arm with the highest \(Q(a)\). After a pull, the running mean update \(\bar{Q}_{n+1} = \bar{Q}_n + \frac{1}{n+1}(r - \bar{Q}_n)\) brings \(Q(a)\) down toward the true mean. So every arm looks “good” at first; as an arm is pulled, its \(Q\) drops toward reality. The agent is naturally encouraged to try all arms before settling, which is a form of exploration without epsilon. ...
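The mechanism can be sketched as follows. One design choice of this sketch (not prescribed by the text): the optimistic prior is counted as one pseudo-observation in the running mean, so the optimism fades gradually instead of vanishing after the first pull; the arm means and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def optimistic_greedy(true_means, q0=5.0, steps=1000):
    """Pure greedy selection with optimistic initial estimates (no epsilon)."""
    k = len(true_means)
    q = np.full(k, q0)           # every arm starts out looking great
    counts = np.zeros(k)
    for _ in range(steps):
        a = int(np.argmax(q))
        r = rng.normal(true_means[a], 1.0)
        counts[a] += 1
        # running mean with q0 counted as one pseudo-observation
        q[a] += (r - q[a]) / (counts[a] + 1.0)
    return counts

counts = optimistic_greedy([0.1, 0.5, 1.0])   # every arm gets tried early on
```

Because each pulled arm's estimate drops below 5 while untouched arms stay at 5, the greedy rule cycles through all arms before settling, which is the exploration effect described above.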

March 10, 2026 · 2 min · 305 words · codefrydev

Chapter 3: Markov Decision Processes (MDPs)

Learning objectives
- Define an MDP: states, actions, transition probabilities, and rewards.
- Write transition probability matrices \(P(s' \mid s, a)\) for a small MDP.
- Recognize the Markov property: the next state and reward depend only on the current state and action.

Concept and real-world RL
A Markov Decision Process (MDP) is the standard mathematical model for RL: a set of states, a set of actions, transition probabilities \(P(s', r \mid s, a)\), and a discount factor. The Markov property says that the future (next state and reward) depends only on the current state and action, not on earlier history. That allows us to plan using the current state alone. Real-world examples include board games (state = board position), robot navigation (state = position/velocity), and queue control (state = queue lengths). Writing out \(P\) and reward tables for a tiny MDP is the first step toward value iteration and policy iteration. ...
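Writing out \(P\) for a tiny MDP might look like this (a hypothetical two-state, two-action example; the state names and probabilities are made up for illustration):

```python
import numpy as np

# P[a][s, s'] = probability of landing in state s' from state s under action a.
# Two states: s0 ("working") and s1 ("done", absorbing).
P = {
    "go":   np.array([[0.2, 0.8],
                      [0.0, 1.0]]),
    "wait": np.array([[1.0, 0.0],
                      [0.0, 1.0]]),
}

# Sanity check: each row of a transition matrix must sum to 1.
for a, mat in P.items():
    assert np.allclose(mat.sum(axis=1), 1.0), a
```

Tables like this, together with a reward table, are exactly the inputs that value iteration and policy iteration consume later in the course.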

March 10, 2026 · 3 min · 574 words · codefrydev

Bandits: UCB1

Learning objectives
- Understand the UCB1 action-selection rule and why it explores uncertain arms.
- Implement UCB1 on the 10-armed testbed and compare with epsilon-greedy.
- Interpret the exploration bonus \(c \sqrt{\ln t / N(a)}\).

Theory
UCB1 (Upper Confidence Bound) chooses the action that maximizes an upper bound on the expected reward:

\[ a_t = \arg\max_a \left[ Q(a) + c \sqrt{\frac{\ln t}{N(a)}} \right] \]

- \(Q(a)\) is the sample mean reward for arm \(a\).
- \(N(a)\) is how many times arm \(a\) has been pulled.
- \(t\) is the total number of pulls so far.
- \(c\) is a constant (e.g. 2) that controls exploration.

The term \(c \sqrt{\ln t / N(a)}\) is an exploration bonus: arms that have been pulled less often (small \(N(a)\)) get a higher bonus, so they are tried more. As \(N(a)\) grows, the bonus shrinks. So UCB1 explores systematically rather than randomly (unlike epsilon-greedy). ...
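The selection rule can be sketched in a few lines (the convention of pulling unpulled arms first, to avoid dividing by zero, is ours; the numbers in the usage example are illustrative):

```python
import numpy as np

def ucb1_action(q, counts, t, c=2.0):
    """Arm maximizing Q(a) + c*sqrt(ln t / N(a)); unpulled arms go first."""
    q = np.asarray(q, dtype=float)
    counts = np.asarray(counts, dtype=float)
    if (counts == 0).any():
        return int(np.argmin(counts))    # pull each arm once before using the bound
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q + bonus))

# Arm 1 has the lower sample mean but a large bonus from only 2 pulls:
a = ucb1_action(q=[1.0, 0.4], counts=[100, 2], t=102)   # picks arm 1
```

This shows the systematic-exploration point: the rarely pulled arm wins the argmax despite its lower sample mean, and its bonus shrinks as it accumulates pulls.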

March 10, 2026 · 2 min · 319 words · codefrydev

Chapter 4: The Reward Hypothesis

Learning objectives
- State the reward hypothesis: that goals can be captured by scalar reward signals.
- Design a reward function for a concrete task and anticipate unintended behavior.
- Identify and fix “reward hacking” (exploiting the reward design instead of the intended goal).

Concept and real-world RL
The reward hypothesis says that we can capture what we want the agent to do by defining a scalar reward at each step; the agent’s goal is to maximize cumulative reward. In practice, reward design is hard: the agent will optimize exactly what you reward, so oversimplified or buggy rewards lead to reward hacking (e.g. the agent finds a loophole that yields high reward without achieving the real goal). Examples: a robot rewarded for “distance to goal” might push the goal; a game agent rewarded for “score” might find a way to increment score without playing. Self-driving, robotics, and game AI all require careful reward shaping and testing for exploits. ...
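As a concrete illustration of the design pattern discussed above, here is a hypothetical gridworld reward function (the penalty magnitude and signature are our assumptions, not from the chapter). Paying out only at the true goal, rather than a proxy like distance, leaves fewer loopholes to hack:

```python
def reward(state, goal, step_penalty=-0.01):
    """+1 only at the true goal; a small per-step penalty discourages wandering.

    Hypothetical example: rewarding the actual goal instead of a proxy
    (e.g. "distance to goal") removes the push-the-goal loophole.
    """
    return 1.0 if state == goal else step_penalty

assert reward((3, 3), (3, 3)) == 1.0     # reaching the goal
assert reward((0, 0), (3, 3)) == -0.01   # any other step
```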

March 10, 2026 · 4 min · 709 words · codefrydev

Gridworld

Learning objectives
- Define a gridworld MDP: grid cells as states, actions (up/down/left/right), transitions, and terminal states.
- Understand how hitting the boundary keeps the agent in place (or wraps, depending on design).
- Use gridworld as the running example for policy evaluation and policy iteration.

What is Gridworld?
Gridworld is a simple MDP used throughout RL teaching and research. The environment is a grid of cells (e.g. 4×4 or 5×5). The state is the agent’s position \((i, j)\). Actions are typically up, down, left, right. Transitions: taking an action moves the agent one cell in that direction; if the move would go off the grid, the agent either stays in place (and usually receives the same step reward) or the world wraps, depending on the specification. ...
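The stay-in-place variant of the transition rule can be sketched like this (grid size, goal cell, and the +1/0 rewards are illustrative assumptions):

```python
# Minimal 4x4 gridworld step function with stay-in-place boundary handling.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action, rows=4, cols=4, goal=(3, 3)):
    """Apply one move; off-grid moves leave the agent where it was."""
    di, dj = MOVES[action]
    i, j = state[0] + di, state[1] + dj
    if not (0 <= i < rows and 0 <= j < cols):
        i, j = state                     # hit the boundary: stay put
    done = (i, j) == goal
    return (i, j), (1.0 if done else 0.0), done

next_state, r, done = step((0, 0), "up")   # off-grid: stays at (0, 0)
```

Swapping the boundary branch for modular arithmetic (`i % rows`, `j % cols`) gives the wrapping variant mentioned above.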

March 10, 2026 · 2 min · 356 words · codefrydev

Bandits: Thompson Sampling

Learning objectives
- Understand the Bayesian view: maintain a posterior over each arm’s reward distribution.
- Implement Thompson Sampling for Bernoulli and Gaussian rewards.
- Compare Thompson Sampling with epsilon-greedy and UCB1.

Theory (pt 1): Bernoulli bandits
Suppose each arm gives a reward of 0 or 1 (e.g. click or no click). We model arm \(a\) as Bernoulli with unknown mean \(\theta_a\). A convenient prior is Beta: \(\theta_a \sim \text{Beta}(\alpha_a, \beta_a)\). After observing \(s\) successes and \(f\) failures from arm \(a\), the posterior is \(\text{Beta}(\alpha_a + s, \beta_a + f)\). ...
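A minimal Bernoulli Thompson Sampling sketch, assuming a uniform Beta(1, 1) prior per arm (the class and function names, arm means, and seed are ours, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

class BetaArm:
    """Beta posterior over a Bernoulli arm's mean, starting from Beta(1, 1)."""
    def __init__(self):
        self.alpha, self.beta = 1.0, 1.0
    def sample(self):
        return rng.beta(self.alpha, self.beta)   # one posterior draw
    def update(self, r):                         # r is 0 or 1
        self.alpha += r                          # success count
        self.beta += 1 - r                       # failure count

def thompson_step(arms, true_means):
    """Draw from each posterior, pull the argmax arm, update its posterior."""
    a = int(np.argmax([arm.sample() for arm in arms]))
    r = int(rng.random() < true_means[a])
    arms[a].update(r)
    return a

arms = [BetaArm() for _ in range(3)]
for _ in range(500):
    thompson_step(arms, [0.2, 0.5, 0.8])
pulls = [arm.alpha + arm.beta - 2 for arm in arms]   # posterior counts = pulls
```

Exploration here is automatic: wide posteriors (few pulls) sometimes produce the largest draw, and posteriors concentrate as evidence accumulates.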

March 10, 2026 · 2 min · 401 words · codefrydev

Chapter 5: Value Functions

Learning objectives
- Define the state-value function \(V^\pi(s)\) as the expected return from state \(s\) under policy \(\pi\).
- Write and solve the Bellman expectation equation for a small MDP.
- Use the matrix form (linear system) when the MDP is finite.

Concept and real-world RL
The state-value function \(V^\pi(s)\) is the expected (discounted) return starting from state \(s\) and following policy \(\pi\). It answers: “How good is it to be in this state if I follow this policy?” In games, \(V(s)\) is like the expected outcome from a board position; in navigation, it is the expected cumulative reward from a location. The Bellman expectation equation expresses \(V^\pi\) in terms of immediate reward and the value of the next state; for finite MDPs it becomes a linear system \(V = r + \gamma P V\) that we can solve by matrix inversion or iteration. ...
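The linear-system view can be sketched with NumPy on a hypothetical two-state chain (the policy, transition matrix, and rewards are made-up illustrations):

```python
import numpy as np

# Bellman expectation in matrix form, V = r + gamma * P V, under a fixed policy:
# s0 moves to s1 with reward 1; s1 is absorbing with reward 0.
gamma = 0.9
P = np.array([[0.0, 1.0],
              [0.0, 1.0]])
r = np.array([1.0, 0.0])

# Rearranged to (I - gamma * P) V = r and solved directly.
V = np.linalg.solve(np.eye(2) - gamma * P, r)
# V[1] = 0 (absorbing, no reward) and V[0] = 1 + gamma * V[1] = 1
```

Iterating \(V \leftarrow r + \gamma P V\) from any starting vector converges to the same fixed point, which is the idea behind iterative policy evaluation later in the course.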

March 10, 2026 · 3 min · 620 words · codefrydev