Curriculum
Gridworld discounted return from a sequence of actions.
Full course outline in basic-to-advanced order, with every topic linked to the curriculum, prerequisites, and learning path.
10-armed testbed with epsilon-greedy vs greedy.
Detailed advice on pacing, prerequisites, exercises, and staying motivated through the RL curriculum.
Using optimistic initial Q-values to encourage early exploration in multi-armed bandits.
Two-state MDP transition probability matrices.
Upper Confidence Bound (UCB1) algorithm for multi-armed bandits—balance exploration and exploitation using uncertainty.
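The UCB1 selection rule above can be sketched in a few lines; this is a minimal illustration, not the page's implementation, and the function name `ucb1_select` is my own.

```python
import math

def ucb1_select(counts, values, c=2.0):
    """Pick the arm maximizing Q(a) + c * sqrt(ln t / N(a)).

    counts: number of pulls per arm; values: running mean reward per arm.
    Arms never tried (count 0) are selected first, since their bound is infinite.
    """
    for a, n in enumerate(counts):
        if n == 0:
            return a  # try every arm once before applying the bound
    t = sum(counts)
    scores = [values[a] + c * math.sqrt(math.log(t) / counts[a])
              for a in range(len(counts))]
    return max(range(len(counts)), key=lambda a: scores[a])
```

The bonus term shrinks as an arm is pulled more often, so rarely tried arms keep getting revisited even when their mean reward looks poor.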
Reward function for self-driving car and reward hacking.
The classic gridworld environment: states, actions, transitions, and terminal states.
Bayesian bandits and Thompson Sampling—sample from the posterior to balance exploration and exploitation.
State-value function V^π for random policy on Chapter 3 MDP.
How to design reward signals for MDPs and gridworld—shaping, terminal rewards, and step penalties.
When reward distributions change over time—exponential recency-weighted average and constant step size.
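The constant step-size update for non-stationary bandits is compact enough to sketch here; a minimal version, with the helper name `update` assumed for illustration:

```python
def update(q, reward, alpha=0.1):
    """Constant step-size incremental update: Q <- Q + alpha * (R - Q).

    With constant alpha, the weight on old rewards decays geometrically
    (exponential recency-weighted average), so the estimate tracks a
    reward distribution that drifts over time.
    """
    return q + alpha * (reward - q)
```

Contrast with the sample-average rule, where alpha = 1/n and all past rewards weigh equally.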
Derive Bellman optimality equation for Q*(s,a).
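For reference, the target of that derivation is the standard Bellman optimality equation for action values:

```latex
Q^*(s,a) = \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[\, r + \gamma \max_{a'} Q^*(s', a') \,\bigr]
```

The max over next actions, rather than an expectation under a policy, is what distinguishes it from the Bellman equation for Q^π.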
When to implement bandits from scratch vs. use existing libraries—learning goals and control.
Iterative policy evaluation on 4×4 gridworld.
Policy iteration and comparison with value iteration.
Gridworld with wind: actions are shifted by a wind effect. Theory and code for policy evaluation and policy iteration.
Value iteration on 4×4 gridworld, optimal V and policy.
Code walkthrough for gridworld, iterative policy evaluation, and policy iteration.
State and transition count for 10×10 gridworld; function approximation.
First-visit MC prediction for blackjack.
TD(0) prediction for blackjack; compare with Monte Carlo.
Code walkthrough for Monte Carlo policy evaluation and Monte Carlo control, with and without exploring starts.
SARSA on Cliff Walking; plot sum of rewards per episode.
Code walkthrough for TD(0) prediction, SARSA, and Q-learning (tabular).
Q-learning on Cliff Walking; compare with SARSA.
Expected SARSA vs Q-learning; variance and learning curves.
n-step SARSA (n=4) on Cliff Walking.
Dyna-Q on 4×4 deterministic gridworld.
Custom 2D maze Gym env with text render.
Grid search over α and ε for Q-learning on Cliff Walking.
Memory required to store a Backgammon Q-table; why function approximation is necessary.
Linear FA with tile coding for MountainCar; semi-gradient SARSA.
Designing state and state-action features for linear value approximation.
The CartPole (Inverted Pendulum) environment: state, actions, and solving it with value-based or policy-based methods.
Two-hidden-layer PyTorch network for Q-values; MSE loss.
DQN for CartPole with replay and target network.
Replay buffer class with push and sample.
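A replay buffer with `push` and `sample` can be sketched with a deque and uniform sampling; a minimal version under those assumptions, not the page's exact class:

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO experience buffer with uniform random sampling."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        # transpose a list of transitions into per-field tuples
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly from old transitions breaks the temporal correlation in the data stream, which is what stabilizes DQN training.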
Hard vs soft target updates in DQN.
Double DQN: online selects, target evaluates; compare with DQN.
Dueling architecture V(s) + A(s,a); compare with DQN.
Sum-tree prioritized buffer with TD error; importance-sampling weights.
Noisy linear layers with factorized Gaussian; compare with ε-greedy.
Combine DDQN, Dueling, PER, NoisyNet, multi-step; train on Pong.
When a stochastic policy is essential; why deterministic fails.
Derive policy gradient theorem for one-step MDP.
REINFORCE for CartPole with softmax policy; note variance.
State-value baseline with REINFORCE; compare gradient variance.
Sketch two-network actor-critic; pseudocode for TD error updates.
A2C for CartPole with TD error as advantage; sync multi-env.
A3C with multiprocessing workers; compare speed with A2C.
Policy network for Pendulum: Gaussian mean and log-std; log-prob.
DDPG for Pendulum with OU noise and target networks.
TD3: clipped double Q, delayed policy, target smoothing.
Large step size and policy collapse in bandit; visualize probabilities.
TRPO constrained optimization and natural gradient; KL constraint.
Clipped surrogate objective; contrast with unclipped.
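The clipped surrogate can be written per sample in a few lines; a minimal NumPy sketch (the function name is my own, and a real PPO loss would average the negation over a batch):

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped objective: min(r * A, clip(r, 1-eps, 1+eps) * A).

    ratio = pi_new(a|s) / pi_old(a|s); advantage is the estimate of A(s,a).
    Taking the min removes any incentive to push the ratio far
    outside [1 - eps, 1 + eps], unlike the unclipped objective r * A.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)
```

Note the asymmetry: the min makes the objective pessimistic, so clipping only ever reduces the value, never increases it.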
Generalized Advantage Estimation (GAE) function.
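The GAE recursion runs backwards over a rollout; a minimal sketch assuming `values` carries one extra bootstrap entry (the signature is my own, not the page's):

```python
def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    values must have len(rewards) + 1 entries (bootstrap value appended).
    delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
    A_t     = delta_t + gamma * lam * (1 - done_t) * A_{t+1}
    """
    advantages = [0.0] * len(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        advantages[t] = last
    return advantages
```

lam = 0 recovers the one-step TD advantage (low variance, high bias); lam = 1 recovers the Monte Carlo advantage (high variance, low bias).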
Full PPO for LunarLanderContinuous with GAE and rollout buffer.
Max-entropy objective; why entropy encourages exploration.
SAC for HalfCheetah with automatic temperature tuning.
Compare SAC and PPO on Hopper, Walker2d; when to choose which.
Custom 2D point mass with continuous action; test with SAC.
Weights & Biases sweep for SAC on custom env.
Compare Dreamer and PPO sample efficiency on Walker.
Train NN to predict next state from CartPole; compounding error.
BFS planner for gridworld; compare with DP.
MCTS for tic-tac-toe with UCT; play vs random.
Mini AlphaZero for tic-tac-toe: NN + MCTS, self-play.
MuZero: model in latent space; reward prediction.
Simplified Dreamer: RSSM, imagination phase, actor-critic.
MBPO: ensemble dynamics, short rollouts, SAC buffer.
PETS: ensemble dynamics, MPC with random shooting.
Plot true vs predicted states; compounding error visualization.
DQN with ε-greedy on Montezuma's Revenge; sparse rewards.
State visitation count bonus; exploration in gridworld.
ICM: forward model, prediction error as intrinsic reward; A2C on maze.
RND: fixed target, predictor; prediction error as intrinsic reward.
Count-based with hash table; pseudo-counts with density model for images.
Simplified Go-Explore on deterministic maze; archive and return.
Task distribution (e.g. goal positions); meta-training loop, few-step adapt.
MAML for locomotion (e.g. different velocities); one-step adapt.
RNN policy with (state, action, reward, done) input; POMDP tasks.
Simple PAIRED: adversary designs maze, agent solves; train both.
Random policy dataset on Hopper; naive SAC overestimation.
CQL loss penalizing Q for OOD actions; compare with naive SAC.
Decision Transformer: returns-to-go, states, actions; GPT-like predict actions.
Expert demos from PPO on LunarLander; behavioral cloning.
Covariate shift; DAgger: mix expert and BC, retrain.
Max-ent IRL: learn reward from expert; linear reward, forward RL.
Discriminator expert vs agent; use as reward for policy gradient.
AMP paper: task reward + adversarial style reward; combined reward.
Pretrain SAC offline; finetune online; Q-filter for bad actions.
Bradley-Terry from pairwise comparisons; train policy with PPO.
Model Rock-Paper-Scissors as Dec-POMDP.
Nash equilibrium of a 2×2 matrix game; outcome of independent learning.
IQL in cooperative meet-up game; non-stationarity.
Explain CTDE with example; why it helps non-stationarity.
MADDPG on simple spread; centralized critics, decentralized actors.
VDN: sum individual Q to joint Q; compare with IQL.
QMIX: mixing network, monotonicity via hypernetworks.
MAPPO with parameter sharing; centralized value; compare with IPPO.
Self-play in Tic-Tac-Toe; track Elo ratings.
Agents output message + action; train for coordination task.
Train in sim (e.g. arm reaching); domain randomization; sim-to-real.
Constrained MDP for self-driving; Lagrangian penalty.
Simple stock MDP: buy/sell/hold; profit reward; Sharpe ratio.
Toy recommender, 100 items, changing user; maximize engagement.
PPO fine-tune small LM (e.g. GPT-2) for sentiment; KL penalty.
Simulated preference data; Bradley-Terry reward model; PPO finetune.
DPO loss from Bradley-Terry and KL-optimal policy; compare with PPO.
PPO on 10 seeds; mean, std; rliable confidence intervals.
Broken SAC: unit tests, logging Q/reward/entropy; diagnose.
Essay: foundation models and RL; architectures; path toward AGI.
You have mastered the foundations. Now, combine neural networks with RL for high-dimensional problems like Atari or robotics.
Short guide: follow the order, do the exercises, use the learning path and assessments.
How this curriculum relates to Sutton and Barto's Reinforcement Learning: An Introduction and other books.
Repository and code for the Reinforcement Learning curriculum exercises.
Where to find step-by-step solutions and worked examples across the RL curriculum site.
Browse all pages in the Reinforcement Learning Curriculum — chapters, prerequisites, assessments, and learning-path content.