Curriculum


Gridworld discounted return from a sequence of actions.

Full course outline in basic-to-advanced order. Every topic with links to curriculum, prerequisites, and learning path.

10-armed testbed with epsilon-greedy vs greedy.
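A minimal sketch of the epsilon-greedy agent this exercise compares against a pure greedy one, assuming the standard stationary testbed (true action values drawn from N(0, 1), rewards from N(q*, 1)); function and parameter names here are illustrative, not from the course code:

```python
import numpy as np

def run_bandit(eps, steps=1000, k=10, seed=0):
    """One epsilon-greedy run on a k-armed Gaussian testbed; returns mean reward."""
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0.0, 1.0, k)       # true action values
    Q = np.zeros(k)                         # estimated values
    N = np.zeros(k)                         # action counts
    rewards = np.empty(steps)
    for t in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(k))        # explore uniformly
        else:
            a = int(np.argmax(Q))           # exploit the current estimate
        r = rng.normal(q_true[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]           # incremental sample average
        rewards[t] = r
    return rewards.mean()
```

Setting `eps=0` recovers the greedy baseline; averaging many runs reproduces the testbed learning curves.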

Detailed advice on pacing, prerequisites, exercises, and staying motivated through the RL curriculum.

Using optimistic initial Q-values to encourage early exploration in multi-armed bandits.

Two-state MDP transition probability matrices.

Upper Confidence Bound (UCB1) algorithm for multi-armed bandits—balance exploration and exploitation using uncertainty.
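The UCB1 selection rule can be sketched as follows (a simplified version with a tunable exploration constant `c`; the helper name is illustrative):

```python
import math

def ucb1_select(counts, values, c=2.0):
    """Pick an arm by UCB1: value estimate plus an uncertainty bonus
    that shrinks as an arm is pulled more often."""
    t = sum(counts)
    for a, n in enumerate(counts):
        if n == 0:
            return a                        # play each arm once first
    scores = [values[a] + c * math.sqrt(math.log(t) / counts[a])
              for a in range(len(counts))]
    return max(range(len(counts)), key=scores.__getitem__)
```

Arms with few pulls get a large bonus, so the rule explores exactly where the value estimate is most uncertain.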

Reward function for self-driving car and reward hacking.

The classic gridworld environment: states, actions, transitions, and terminal states.

Bayesian bandits and Thompson Sampling—sample from the posterior to balance exploration and exploitation.

State-value function V^π for random policy on Chapter 3 MDP.

How to design reward signals for MDPs and gridworld—shaping, terminal rewards, and step penalties.

When reward distributions change over time—exponential recency-weighted average and constant step size.
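The constant-step-size update at the heart of this topic is a one-liner; with a fixed `alpha`, the weight on a reward received k steps ago decays as alpha(1 - alpha)^k, which is why it tracks nonstationary targets:

```python
def update_constant_alpha(Q, r, alpha=0.1):
    """Exponential recency-weighted average: recent rewards weigh more,
    so the estimate keeps tracking a drifting reward distribution."""
    return Q + alpha * (r - Q)
```
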

Derive Bellman optimality equation for Q*(s,a).
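One standard form of the target equation, in the notation of Sutton and Barto with p(s', r | s, a) the MDP dynamics:

```latex
Q^*(s,a) = \mathbb{E}\!\left[\, R_{t+1} + \gamma \max_{a'} Q^*(S_{t+1}, a') \,\middle|\, S_t = s,\, A_t = a \right]
         = \sum_{s',\, r} p(s', r \mid s, a)\left[\, r + \gamma \max_{a'} Q^*(s', a') \,\right]
```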

When to implement bandits from scratch vs. use existing libraries—learning goals and control.

Iterative policy evaluation on 4×4 gridworld.

Policy iteration and comparison with value iteration.

Gridworld with wind: actions are shifted by a wind effect. Theory and code for policy evaluation and policy iteration.

Value iteration on 4×4 gridworld, optimal V and policy.
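A compact sketch of value iteration on a 4x4 gridworld, assuming the Sutton and Barto Example 4.1 layout (terminal corners, reward -1 per step, moves into walls leave the agent in place); this is one plausible implementation, not the course's reference solution:

```python
import numpy as np

def value_iteration_4x4(gamma=1.0, theta=1e-6):
    """In-place value iteration sweeps until the max update falls below theta."""
    n = 4
    terminal = {(0, 0), (n - 1, n - 1)}
    V = np.zeros((n, n))
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    while True:
        delta = 0.0
        for i in range(n):
            for j in range(n):
                if (i, j) in terminal:
                    continue
                best = -np.inf
                for di, dj in moves:
                    ni = min(max(i + di, 0), n - 1)   # bump into wall: stay put
                    nj = min(max(j + dj, 0), n - 1)
                    best = max(best, -1.0 + gamma * V[ni, nj])
                delta = max(delta, abs(best - V[i, j]))
                V[i, j] = best
        if delta < theta:
            return V
```

The optimal V is minus the number of steps to the nearest terminal corner; the greedy policy with respect to it is the shortest-path policy.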

Code walkthrough for gridworld, iterative policy evaluation, and policy iteration.

Counting states and transitions for a 10×10 gridworld; motivating function approximation.

First-visit MC prediction for blackjack.

TD(0) prediction for blackjack; compare with Monte Carlo.

Code walkthrough for Monte Carlo policy evaluation and Monte Carlo control, with and without exploring starts.

SARSA on Cliff Walking; plot sum of rewards per episode.

Code walkthrough for TD(0) prediction, SARSA, and Q-learning (tabular).

Q-learning on Cliff Walking; compare with SARSA.
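The two updates being compared here differ only in the bootstrap target; a minimal sketch (helper names are illustrative, `Q` is a states-by-actions array):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.5, gamma=1.0):
    """On-policy: bootstrap from the action actually taken next (a2)."""
    Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])

def q_learning_update(Q, s, a, r, s2, alpha=0.5, gamma=1.0):
    """Off-policy: bootstrap from the greedy next action, regardless of
    what the behavior policy actually does."""
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
```

On Cliff Walking this difference is exactly why Q-learning learns the risky optimal path while SARSA learns the safer one under epsilon-greedy behavior.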

Expected SARSA vs Q-learning; variance and learning curves.

n-step SARSA (n=4) on Cliff Walking.

Custom 2D maze Gym env with text render.

Grid search over α and ε for Q-learning on Cliff Walking.

Memory required to store a Backgammon Q-table; why function approximation is necessary.

Linear FA with tile coding for MountainCar; semi-gradient SARSA.

Designing state and state-action features for linear value approximation.

The CartPole (Inverted Pendulum) environment: state, actions, and solving it with value-based or policy-based methods.

Two-hidden-layer PyTorch network for Q-values; MSE loss.

DQN for CartPole with replay and target network.

Replay buffer class with push and sample.
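One common way to write the buffer this exercise asks for, using a fixed-capacity deque with uniform sampling (a sketch, not the course's reference class):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # old transitions drop off the left

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        # transpose list-of-transitions into (states, actions, rewards, next_states, dones)
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)
```
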

Hard vs soft target updates in DQN.

Double DQN: online selects, target evaluates; compare with DQN.
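The "online selects, target evaluates" decoupling can be sketched in a few lines (numpy here for brevity; the same logic applies batched in PyTorch):

```python
import numpy as np

def double_dqn_targets(q_online_next, q_target_next, rewards, dones, gamma=0.99):
    """Double DQN targets: the online net picks the argmax action, the target
    net supplies its value, reducing vanilla DQN's maximization bias."""
    best_actions = np.argmax(q_online_next, axis=1)                 # selection
    q_eval = q_target_next[np.arange(len(rewards)), best_actions]   # evaluation
    return rewards + gamma * (1.0 - dones) * q_eval
```

Vanilla DQN would instead take `q_target_next.max(axis=1)`, letting the same network both select and evaluate, which overestimates.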

Dueling architecture V(s) + A(s,a); compare with DQN.

Sum-tree prioritized buffer with TD error; importance-sampling weights.

Noisy linear layers with factorized Gaussian; compare with ε-greedy.

Combine DDQN, Dueling, PER, NoisyNet, multi-step; train on Pong.

When a stochastic policy is essential; why deterministic fails.

Derive policy gradient theorem for one-step MDP.

REINFORCE for CartPole with softmax policy; note variance.
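The core gradient step, reduced to a stateless softmax bandit for brevity (assumptions: no baseline, parameters are per-action logits `theta`; the CartPole version applies the same grad-log-pi-times-return rule per timestep):

```python
import numpy as np

def reinforce_softmax_step(theta, actions, returns, lr=0.01):
    """REINFORCE ascent: theta += lr * G * grad log pi(a); for a softmax
    policy, grad log pi(a) = one_hot(a) - probs."""
    for a, G in zip(actions, returns):
        probs = np.exp(theta - theta.max())
        probs /= probs.sum()
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta = theta + lr * G * grad_log_pi
    return theta
```

Because G enters the update raw, the gradient variance tracks the variance of the returns, which motivates the baseline in the next exercise.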

State-value baseline with REINFORCE; compare gradient variance.

Sketch two-network actor-critic; pseudocode for TD error updates.

A2C for CartPole with TD error as advantage; sync multi-env.

A3C with multiprocessing workers; compare speed with A2C.

Policy network for Pendulum: Gaussian mean and log-std; log-prob.

DDPG for Pendulum with OU noise and target networks.

TD3: clipped double Q, delayed policy, target smoothing.

Large step size and policy collapse in bandit; visualize probabilities.

TRPO constrained optimization and natural gradient; KL constraint.

Clipped surrogate objective; contrast with unclipped.

Generalized Advantage Estimation (GAE) function.
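A sketch of the backward-recursive GAE computation (signature and names are illustrative; `lam` trades bias at 0 against variance at 1):

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95, last_value=0.0):
    """GAE: exponentially weighted sum of TD errors delta_t, accumulated
    backward through the rollout and cut at episode boundaries."""
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        advantages[t] = gae
    return advantages
```

Adding `values` back to the advantages gives the lambda-returns used as value-function targets in PPO.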

Full PPO for LunarLanderContinuous with GAE and rollout buffer.

Max-entropy objective; why entropy encourages exploration.

SAC for HalfCheetah with automatic temperature tuning.

Compare SAC and PPO on Hopper, Walker2d; when to choose which.

Custom 2D point mass with continuous action; test with SAC.

Weights & Biases sweep for SAC on custom env.

Compare Dreamer and PPO sample efficiency on Walker.

Train NN to predict next state from CartPole; compounding error.

BFS planner for gridworld; compare with DP.

MCTS for tic-tac-toe with UCT; play vs random.
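The UCT child-selection rule at the heart of this exercise, sketched over a simple dict-based node representation (an assumption for illustration, not the course's tree structure):

```python
import math

def uct_select(children, c=1.414):
    """UCT: pick the child maximizing mean value plus an exploration bonus;
    unvisited children are tried first (score treated as infinite)."""
    total = sum(ch["visits"] for ch in children)

    def score(ch):
        if ch["visits"] == 0:
            return math.inf
        return ch["value"] / ch["visits"] + c * math.sqrt(math.log(total) / ch["visits"])

    return max(range(len(children)), key=lambda i: score(children[i]))
```
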

Mini AlphaZero for tic-tac-toe: NN + MCTS, self-play.

MuZero: model in latent space; reward prediction.

Simplified Dreamer: RSSM, imagination phase, actor-critic.

MBPO: ensemble dynamics, short rollouts, SAC buffer.

Plot true vs predicted states; compounding error visualization.

DQN with ε-greedy on Montezuma's Revenge; sparse rewards.

State visitation count bonus; exploration in gridworld.
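The count bonus can be sketched as a small wrapper around a visit-count table, assuming the common beta / sqrt(N(s)) form (class and parameter names are illustrative):

```python
from collections import defaultdict

class CountBonus:
    """Intrinsic reward beta / sqrt(N(s)): rarely visited states pay more,
    so the bonus steers the agent toward unexplored regions."""

    def __init__(self, beta=0.1):
        self.counts = defaultdict(int)
        self.beta = beta

    def bonus(self, state):
        self.counts[state] += 1
        return self.beta / (self.counts[state] ** 0.5)
```

Adding this bonus to the environment reward turns a sparse-reward gridworld into one with a dense exploration signal.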

ICM: forward model, prediction error as intrinsic reward; A2C on maze.

RND: fixed target, predictor; prediction error as intrinsic reward.

Count-based with hash table; pseudo-counts with density model for images.

Simplified Go-Explore on deterministic maze; archive and return.

Task distribution (e.g. goal positions); meta-training loop, few-step adapt.

MAML for locomotion (e.g. different velocities); one-step adapt.

RNN policy with (state, action, reward, done) input; POMDP tasks.

Simple PAIRED: adversary designs maze, agent solves; train both.

Random policy dataset on Hopper; naive SAC overestimation.

CQL loss penalizing Q for OOD actions; compare with naive SAC.

Decision Transformer: returns-to-go, states, actions; a GPT-style model predicts actions.

Expert demos from PPO on LunarLander; behavioral cloning.

Covariate shift; DAgger: mix expert and BC, retrain.

Max-ent IRL: learn reward from expert; linear reward, forward RL.

Discriminator expert vs agent; use as reward for policy gradient.

AMP paper: task reward + adversarial style reward; combined reward.

Pretrain SAC offline; fine-tune online; Q-filter for bad actions.

Bradley-Terry from pairwise comparisons; train policy with PPO.
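The reward-model training objective here is the Bradley-Terry negative log-likelihood over pairwise comparisons; a minimal sketch (function name illustrative, inputs are per-pair scalar rewards):

```python
import numpy as np

def bradley_terry_nll(r_chosen, r_rejected):
    """NLL of preferences under Bradley-Terry:
    P(chosen > rejected) = sigmoid(r_chosen - r_rejected).
    logaddexp(0, -x) is a numerically stable -log sigmoid(x)."""
    return float(np.mean(np.logaddexp(0.0, -(r_chosen - r_rejected))))
```

Minimizing this pushes the reward model to score preferred responses above rejected ones; the learned reward then drives the PPO step.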

Model Rock-Paper-Scissors as Dec-POMDP.

Nash equilibrium of a 2×2 matrix game; outcome of independent learning.

IQL in cooperative meet-up game; non-stationarity.

Explain CTDE with example; why it helps non-stationarity.

MADDPG on simple spread; centralized critics, decentralized actors.

VDN: sum individual Q to joint Q; compare with IQL.

QMIX: mixing network, monotonicity via hypernetworks.

MAPPO with parameter sharing; centralized value; compare with IPPO.

Self-play in Tic-Tac-Toe; track Elo ratings.

Agents output message + action; train for coordination task.

Train in sim (e.g. arm reaching); domain randomization; sim-to-real.

Constrained MDP for self-driving; Lagrangian penalty.

Simple stock MDP: buy/sell/hold; profit reward; Sharpe ratio.

Toy recommender, 100 items, changing user; maximize engagement.

PPO fine-tune small LM (e.g. GPT-2) for sentiment; KL penalty.

Simulated preference data; Bradley-Terry reward model; PPO fine-tune.

DPO loss from Bradley-Terry and KL-optimal policy; compare with PPO.
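The DPO loss this exercise derives, sketched in numpy (a simplified scalar version; in practice the log-probabilities are sequence sums from the policy and a frozen reference model):

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: logistic loss on the difference of policy-vs-reference log-ratios,
    obtained by plugging the KL-optimal policy into the Bradley-Terry model.
    logaddexp(0, -x) is a stable -log sigmoid(x)."""
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return float(np.mean(np.logaddexp(0.0, -logits)))
```

Unlike the PPO pipeline, no explicit reward model or rollout phase is needed; the preference data trains the policy directly.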

PPO on 10 seeds; mean, std; rliable confidence intervals.

Broken SAC: unit tests, logging Q/reward/entropy; diagnose.

Essay: foundation models and RL; architectures; path toward AGI.

You have mastered the foundations. Now, combine neural networks with RL for high-dimensional problems like Atari or robotics.

Short guide: follow the order, do the exercises, use the learning path and assessments.

How this curriculum relates to Sutton and Barto's Reinforcement Learning: An Introduction and other books.

Repository and code for the Reinforcement Learning curriculum exercises.

Where to find step-by-step solutions and worked examples across the RL curriculum site.

Browse all pages in the Reinforcement Learning Curriculum — chapters, prerequisites, assessments, and learning-path content.