Chapter 1: The Reinforcement Learning Framework

Learning objectives
- Identify the main components of an RL system: agent, environment, state, action, reward.
- Compute the discounted return for a sequence of rewards.
- Relate the gridworld to real tasks (e.g. navigation, games) where an agent gets delayed reward.

Concept and real-world RL
In reinforcement learning, an agent interacts with an environment: at each step the agent is in a state, chooses an action, and receives a reward and a new state. The return is the sum of (discounted) rewards along a trajectory; the agent’s goal is to maximize this return. A gridworld is a simple environment where states are cells and actions move the agent; it models robot navigation (e.g. a robot moving to a goal in a warehouse) and game AI (e.g. a character moving on a map). In robot navigation, the state might be (row, col); the action is up/down/left/right; the reward is +1 at the goal and often 0 or a small penalty per step. Discounting (\(\gamma < 1\)) makes future rewards worth less than immediate ones and keeps the return finite over long or infinite horizons. ...
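The discounted-return computation above can be sketched in a few lines of Python (a minimal illustration; the function name and the example trajectory are ours, not the chapter's):

```python
def discounted_return(rewards, gamma=0.9):
    """Return sum_t gamma**t * r_t for a finite reward sequence."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# A +1 reward three steps away, zero elsewhere: worth gamma**2 = 0.81 today.
g = discounted_return([0.0, 0.0, 1.0], gamma=0.9)
```

With \(\gamma = 1\) the return is the plain sum of rewards; smaller \(\gamma\) weights near-term rewards more heavily.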

March 10, 2026 · 4 min · 748 words · codefrydev

Course Outline

This page lists every topic in the intended order: from welcome and bandits through MDPs, dynamic programming, Monte Carlo, temporal difference, approximation methods, projects, and appendix. Follow this outline for a clear basic-to-advanced path. Each item links to the relevant curriculum chapter, prerequisite, or dedicated page.

Welcome
- Introduction · Home
- Course Outline and Big Picture · This page
- Where to get the Code · Dedicated page
- How to Succeed in this Course · Dedicated page

Warmup — Multi-Armed Bandit
- Section Introduction: The Explore-Exploit Dilemma · Chapter 2: Multi-Armed Bandits
- Applications of the Explore-Exploit Dilemma · Chapter 2
- Epsilon-Greedy Theory · Chapter 2
- Calculating a Sample Mean (pt 1) · Math for RL: Probability
- Epsilon-Greedy Beginner’s Exercise Prompt · Chapter 2
- Designing Your Bandit Program · Chapter 2
- Epsilon-Greedy in Code · Chapter 2
- Comparing Different Epsilons · Chapter 2
- Optimistic Initial Values Theory · Chapter 2 (hints); Bandits: Optimistic Initial Values
- Optimistic Initial Values Beginner’s Exercise Prompt · Bandits: Optimistic Initial Values
- Optimistic Initial Values Code · Bandits: Optimistic Initial Values
- UCB1 Theory · Dedicated page
- UCB1 Beginner’s Exercise Prompt · Bandits: UCB1
- UCB1 Code · Bandits: UCB1
- Bayesian Bandits / Thompson Sampling Theory (pt 1) · Dedicated page
- Bayesian Bandits / Thompson Sampling Theory (pt 2) · Bandits: Thompson Sampling
- Thompson Sampling Beginner’s Exercise Prompt · Bandits: Thompson Sampling
- Thompson Sampling Code · Bandits: Thompson Sampling
- Thompson Sampling With Gaussian Reward Theory · Bandits: Thompson Sampling
- Thompson Sampling With Gaussian Reward Code · Bandits: Thompson Sampling
- Exercise on Gaussian Rewards · Bandits: Thompson Sampling
- Why don’t we just use a library? · Dedicated page
- Nonstationary Bandits · Dedicated page
- Bandit Summary, Real Data, and Online Learning · Chapter 2; Bandits: Nonstationary
- (Optional) Alternative Bandit Designs · Chapter 2

High-Level Overview of Reinforcement Learning
- What is Reinforcement Learning? · Chapter 1
- From Bandits to Full Reinforcement Learning · Chapter 1, Chapter 2
- Markov Decision Processes · Chapter 3

MDP Section
- MDP Section Introduction · Chapter 3: MDPs
- Gridworld · Dedicated page
- Choosing Rewards · Dedicated page
- The Markov Property · Chapter 3
- Markov Decision Processes (MDPs) · Chapter 3
- Future Rewards · Chapter 4: Reward Hypothesis, Chapter 5: Value Functions
- Value Functions · Chapter 5
- The Bellman Equation (pt 1–3) · Chapter 6: The Bellman Equations
- Bellman Examples · Chapter 6
- Optimal Policy and Optimal Value Function (pt 1–2) · Chapter 6
- MDP Summary · Chapter 3 – Chapter 6

Dynamic Programming
- Dynamic Programming Section Introduction · Volume 1
- Iterative Policy Evaluation · Chapter 7
- Designing Your RL Program · Chapter 7
- Gridworld in Code · Dedicated page
- Iterative Policy Evaluation in Code · Dedicated page
- Windy Gridworld · Dedicated page
- Iterative Policy Evaluation for Windy Gridworld · Windy Gridworld
- Policy Improvement · Chapter 8: Policy Iteration
- Policy Iteration · Chapter 8
- Policy Iteration in Code · Chapter 8; DP code walkthrough
- Policy Iteration in Windy Gridworld · Windy Gridworld
- Value Iteration · Chapter 9
- Value Iteration in Code · Chapter 9; DP code walkthrough
- Dynamic Programming Summary · Chapter 10: Limitations of DP

Monte Carlo
- Monte Carlo Intro · Chapter 11
- Monte Carlo Policy Evaluation · Chapter 11
- Monte Carlo Policy Evaluation in Code · Dedicated page
- Monte Carlo Control · Chapter 11
- Monte Carlo Control in Code · Monte Carlo in Code
- Monte Carlo Control without Exploring Starts · Chapter 11; Monte Carlo in Code
- Monte Carlo Control without Exploring Starts in Code · Monte Carlo in Code
- Monte Carlo Summary · Chapter 11

Temporal Difference Learning
- Temporal Difference Introduction · Chapter 12
- TD(0) Prediction · Chapter 12
- TD(0) Prediction in Code · Dedicated page
- SARSA · Chapter 13
- SARSA in Code · TD, SARSA, Q-Learning in Code
- Q-Learning · Chapter 14
- Q-Learning in Code · TD, SARSA, Q-Learning in Code
- TD Learning Section Summary · Chapter 12 – Chapter 14

Approximation Methods
- Approximation Methods Section Introduction · Volume 3
- Linear Models for Reinforcement Learning · Chapter 21
- Feature Engineering · Dedicated page
- Approximation Methods for Prediction · Chapter 21
- Approximation Methods for Prediction Code · Chapter 21
- Approximation Methods for Control · Chapter 22 – Chapter 30
- Approximation Methods for Control Code · Volume 3
- CartPole · Dedicated page
- CartPole Code · CartPole
- Approximation Methods Exercise · Volume 3 chapters
- Approximation Methods Section Summary · Volume 3

Interlude: Common Beginner Questions
- This Course vs. RL Book: What’s the Difference? · Dedicated page

Stock Trading Project with Reinforcement Learning (Dedicated section)
- Beginners, halt! Stop here if you skipped ahead · Stock Trading intro
- Stock Trading Project Section Introduction · Stock Trading
- Data and Environment · Stock Trading: Data and Environment
- How to Model Q for Q-Learning · Stock Trading: How to Model Q
- Design of the Program · Stock Trading: Design
- Code pt 1–4 · Stock Trading
- Stock Trading Project Discussion · Stock Trading

Appendix / FAQ
- What is the Appendix? · Appendix index
- Setting Up Your Environment · Dedicated page
- Pre-Installation Check · Setting Up Your Environment
- Anaconda Environment Setup · Dedicated page
- How to install Numpy, Scipy, Matplotlib, Pandas, IPython, Theano, TensorFlow · Installing Libraries
- How to Code by Yourself (part 1) · Dedicated page
- How to Code by Yourself (part 2) · Dedicated page
- Proof that using Jupyter Notebook is the same as not using it · Appendix
- Python 2 vs Python 3 · Prerequisites: Python
- Effective Learning Strategies · Dedicated page
- How to Succeed in this Course (Long Version) · Dedicated page
- Is this for Beginners or Experts? Academic or Practical? Pace · Dedicated page
- Machine Learning and AI Prerequisite Roadmap (pt 1–2) · Dedicated page

Part 2 — Advanced (Volumes 4–10)
After the topics above, the curriculum continues with 70 more chapters in order: ...

March 10, 2026 · 5 min · 1002 words · codefrydev

Chapter 2: Multi-Armed Bandits

Learning objectives
- Implement a multi-armed bandit environment with Gaussian rewards.
- Compare epsilon-greedy and greedy policies in terms of average reward and regret.
- Recognize the exploration–exploitation trade-off in a simple setting.

Concept and real-world RL
A multi-armed bandit is an RL problem with a single state: the agent repeatedly chooses an “arm” (action) and receives a reward drawn from a distribution associated with that arm. The goal is to maximize cumulative reward. Exploration (trying different arms) is needed to discover which arm has the highest mean; exploitation (choosing the best arm so far) maximizes immediate reward. In practice, bandits model A/B testing, clinical trials, and recommender systems (which ad or item to show). The 10-armed testbed is a standard benchmark: 10 arms with different unknown means; the agent learns from experience. ...
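A minimal sketch of the epsilon-greedy loop on a Gaussian bandit, assuming unit-variance rewards and an incremental sample-mean update (the arm means, seed, and function name are illustrative, not from the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)

def run_epsilon_greedy(true_means, eps=0.1, steps=1000):
    """Epsilon-greedy on a Gaussian bandit (unit-variance reward per arm)."""
    k = len(true_means)
    q = np.zeros(k)                      # sample-mean estimates
    counts = np.zeros(k)                 # pulls per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(k))     # explore: random arm
        else:
            a = int(np.argmax(q))        # exploit: current best estimate
        r = rng.normal(true_means[a], 1.0)
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]   # incremental sample mean
        total += r
    return q, total / steps

q, avg = run_epsilon_greedy([0.1, 0.5, 1.0])
```

Sweeping `eps` over a few values and plotting average reward per step is the standard way to see the explore-exploit trade-off on this testbed.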

March 10, 2026 · 4 min · 679 words · codefrydev

Bandits: Optimistic Initial Values

Learning objectives
- Understand why initializing action values optimistically can encourage exploration.
- Implement optimistic initial values and compare with epsilon-greedy on the 10-armed testbed.
- Recognize when optimistic initialization helps (stationary, deterministic-ish) and when it does not (nonstationary).

Theory
Optimistic initial values mean we set \(Q(a)\) to a value higher than the typical reward at the start (e.g. \(Q(a) = 5\) when rewards are usually in \([-2, 2]\)). The agent then chooses the arm with the highest \(Q(a)\). After a pull, the running mean update \(\bar{Q}_{n+1} = \bar{Q}_n + \frac{1}{n+1}(r - \bar{Q}_n)\) brings \(Q(a)\) down toward the true mean. So every arm looks “good” at first; as an arm is pulled, its \(Q\) drops toward reality. The agent is naturally encouraged to try all arms before settling, which is a form of exploration without epsilon. ...
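The mechanism can be sketched as follows. One design choice of this sketch (not prescribed by the text): the optimistic prior is counted as one pseudo-observation in the running mean, so the optimism fades gradually instead of vanishing after the first pull; the arm means and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def optimistic_greedy(true_means, q0=5.0, steps=1000):
    """Pure greedy selection with optimistic initial estimates (no epsilon)."""
    k = len(true_means)
    q = np.full(k, q0)           # every arm starts out looking great
    counts = np.zeros(k)
    for _ in range(steps):
        a = int(np.argmax(q))
        r = rng.normal(true_means[a], 1.0)
        counts[a] += 1
        # running mean with q0 counted as one pseudo-observation
        q[a] += (r - q[a]) / (counts[a] + 1.0)
    return counts

counts = optimistic_greedy([0.1, 0.5, 1.0])   # every arm gets tried early on
```

Because each pulled arm's estimate drops below 5 while untouched arms stay at 5, the greedy rule cycles through all arms before settling, which is the exploration effect described above.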

March 10, 2026 · 2 min · 305 words · codefrydev

Chapter 3: Markov Decision Processes (MDPs)

Learning objectives
- Define an MDP: states, actions, transition probabilities, and rewards.
- Write transition probability matrices \(P(s' \mid s, a)\) for a small MDP.
- Recognize the Markov property: the next state and reward depend only on the current state and action.

Concept and real-world RL
A Markov Decision Process (MDP) is the standard mathematical model for RL: a set of states, a set of actions, transition probabilities \(P(s', r \mid s, a)\), and a discount factor. The Markov property says that the future (next state and reward) depends only on the current state and action, not on earlier history. That allows us to plan using the current state alone. Real-world examples include board games (state = board position), robot navigation (state = position/velocity), and queue control (state = queue lengths). Writing out \(P\) and reward tables for a tiny MDP is the first step toward value iteration and policy iteration. ...
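Writing out \(P\) for a tiny MDP might look like this (a hypothetical two-state, two-action example; the state names and probabilities are made up for illustration):

```python
import numpy as np

# P[a][s, s'] = probability of landing in state s' from state s under action a.
# Two states: s0 ("working") and s1 ("done", absorbing).
P = {
    "go":   np.array([[0.2, 0.8],
                      [0.0, 1.0]]),
    "wait": np.array([[1.0, 0.0],
                      [0.0, 1.0]]),
}

# Sanity check: each row of a transition matrix must sum to 1.
for a, mat in P.items():
    assert np.allclose(mat.sum(axis=1), 1.0), a
```

Tables like this, together with a reward table, are exactly the inputs that value iteration and policy iteration consume later in the course.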

March 10, 2026 · 3 min · 574 words · codefrydev

Bandits: UCB1

Learning objectives
- Understand the UCB1 action-selection rule and why it explores uncertain arms.
- Implement UCB1 on the 10-armed testbed and compare with epsilon-greedy.
- Interpret the exploration bonus \(c \sqrt{\ln t / N(a)}\).

Theory
UCB1 (Upper Confidence Bound) chooses the action that maximizes an upper bound on the expected reward:

\[ a_t = \arg\max_a \left[ Q(a) + c \sqrt{\frac{\ln t}{N(a)}} \right] \]

- \(Q(a)\) is the sample mean reward for arm \(a\).
- \(N(a)\) is how many times arm \(a\) has been pulled.
- \(t\) is the total number of pulls so far.
- \(c\) is a constant (e.g. 2) that controls exploration.

The term \(c \sqrt{\ln t / N(a)}\) is an exploration bonus: arms that have been pulled less often (small \(N(a)\)) get a higher bonus, so they are tried more. As \(N(a)\) grows, the bonus shrinks. So UCB1 explores systematically rather than randomly (unlike epsilon-greedy). ...
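The selection rule can be sketched in a few lines (the convention of pulling unpulled arms first, to avoid dividing by zero, is ours; the numbers in the usage example are illustrative):

```python
import numpy as np

def ucb1_action(q, counts, t, c=2.0):
    """Arm maximizing Q(a) + c*sqrt(ln t / N(a)); unpulled arms go first."""
    q = np.asarray(q, dtype=float)
    counts = np.asarray(counts, dtype=float)
    if (counts == 0).any():
        return int(np.argmin(counts))    # pull each arm once before using the bound
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q + bonus))

# Arm 1 has the lower sample mean but a large bonus from only 2 pulls:
a = ucb1_action(q=[1.0, 0.4], counts=[100, 2], t=102)   # picks arm 1
```

This shows the systematic-exploration point: the rarely pulled arm wins the argmax despite its lower sample mean, and its bonus shrinks as it accumulates pulls.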

March 10, 2026 · 2 min · 319 words · codefrydev

Chapter 4: The Reward Hypothesis

Learning objectives
- State the reward hypothesis: that goals can be captured by scalar reward signals.
- Design a reward function for a concrete task and anticipate unintended behavior.
- Identify and fix “reward hacking” (exploiting the reward design instead of the intended goal).

Concept and real-world RL
The reward hypothesis says that we can capture what we want the agent to do by defining a scalar reward at each step; the agent’s goal is to maximize cumulative reward. In practice, reward design is hard: the agent will optimize exactly what you reward, so oversimplified or buggy rewards lead to reward hacking (e.g. the agent finds a loophole that yields high reward without achieving the real goal). Examples: a robot rewarded for “distance to goal” might push the goal; a game agent rewarded for “score” might find a way to increment score without playing. Self-driving, robotics, and game AI all require careful reward shaping and testing for exploits. ...
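As a concrete illustration of the design pattern discussed above, here is a hypothetical gridworld reward function (the penalty magnitude and signature are our assumptions, not from the chapter). Paying out only at the true goal, rather than a proxy like distance, leaves fewer loopholes to hack:

```python
def reward(state, goal, step_penalty=-0.01):
    """+1 only at the true goal; a small per-step penalty discourages wandering.

    Hypothetical example: rewarding the actual goal instead of a proxy
    (e.g. "distance to goal") removes the push-the-goal loophole.
    """
    return 1.0 if state == goal else step_penalty

assert reward((3, 3), (3, 3)) == 1.0     # reaching the goal
assert reward((0, 0), (3, 3)) == -0.01   # any other step
```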

March 10, 2026 · 4 min · 709 words · codefrydev

Gridworld

Learning objectives
- Define a gridworld MDP: grid cells as states, actions (up/down/left/right), transitions, and terminal states.
- Understand how hitting the boundary keeps the agent in place (or wraps, depending on design).
- Use gridworld as the running example for policy evaluation and policy iteration.

What is Gridworld?
Gridworld is a simple MDP used throughout RL teaching and research. The environment is a grid of cells (e.g. 4×4 or 5×5). The state is the agent’s position \((i, j)\). Actions are typically up, down, left, right. Transitions: taking an action moves the agent one cell in that direction; if the move would go off the grid, the agent either stays in place (and usually receives the same step reward) or the world wraps, depending on the specification. ...
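The stay-in-place variant of the transition rule can be sketched like this (grid size, goal cell, and the +1/0 rewards are illustrative assumptions):

```python
# Minimal 4x4 gridworld step function with stay-in-place boundary handling.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action, rows=4, cols=4, goal=(3, 3)):
    """Apply one move; off-grid moves leave the agent where it was."""
    di, dj = MOVES[action]
    i, j = state[0] + di, state[1] + dj
    if not (0 <= i < rows and 0 <= j < cols):
        i, j = state                     # hit the boundary: stay put
    done = (i, j) == goal
    return (i, j), (1.0 if done else 0.0), done

next_state, r, done = step((0, 0), "up")   # off-grid: stays at (0, 0)
```

Swapping the boundary branch for modular arithmetic (`i % rows`, `j % cols`) gives the wrapping variant mentioned above.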

March 10, 2026 · 2 min · 356 words · codefrydev

Bandits: Thompson Sampling

Learning objectives
- Understand the Bayesian view: maintain a posterior over each arm’s reward distribution.
- Implement Thompson Sampling for Bernoulli and Gaussian rewards.
- Compare Thompson Sampling with epsilon-greedy and UCB1.

Theory (pt 1): Bernoulli bandits
Suppose each arm gives a reward of 0 or 1 (e.g. click or no click). We model arm \(a\) as Bernoulli with unknown mean \(\theta_a\). A convenient prior is Beta: \(\theta_a \sim \text{Beta}(\alpha_a, \beta_a)\). After observing \(s\) successes and \(f\) failures from arm \(a\), the posterior is \(\text{Beta}(\alpha_a + s, \beta_a + f)\). ...
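A minimal Bernoulli Thompson Sampling sketch, assuming a uniform Beta(1, 1) prior per arm (the class and function names, arm means, and seed are ours, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

class BetaArm:
    """Beta posterior over a Bernoulli arm's mean, starting from Beta(1, 1)."""
    def __init__(self):
        self.alpha, self.beta = 1.0, 1.0
    def sample(self):
        return rng.beta(self.alpha, self.beta)   # one posterior draw
    def update(self, r):                         # r is 0 or 1
        self.alpha += r                          # success count
        self.beta += 1 - r                       # failure count

def thompson_step(arms, true_means):
    """Draw from each posterior, pull the argmax arm, update its posterior."""
    a = int(np.argmax([arm.sample() for arm in arms]))
    r = int(rng.random() < true_means[a])
    arms[a].update(r)
    return a

arms = [BetaArm() for _ in range(3)]
for _ in range(500):
    thompson_step(arms, [0.2, 0.5, 0.8])
pulls = [arm.alpha + arm.beta - 2 for arm in arms]   # posterior counts = pulls
```

Exploration here is automatic: wide posteriors (few pulls) sometimes produce the largest draw, and posteriors concentrate as evidence accumulates.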

March 10, 2026 · 2 min · 401 words · codefrydev

Chapter 5: Value Functions

Learning objectives
- Define the state-value function \(V^\pi(s)\) as the expected return from state \(s\) under policy \(\pi\).
- Write and solve the Bellman expectation equation for a small MDP.
- Use the matrix form (linear system) when the MDP is finite.

Concept and real-world RL
The state-value function \(V^\pi(s)\) is the expected (discounted) return starting from state \(s\) and following policy \(\pi\). It answers: “How good is it to be in this state if I follow this policy?” In games, \(V(s)\) is like the expected outcome from a board position; in navigation, it is the expected cumulative reward from a location. The Bellman expectation equation expresses \(V^\pi\) in terms of immediate reward and the value of the next state; for finite MDPs it becomes a linear system \(V = r + \gamma P V\) that we can solve by matrix inversion or iteration. ...
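The linear-system view can be sketched with NumPy on a hypothetical two-state chain (the policy, transition matrix, and rewards are made-up illustrations):

```python
import numpy as np

# Bellman expectation in matrix form, V = r + gamma * P V, under a fixed policy:
# s0 moves to s1 with reward 1; s1 is absorbing with reward 0.
gamma = 0.9
P = np.array([[0.0, 1.0],
              [0.0, 1.0]])
r = np.array([1.0, 0.0])

# Rearranged to (I - gamma * P) V = r and solved directly.
V = np.linalg.solve(np.eye(2) - gamma * P, r)
# V[1] = 0 (absorbing, no reward) and V[0] = 1 + gamma * V[1] = 1
```

Iterating \(V \leftarrow r + \gamma P V\) from any starting vector converges to the same fixed point, which is the idea behind iterative policy evaluation later in the course.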

March 10, 2026 · 3 min · 620 words · codefrydev