Chapter 1: The Reinforcement Learning Framework

Learning objectives

- Identify the main components of an RL system: agent, environment, state, action, reward.
- Compute the discounted return for a sequence of rewards.
- Relate the gridworld to real tasks (e.g. navigation, games) where an agent receives delayed rewards.

Concept and real-world RL

In reinforcement learning, an agent interacts with an environment: at each step the agent observes a state, chooses an action, and receives a reward and a new state. The return is the sum of (discounted) rewards along a trajectory; the agent’s goal is to maximize this return. A gridworld is a simple environment whose states are cells and whose actions move the agent between them; it models robot navigation (e.g. a robot moving to a goal in a warehouse) and game AI (e.g. a character moving on a map). In robot navigation, the state might be (row, col); the actions are up/down/left/right; the reward is +1 at the goal and often 0 or a small per-step penalty. Discounting (\(\gamma < 1\)) makes future rewards worth less than immediate ones and keeps the return finite over long or infinite horizons. ...
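The discounted return described above can be computed directly. A minimal sketch in plain Python; the reward sequence and \(\gamma\) below are made-up illustration values for the navigation example (zero reward per step, +1 at the goal):

```python
# Discounted return: G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
def discounted_return(rewards, gamma):
    g = 0.0
    # Accumulate backward through time: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example trajectory: three steps with no reward, then +1 at the goal.
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=0.9))  # 0.9**3 = 0.729
```

Note how a smaller \(\gamma\) shrinks the contribution of the distant goal reward, which is exactly the sense in which the reward is "delayed".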

March 10, 2026 · 4 min · 748 words · codefrydev

Course Outline

This page lists every topic in the intended order: from welcome and bandits through MDPs, dynamic programming, Monte Carlo, temporal difference, approximation methods, projects, and appendix. Follow this outline for a clear basic-to-advanced path. Each item links to the relevant curriculum chapter, prerequisite, or dedicated page.

Welcome

| Topic | Where to find it |
| --- | --- |
| Introduction | Home |
| Course Outline and Big Picture | This page |
| Where to get the Code | Dedicated page |
| How to Succeed in this Course | Dedicated page |

Warmup — Multi-Armed Bandit

| Topic | Where to find it |
| --- | --- |
| Section Introduction: The Explore-Exploit Dilemma | Chapter 2: Multi-Armed Bandits |
| Applications of the Explore-Exploit Dilemma | Chapter 2 |
| Epsilon-Greedy Theory | Chapter 2 |
| Calculating a Sample Mean (pt 1) | Math for RL: Probability |
| Epsilon-Greedy Beginner’s Exercise Prompt | Chapter 2 |
| Designing Your Bandit Program | Chapter 2 |
| Epsilon-Greedy in Code | Chapter 2 |
| Comparing Different Epsilons | Chapter 2 |
| Optimistic Initial Values Theory | Chapter 2 (hints); Bandits: Optimistic Initial Values |
| Optimistic Initial Values Beginner’s Exercise Prompt | Bandits: Optimistic Initial Values |
| Optimistic Initial Values Code | Bandits: Optimistic Initial Values |
| UCB1 Theory | Dedicated page |
| UCB1 Beginner’s Exercise Prompt | Bandits: UCB1 |
| UCB1 Code | Bandits: UCB1 |
| Bayesian Bandits / Thompson Sampling Theory (pt 1) | Dedicated page |
| Bayesian Bandits / Thompson Sampling Theory (pt 2) | Bandits: Thompson Sampling |
| Thompson Sampling Beginner’s Exercise Prompt | Bandits: Thompson Sampling |
| Thompson Sampling Code | Bandits: Thompson Sampling |
| Thompson Sampling With Gaussian Reward Theory | Bandits: Thompson Sampling |
| Thompson Sampling With Gaussian Reward Code | Bandits: Thompson Sampling |
| Exercise on Gaussian Rewards | Bandits: Thompson Sampling |
| Why don’t we just use a library? | Dedicated page |
| Nonstationary Bandits | Dedicated page |
| Bandit Summary, Real Data, and Online Learning | Chapter 2; Bandits: Nonstationary |
| (Optional) Alternative Bandit Designs | Chapter 2 |

High-Level Overview of Reinforcement Learning

| Topic | Where to find it |
| --- | --- |
| What is Reinforcement Learning? | Chapter 1 |
| From Bandits to Full Reinforcement Learning | Chapter 1, Chapter 2 |
| Markov Decision Processes | Chapter 3 |

MDP Section

| Topic | Where to find it |
| --- | --- |
| MDP Section Introduction | Chapter 3: MDPs |
| Gridworld | Dedicated page |
| Choosing Rewards | Dedicated page |
| The Markov Property | Chapter 3 |
| Markov Decision Processes (MDPs) | Chapter 3 |
| Future Rewards | Chapter 4: Reward Hypothesis, Chapter 5: Value Functions |
| Value Functions | Chapter 5 |
| The Bellman Equation (pt 1–3) | Chapter 6: The Bellman Equations |
| Bellman Examples | Chapter 6 |
| Optimal Policy and Optimal Value Function (pt 1–2) | Chapter 6 |
| MDP Summary | Chapter 3 – Chapter 6 |

Dynamic Programming

| Topic | Where to find it |
| --- | --- |
| Dynamic Programming Section Introduction | Volume 1 |
| Iterative Policy Evaluation | Chapter 7 |
| Designing Your RL Program | Chapter 7 |
| Gridworld in Code | Dedicated page |
| Iterative Policy Evaluation in Code | Dedicated page |
| Windy Gridworld | Dedicated page |
| Iterative Policy Evaluation for Windy Gridworld | Windy Gridworld |
| Policy Improvement | Chapter 8: Policy Iteration |
| Policy Iteration | Chapter 8 |
| Policy Iteration in Code | Chapter 8; DP code walkthrough |
| Policy Iteration in Windy Gridworld | Windy Gridworld |
| Value Iteration | Chapter 9 |
| Value Iteration in Code | Chapter 9; DP code walkthrough |
| Dynamic Programming Summary | Chapter 10: Limitations of DP |

Monte Carlo

| Topic | Where to find it |
| --- | --- |
| Monte Carlo Intro | Chapter 11 |
| Monte Carlo Policy Evaluation | Chapter 11 |
| Monte Carlo Policy Evaluation in Code | Dedicated page |
| Monte Carlo Control | Chapter 11 |
| Monte Carlo Control in Code | Monte Carlo in Code |
| Monte Carlo Control without Exploring Starts | Chapter 11; Monte Carlo in Code |
| Monte Carlo Control without Exploring Starts in Code | Monte Carlo in Code |
| Monte Carlo Summary | Chapter 11 |

Temporal Difference Learning

| Topic | Where to find it |
| --- | --- |
| Temporal Difference Introduction | Chapter 12 |
| TD(0) Prediction | Chapter 12 |
| TD(0) Prediction in Code | Dedicated page |
| SARSA | Chapter 13 |
| SARSA in Code | TD, SARSA, Q-Learning in Code |
| Q-Learning | Chapter 14 |
| Q-Learning in Code | TD, SARSA, Q-Learning in Code |
| TD Learning Section Summary | Chapter 12 – Chapter 14 |

Approximation Methods

| Topic | Where to find it |
| --- | --- |
| Approximation Methods Section Introduction | Volume 3 |
| Linear Models for Reinforcement Learning | Chapter 21 |
| Feature Engineering | Dedicated page |
| Approximation Methods for Prediction | Chapter 21 |
| Approximation Methods for Prediction Code | Chapter 21 |
| Approximation Methods for Control | Chapter 22 – Chapter 30 |
| Approximation Methods for Control Code | Volume 3 |
| CartPole | Dedicated page |
| CartPole Code | CartPole |
| Approximation Methods Exercise | Volume 3 chapters |
| Approximation Methods Section Summary | Volume 3 |

Interlude: Common Beginner Questions

| Topic | Where to find it |
| --- | --- |
| This Course vs. RL Book: What’s the Difference? | Dedicated page |

Stock Trading Project with Reinforcement Learning (dedicated section)

| Topic | Where to find it |
| --- | --- |
| Beginners, halt! Stop here if you skipped ahead | Stock Trading intro |
| Stock Trading Project Section Introduction | Stock Trading |
| Data and Environment | Stock Trading: Data and Environment |
| How to Model Q for Q-Learning | Stock Trading: How to Model Q |
| Design of the Program | Stock Trading: Design |
| Code pt 1–4 | Stock Trading |
| Stock Trading Project Discussion | Stock Trading |

Appendix / FAQ

| Topic | Where to find it |
| --- | --- |
| What is the Appendix? | Appendix index |
| Setting Up Your Environment | Dedicated page |
| Pre-Installation Check | Setting Up Your Environment |
| Anaconda Environment Setup | Dedicated page |
| How to install Numpy, Scipy, Matplotlib, Pandas, IPython, Theano, TensorFlow | Installing Libraries |
| How to Code by Yourself (part 1) | Dedicated page |
| How to Code by Yourself (part 2) | Dedicated page |
| Proof that using Jupyter Notebook is the same as not using it | Appendix |
| Python 2 vs Python 3 | Prerequisites: Python |
| Effective Learning Strategies | Dedicated page |
| How to Succeed in this Course (Long Version) | Dedicated page |
| Is this for Beginners or Experts? Academic or Practical? Pace | Dedicated page |
| Machine Learning and AI Prerequisite Roadmap (pt 1–2) | Dedicated page |

Part 2 — Advanced (Volumes 4–10)

After the topics above, the curriculum continues with 70 more chapters in order: ...

March 10, 2026 · 5 min · 1002 words · codefrydev

CartPole

Learning objectives

- Understand the CartPole environment: state (cart position, cart velocity, pole angle, pole angular velocity), actions (push left / push right), and reward (+1 per step until termination).
- Implement a solution using linear function approximation (e.g. tile coding or simple features) with semi-gradient SARSA or Q-learning.
- Optionally, solve it with a small neural network (e.g. DQN-style), as in later chapters.

What is CartPole?

CartPole (also called the Inverted Pendulum) is a classic control task in OpenAI Gym / Gymnasium. A pole is attached to a cart that moves along a track. The state is continuous: cart position \(x\), cart velocity \(\dot{x}\), pole angle \(\theta\), and pole angular velocity \(\dot{\theta}\). Actions are discrete: 0 = push left, 1 = push right. The reward is +1 for every step until the episode ends. The episode ends when the pole angle leaves a fixed range (e.g. \(\pm 12°\)), when the cart leaves the track (if bounded), or after a maximum step count (e.g. 500). The goal is therefore to keep the pole upright as long as possible, i.e. to maximize the total reward (the number of steps). ...
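The semi-gradient Q-learning update with linear features can be sketched without a simulator. A minimal NumPy sketch, assuming one weight vector per action and the raw 4-dimensional state plus a bias as features; the step sizes and the sample transition are made-up illustration values (in practice the states would come from Gymnasium's `CartPole-v1`):

```python
import numpy as np

# Linear Q: Q(s, a) = w[a] @ phi(s), one weight vector per action.
def phi(state):
    return np.append(state, 1.0)  # raw state + bias term, shape (5,)

n_actions = 2
w = np.zeros((n_actions, 5))
rng = np.random.default_rng(0)

def q_values(state):
    return w @ phi(state)  # shape (2,): Q(s, left), Q(s, right)

def epsilon_greedy(state, eps=0.1):
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(q_values(state)))

def q_learning_update(s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    # Semi-gradient: the target uses max over next actions but is treated
    # as a constant; the gradient is taken only through Q(s, a) = w[a] @ phi(s).
    target = r if done else r + gamma * np.max(q_values(s_next))
    td_error = target - w[a] @ phi(s)
    w[a] += alpha * td_error * phi(s)
    return td_error

# One made-up transition (x, x_dot, theta, theta_dot), not from a real simulator:
s = np.array([0.0, 0.1, 0.02, -0.1])
s_next = np.array([0.002, 0.15, 0.018, -0.12])
td = q_learning_update(s, a=1, r=1.0, s_next=s_next, done=False)
```

With zero-initialized weights the first TD error equals the reward, and only the chosen action's weight vector moves; a training loop would just repeat `epsilon_greedy` and `q_learning_update` over environment steps.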

March 10, 2026 · 3 min · 451 words · codefrydev

Stock Trading Project with Reinforcement Learning

Beginners, halt! If you skipped ahead: this project assumes you have completed the core curriculum through temporal difference learning and approximation methods (e.g. Volume 2 and Volume 3, or equivalent). You should understand Q-learning, state and action spaces, and at least linear function approximation. If you have not done that yet, start with the Learning path and Course outline.

Stock Trading Project Section Introduction

This project walks you through building a simplified RL-based stock trading agent: you define an environment (state = market/position info, actions = buy/sell/hold), a reward (e.g. profit or risk-adjusted return), and train an agent using Q-learning with function approximation. The goal is to understand how to go from theory (Q-learning, function approximation) to a concrete design and code. ...
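The state/action/reward design described above can be sketched as a toy environment. This is a minimal illustration, not the course's exact design: the class name, the price-window state, and the change-in-portfolio-value reward are all assumptions made up for this sketch.

```python
import numpy as np

# Toy single-stock environment. State: last `window` prices normalized by the
# most recent one, plus the current position. Actions: 0=sell, 1=hold, 2=buy.
class ToyTradingEnv:
    def __init__(self, prices, window=3):
        self.prices = np.asarray(prices, dtype=float)
        self.window = window
        self.reset()

    def reset(self):
        self.t = self.window   # first decision point with a full price window
        self.position = 0      # 0 = flat, 1 = long one share
        self.cash = 0.0
        return self._state()

    def _state(self):
        recent = self.prices[self.t - self.window:self.t]
        return np.append(recent / recent[-1], self.position)

    def step(self, action):
        price = self.prices[self.t]
        old_value = self.cash + self.position * price
        if action == 2 and self.position == 0:    # buy one share
            self.position, self.cash = 1, self.cash - price
        elif action == 0 and self.position == 1:  # sell the share
            self.position, self.cash = 0, self.cash + price
        self.t += 1
        done = self.t >= len(self.prices)
        mark = self.prices[min(self.t, len(self.prices) - 1)]
        # Reward: change in mark-to-market portfolio value over this step.
        reward = (self.cash + self.position * mark) - old_value
        return (None if done else self._state()), reward, done

# Buy low, hold, then sell high on a made-up price series.
env = ToyTradingEnv([10, 10, 10, 10, 12, 15])
total = 0.0
for action in (2, 1, 0):   # buy, hold, sell
    _, reward, done = env.step(action)
    total += reward
print(total)  # 5.0: bought at 10, sold at 15
```

Because each step's reward is the change in portfolio value, the rewards telescope: the return over an episode equals the total profit, which is what the Q-learner ends up maximizing.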

March 10, 2026 · 4 min · 717 words · codefrydev