Chapter 1: The Reinforcement Learning Framework

Learning objectives
- Identify the main components of an RL system: agent, environment, state, action, reward.
- Compute the discounted return for a sequence of rewards.
- Relate the gridworld to real tasks (e.g. navigation, games) where an agent gets delayed reward.

Concept and real-world RL
In reinforcement learning, an agent interacts with an environment: at each step the agent is in a state, chooses an action, and receives a reward and a new state. The return is the sum of (discounted) rewards along a trajectory; the agent’s goal is to maximize this return. A gridworld is a simple environment where states are cells and actions move the agent; it models robot navigation (e.g. a robot moving to a goal in a warehouse) and game AI (e.g. a character moving on a map). In robot navigation, the state might be (row, col), the actions are up/down/left/right, and the reward is +1 at the goal and often 0 or a small penalty per step. Discounting (\(\gamma < 1\)) makes future rewards worth less than immediate ones and keeps the return finite over long or infinite horizons. ...
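The discounted return described above can be sketched in a few lines of Python (the function name `discounted_return` is ours, not from the chapter):

```python
def discounted_return(rewards, gamma):
    # G_0 = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Gridworld-style episode: 0 reward per step, +1 on reaching the goal,
# so only the final reward contributes, scaled by gamma^2.
g0 = discounted_return([0, 0, 1], 0.9)  # 0.9**2 * 1 = 0.81
```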

March 10, 2026 · 4 min · 748 words · codefrydev

Chapter 3: Markov Decision Processes (MDPs)

Learning objectives
- Define an MDP: states, actions, transition probabilities, and rewards.
- Write transition probability matrices \(P(s' \mid s, a)\) for a small MDP.
- Recognize the Markov property: the next state and reward depend only on the current state and action.

Concept and real-world RL
A Markov Decision Process (MDP) is the standard mathematical model for RL: a set of states, a set of actions, transition probabilities \(P(s', r \mid s, a)\), and a discount factor. The Markov property says that the future (next state and reward) depends only on the current state and action, not on earlier history. That allows us to plan using the current state alone. Real-world examples include board games (state = board position), robot navigation (state = position/velocity), and queue control (state = queue lengths). Writing out \(P\) and reward tables for a tiny MDP is the first step toward value iteration and policy iteration. ...
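One convenient way to write out \(P(s' \mid s, a)\) for a tiny MDP is a nested dictionary, one probability row per (state, action) pair. The two-state MDP and the names `s0`, `s1`, `stay`, `go` below are hypothetical, chosen only for illustration:

```python
# P[s][a][s'] = probability of landing in s' after taking a in s.
P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.5, "s1": 0.5}},
}

def is_valid(P):
    # Each row P(. | s, a) must be a probability distribution: sum to 1.
    return all(
        abs(sum(dist.values()) - 1.0) < 1e-9
        for actions in P.values()
        for dist in actions.values()
    )
```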

March 10, 2026 · 3 min · 574 words · codefrydev

Gridworld

Learning objectives
- Define a gridworld MDP: grid cells as states, actions (up/down/left/right), transitions, and terminal states.
- Understand how hitting the boundary keeps the agent in place (or wraps, depending on design).
- Use gridworld as the running example for policy evaluation and policy iteration.

What is Gridworld?
Gridworld is a simple MDP used throughout RL teaching and research. The environment is a grid of cells (e.g. 4×4 or 5×5). The state is the agent’s position \((i, j)\). Actions are typically up, down, left, and right. Transitions: taking an action moves the agent one cell in that direction; if the move would go off the grid, the agent either stays in place (usually receiving the same step reward) or the world wraps around, depending on the specification. ...
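A minimal sketch of the transition function under the "stay in place at the boundary" convention, assuming a 4×4 grid with \((0, 0)\) in the top-left corner (the `step` helper is ours, not from the post):

```python
def step(state, action, n=4):
    # Deterministic gridworld transition on an n x n grid.
    # Moves that would leave the grid are clipped: the agent stays put.
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    di, dj = moves[action]
    i, j = state
    ni = min(max(i + di, 0), n - 1)
    nj = min(max(j + dj, 0), n - 1)
    return (ni, nj)
```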

March 10, 2026 · 2 min · 356 words · codefrydev

Choosing Rewards

Learning objectives
- Understand how the reward choice affects optimal behavior (what the agent will try to maximize).
- Use step penalties and terminal rewards in gridworld to encourage short paths or goal reaching.
- Avoid common pitfalls: reward hacking and unintended incentives.

Why rewards matter
The agent’s goal in an MDP is to maximize cumulative (often discounted) reward, so the reward function defines the task. Changing the rewards changes what is “optimal.” Design rewards so that the behavior you want is exactly what maximizes total reward. ...
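A sketch of the step-penalty-plus-terminal-bonus pattern for gridworld (the default values here are illustrative, not prescribed by the post):

```python
def reward(next_state, goal, step_penalty=-0.01, goal_reward=1.0):
    # A small per-step penalty makes shorter paths accumulate more total
    # reward; the terminal bonus defines the task (reach the goal).
    return goal_reward if next_state == goal else step_penalty
```

With `step_penalty=0.0` every goal-reaching path has the same return under \(\gamma = 1\), so the penalty (or discounting) is what makes short paths optimal.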

March 10, 2026 · 2 min · 354 words · codefrydev

Chapter 6: The Bellman Equations

Learning objectives
- Derive the Bellman optimality equation for \(Q^*(s,a)\) from the definition of the optimal action value.
- Contrast the optimality equation (max over actions) with the expectation equation (average over actions under \(\pi\)).
- Explain why the optimality equations are nonlinear and how algorithms (e.g. value iteration) handle them.

Concept and real-world RL
The optimal action-value function \(Q^*(s,a)\) is the expected return from state \(s\), taking action \(a\), then acting optimally. The Bellman optimality equation for \(Q^*\) states that \(Q^*(s,a)\) equals the expected immediate reward plus \(\gamma\) times the maximum over next-state action values (not an average under a policy). This “max” makes the system nonlinear: the optimal policy is greedy with respect to \(Q^*\), and \(Q^*\) is the fixed point of this equation. Value iteration and Q-learning are built on this; in practice, we approximate \(Q^*\) with tables or function approximators. ...
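The fixed-point view above translates directly into a tabular value-iteration sketch. The two-state MDP (`s0`, `s1`) and the expected-reward table `R[s][a]` are hypothetical examples of ours, not from the chapter:

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    # Iterate the Bellman optimality backup until the largest change is
    # below tol:  Q(s,a) <- sum_s' P(s'|s,a) * (R(s,a) + gamma * max_a' Q(s',a'))
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    while True:
        delta = 0.0
        for s in P:
            for a in P[s]:
                q = sum(p * (R[s][a] + gamma * max(Q[s2].values()))
                        for s2, p in P[s][a].items())
                delta = max(delta, abs(q - Q[s][a]))
                Q[s][a] = q
        if delta < tol:
            return Q

# s1 is absorbing with zero reward; "go" from s0 pays 1 and ends in s1.
P = {"s0": {"go": {"s1": 1.0}, "stay": {"s0": 1.0}},
     "s1": {"stay": {"s1": 1.0}}}
R = {"s0": {"go": 1.0, "stay": 0.0}, "s1": {"stay": 0.0}}
Q = value_iteration(P, R)
```

Here \(Q^*(s_0, \text{go}) = 1\) and \(Q^*(s_0, \text{stay}) = 0 + \gamma \cdot 1 = 0.9\): the "max" in the backup is exactly what makes "stay" inherit the value of the better action one step later.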

March 10, 2026 · 3 min · 589 words · codefrydev

Windy Gridworld

Learning objectives
- Understand the Windy Gridworld environment: movement is shifted by a column-dependent wind.
- Implement the transition model and run iterative policy evaluation and policy iteration on it.
- Compare with the standard gridworld (no wind).

Theory
Windy Gridworld (Sutton & Barto) is a rectangular grid (e.g. 7×10) with:
- States: cell positions \((row, col)\).
- Actions: up, down, left, right (four actions).
- Wind: each column has a fixed wind strength (a non-negative integer). The wind blows upward, so with row 0 at the top, the resulting row is shifted toward smaller indices by the wind strength. From cell \((r, c)\), action “up” takes you to \((r - 1 - \text{wind}[c], c)\), “down” to \((r + 1 - \text{wind}[c], c)\), and so on. Positions that would leave the grid are clipped to the boundary.
- Terminal state: one goal cell. Typical reward: -1 per step until the goal.

So the same action can lead to different next states depending on the column (wind). The MDP is still finite and deterministic given the state and action (the wind is fixed per column). This makes the problem slightly harder than a plain gridworld and is a good testbed for policy evaluation and policy iteration. ...
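The transition model above can be sketched as follows, assuming row 0 is the top row so upward wind decreases the row index; `WIND` is the column-strength profile from the Sutton & Barto example, and the helper name `windy_step` is ours:

```python
def windy_step(state, action, wind, n_rows=7, n_cols=10):
    # Windy Gridworld transition: the chosen move plus an upward shift of
    # wind[c] rows, where c is the current column; clip to the grid.
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    dr, dc = moves[action]
    r, c = state
    nr = min(max(r + dr - wind[c], 0), n_rows - 1)
    nc = min(max(c + dc, 0), n_cols - 1)
    return (nr, nc)

WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]  # wind strength per column
```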

March 10, 2026 · 2 min · 392 words · codefrydev

Chapter 93: RL for Algorithmic Trading

Learning objectives
- Simulate a simple stock market with one asset (e.g. the price follows a random walk or a simple mean-reverting process).
- Design an MDP: state = (price, position, cash, or features); actions = buy / sell / hold (possibly with size); reward = profit (or risk-adjusted return).
- Train an agent (e.g. DQN or PPO) on this MDP and evaluate its Sharpe ratio (mean return / std of returns over episodes or over time).
- Discuss risk management: position limits, drawdown, transaction costs; how the reward and state design affect behavior.
- Relate the exercise to the trading and finance anchor scenarios (state = market + portfolio, action = trade, reward = profit or Sharpe).

Concept and real-world RL ...
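Two of the building blocks above (the random-walk price simulator and the Sharpe ratio metric) can be sketched as below; this is an illustrative toy, not a market model, and the function names are ours:

```python
import random

def simulate_prices(n_steps=250, p0=100.0, sigma=1.0, seed=0):
    # Gaussian random-walk price path (toy model for the exercise).
    rng = random.Random(seed)
    prices = [p0]
    for _ in range(n_steps):
        prices.append(prices[-1] + rng.gauss(0.0, sigma))
    return prices

def sharpe_ratio(returns):
    # Mean return over standard deviation of returns (no annualization,
    # zero risk-free rate); 0.0 if returns have no variance.
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / n
    return mean / var ** 0.5 if var > 0 else 0.0
```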

March 10, 2026 · 4 min · 652 words · codefrydev

Phase 3 Foundations Quiz

Use this quiz after completing Volume 1 and Volume 2 (or the Phase 3 mini-project). If you can answer at least 12 of 15 correctly, you are ready for Phase 4 and Volume 3.

1. RL framework
Q: Name the four main components of an RL system (agent, environment, and two more). What is a state?
Answer: Agent, environment, action, reward. State: a representation of the current situation that the agent uses to choose actions.

2. Return
Q: For rewards [0, 0, 1] and \(\gamma = 0.9\), compute the discounted return \(G_0\) from step 0. ...

March 10, 2026 · 5 min · 876 words · codefrydev

RL Framework

This page covers the core RL framework you need for the preliminary assessment: the four main components, the Markov property, exploration vs exploitation, and the discount factor. Back to Preliminary.

Why this matters for RL
Every RL problem is defined by who acts (agent), what they interact with (environment), what they observe (state), what they can do (actions), and what feedback they get (reward). The Markov property and the discount factor shape how we define value functions and algorithms. Exploration vs exploitation is the central tension in learning from experience. ...
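The exploration-exploitation tension mentioned above is most often resolved with an \(\epsilon\)-greedy rule; here is a minimal sketch (the helper name `epsilon_greedy` is ours, not from the page):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    # With probability epsilon take a random action (explore);
    # otherwise take the action with the highest estimated value (exploit).
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0` this is purely greedy; with `epsilon=1` it is purely random, and intermediate values trade off the two.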

March 10, 2026 · 6 min · 1198 words · codefrydev