Chapter 1: The Reinforcement Learning Framework

Learning objectives

- Identify the main components of an RL system: agent, environment, state, action, reward.
- Compute the discounted return for a sequence of rewards.
- Relate the gridworld to real tasks (e.g. navigation, games) where an agent gets delayed reward.

Concept and real-world RL

In reinforcement learning, an agent interacts with an environment: at each step the agent is in a state, chooses an action, and receives a reward and a new state. The return is the sum of (discounted) rewards along a trajectory; the agent's goal is to maximize this return. A gridworld is a simple environment where states are cells and actions move the agent; it models robot navigation (e.g. a robot moving to a goal in a warehouse) and game AI (e.g. a character moving on a map). In robot navigation, the state might be (row, col); the action is up/down/left/right; the reward is +1 at the goal and often 0 or a small penalty per step. Discounting (\(\gamma < 1\)) makes future rewards worth less than immediate ones and keeps the return finite over long or infinite horizons. ...
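The discounted return described above can be sketched in a few lines; the reward sequence and \(\gamma\) below are illustrative choices, not values from the post.

```python
def discounted_return(rewards, gamma=0.9):
    """Return G = r_0 + gamma*r_1 + gamma^2*r_2 + ...

    Accumulating backwards avoids computing powers of gamma explicitly.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A goal reward of +1 arriving two steps in the future is worth gamma^2.
g = discounted_return([0, 0, 1], gamma=0.9)
```

With \(\gamma = 0.9\) the delayed +1 contributes \(0.9^2 = 0.81\), which is the sense in which discounting makes future rewards worth less than immediate ones.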

March 10, 2026 · 4 min · 748 words · codefrydev

Gridworld

Learning objectives

- Define a gridworld MDP: grid cells as states, actions (up/down/left/right), transitions, and terminal states.
- Understand how hitting the boundary keeps the agent in place (or wraps, depending on design).
- Use gridworld as the running example for policy evaluation and policy iteration.

What is Gridworld?

Gridworld is a simple MDP used throughout RL teaching and research. The environment is a grid of cells (e.g. 4×4 or 5×5). The state is the agent's position \((i, j)\). Actions are typically up, down, left, right. Transitions: taking an action moves the agent one cell in that direction; if the move would go off the grid, the agent either stays in place (and usually receives the same step reward) or the world wraps, depending on the specification. ...
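The transition rule above (deterministic moves, boundary keeps the agent in place) can be sketched as a single step function; the 4×4 size and the "stay" variant are the assumptions, matching one of the designs described.

```python
# Row/column deltas for the four gridworld actions.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action, n=4):
    """Deterministic gridworld transition on an n x n grid.

    Off-grid moves leave the agent in place (the "stay" variant);
    a wrapping world would instead take coordinates modulo n.
    """
    i, j = state
    di, dj = MOVES[action]
    ni, nj = i + di, j + dj
    if 0 <= ni < n and 0 <= nj < n:
        return (ni, nj)
    return (i, j)  # hit the boundary: stay in place
```

The wrapping variant would replace the boundary check with `((i + di) % n, (j + dj) % n)`.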

March 10, 2026 · 2 min · 356 words · codefrydev

Chapter 7: Dynamic Programming — Policy Evaluation

Learning objectives

- Implement iterative policy evaluation (Bellman expectation updates) for a finite MDP.
- Use a gridworld with terminal states and interpret the resulting value function.
- Decide when to stop iterating (e.g. max change below a threshold).

Concept and real-world RL

Policy evaluation computes \(V^\pi\) for a given policy \(\pi\). Iterative policy evaluation starts from an arbitrary \(V\) (e.g. all zeros) and repeatedly applies the Bellman expectation update: \(V(s) \leftarrow \sum_a \pi(a\mid s) \sum_{s',r} P(s',r\mid s,a)\,[r + \gamma V(s')]\). This converges to \(V^\pi\) for finite MDPs. In a gridworld, values spread from terminal states (goal or trap); the result shows "how good" each cell is under the policy. This is the building block for policy iteration (evaluate, then improve the policy). ...
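A minimal sketch of the update above, assuming the classic 4×4 layout (equiprobable random policy, reward −1 per step, terminals at (0,0) and (3,3)); the layout is an illustrative assumption, not stated in this summary.

```python
N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def next_state(s, a):
    """Deterministic move; off-grid actions leave the agent in place."""
    ni, nj = s[0] + a[0], s[1] + a[1]
    return (ni, nj) if 0 <= ni < N and 0 <= nj < N else s

def policy_evaluation(gamma=1.0, theta=1e-6):
    """Bellman expectation updates for the equiprobable random policy.

    Stops when the largest per-sweep change falls below theta.
    """
    V = {(i, j): 0.0 for i in range(N) for j in range(N)}
    while True:
        delta = 0.0
        for s in V:
            if s in TERMINALS:
                continue  # terminal values stay 0
            v = sum(0.25 * (-1 + gamma * V[next_state(s, a)]) for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V
```

The stopping rule is exactly the "max change below a threshold" criterion from the objectives; in-place (Gauss–Seidel) updates still converge to \(V^\pi\).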

March 10, 2026 · 4 min · 703 words · codefrydev

Chapter 9: Dynamic Programming — Value Iteration

Learning objectives

- Implement value iteration: repeatedly apply the Bellman optimality update for \(V\).
- Extract the optimal policy as greedy with respect to the converged \(V\).
- Relate value iteration to policy iteration (one sweep of "improvement" per state, no full evaluation).

Concept and real-world RL

Value iteration updates the state-value function using the Bellman optimality equation: \(V(s) \leftarrow \max_a \sum_{s',r} P(s',r\mid s,a)\,[r + \gamma V(s')]\). It does not maintain an explicit policy; after convergence, the optimal policy is greedy with respect to \(V\). Value iteration is simpler than full policy iteration (no inner evaluation loop) and converges to \(V^*\). It is used in planning when the model is known; in large or continuous spaces we approximate \(V\) or \(Q\) with function approximators and use approximate dynamic programming or model-free methods. ...
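A sketch of the optimality update plus greedy extraction, under the same illustrative assumptions as before (4×4 grid, deterministic moves, reward −1 per step, terminals at the opposite corners):

```python
N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def next_state(s, a):
    ni, nj = s[0] + a[0], s[1] + a[1]
    return (ni, nj) if 0 <= ni < N and 0 <= nj < N else s

def value_iteration(gamma=1.0, theta=1e-6):
    """Bellman optimality sweeps, then a greedy policy read off V."""
    V = {(i, j): 0.0 for i in range(N) for j in range(N)}
    while True:
        delta = 0.0
        for s in V:
            if s in TERMINALS:
                continue
            v = max(-1 + gamma * V[next_state(s, a)] for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    # No policy is kept during the sweeps; extract it greedily at the end.
    greedy = {s: max(ACTIONS, key=lambda a: -1 + gamma * V[next_state(s, a)])
              for s in V if s not in TERMINALS}
    return V, greedy
```

Compared with policy iteration, the `max` inside the sweep replaces the whole inner evaluation loop, which is the "one sweep of improvement per state" relationship named in the objectives.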

March 10, 2026 · 3 min · 624 words · codefrydev

Dynamic Programming: Gridworld in Code

Learning objectives

- Implement a 4×4 gridworld environment (states, actions, transitions, rewards) in code.
- Implement iterative policy evaluation and stop when values converge.
- Implement policy iteration (evaluate then improve) and optionally value iteration.

Gridworld in code

States: use a 4×4 grid. States can be (row, col) or a flat index. Terminal states (0,0) and (3,3) have value 0 and are not updated. Actions: 0=up, 1=down, 2=left, 3=right. Moving off the grid leaves the agent in place. ...
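The evaluate-then-improve loop can be sketched on exactly the layout this post specifies (terminals (0,0) and (3,3), actions 0–3, off-grid moves stay in place). Reward −1 per step and \(\gamma = 0.9\) are assumptions for illustration; \(\gamma < 1\) keeps evaluation finite even for an arbitrary starting policy that loops in place.

```python
N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # 0=up, 1=down, 2=left, 3=right

def next_state(s, a):
    ni, nj = s[0] + a[0], s[1] + a[1]
    return (ni, nj) if 0 <= ni < N and 0 <= nj < N else s

def evaluate(policy, gamma, theta=1e-8):
    """Iterative policy evaluation for a deterministic policy."""
    V = {(i, j): 0.0 for i in range(N) for j in range(N)}
    while True:
        delta = 0.0
        for s, a in policy.items():
            v = -1 + gamma * V[next_state(s, ACTIONS[a])]
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def policy_iteration(gamma=0.9):
    """Alternate evaluation and greedy improvement until stable."""
    states = [(i, j) for i in range(N) for j in range(N)
              if (i, j) not in TERMINALS]
    policy = {s: 0 for s in states}  # arbitrary start: always "up"
    while True:
        V = evaluate(policy, gamma)
        stable = True
        for s in states:
            best = max(range(4),
                       key=lambda k: -1 + gamma * V[next_state(s, ACTIONS[k])])
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```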

March 10, 2026 · 2 min · 390 words · codefrydev

Chapter 17: Planning and Learning with Tabular Methods

Learning objectives

- Implement a simple model: store \((s,a) \rightarrow (r, s')\) from experience.
- Implement Dyna-Q: after each real env step, do \(k\) extra Q-updates using random \((s,a)\) pairs from the model (simulated experience).
- Compare sample efficiency: Dyna-Q (planning + learning) vs Q-learning (learning only).

Concept and real-world RL

Model-based methods use a learned or given model of the environment (transitions and rewards). Dyna-Q learns a tabular model from real experience: when you observe \((s,a,r,s')\), store it. Then, in addition to updating \(Q(s,a)\) from the real transition, you replay random \((s,a)\) pairs from the model, look up \((r,s')\), and do a Q-learning update. This gives more learning per real step (planning). In real applications, learned models are used in model-based RL (e.g. world models, MuZero) to reduce sample complexity; the key idea is reusing past experience for extra updates. ...
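The "real update, store, then k replayed updates" cycle can be sketched as one function; the hyperparameters (`alpha`, `gamma`, `k`) and the four-action table are illustrative assumptions.

```python
import random
from collections import defaultdict

def dyna_q_update(Q, model, s, a, r, s2,
                  alpha=0.1, gamma=0.95, k=5, actions=(0, 1, 2, 3)):
    """One Dyna-Q step: real Q-learning update, model store, k planning updates."""
    # 1. Q-learning update from the real transition (s, a, r, s2).
    Q[s][a] += alpha * (r + gamma * max(Q[s2][b] for b in actions) - Q[s][a])
    # 2. Store the transition in the tabular deterministic model.
    model[(s, a)] = (r, s2)
    # 3. Planning: k extra updates from randomly replayed (s, a) pairs.
    for _ in range(k):
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        Q[ps][pa] += alpha * (pr + gamma * max(Q[ps2][b] for b in actions)
                              - Q[ps][pa])

Q = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
model = {}
```

Setting `k=0` recovers plain Q-learning, which is the sample-efficiency comparison the objectives describe: same real experience, fewer updates.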

March 10, 2026 · 3 min · 583 words · codefrydev

Chapter 53: Planning with Known Models

Learning objectives

- Implement a planner using breadth-first search (BFS) for a gridworld with known deterministic dynamics.
- Recover the optimal policy (path to goal) and compare with dynamic programming (value iteration) in terms of computation and result.
- Relate BFS to shortest-path planning in robot navigation.

Concept and real-world RL

When the model is known and deterministic, we can plan without learning: BFS finds the shortest path from start to goal; value iteration computes optimal values for all states. In robot navigation (grid or graph), BFS is used for pathfinding; DP is used when we need values everywhere (e.g. for reward shaping). Both assume the model is correct; in RL we often learn the model or the value function from data. ...
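A BFS planner for the known deterministic gridworld can be sketched as follows; the 4×4 obstacle-free grid is an illustrative assumption.

```python
from collections import deque

def bfs_plan(start, goal, n=4):
    """Shortest path from start to goal on an n x n grid via BFS.

    Returns the list of states along the path, or None if unreachable.
    """
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    parent = {start: None}  # doubles as the visited set
    frontier = deque([start])
    while frontier:
        s = frontier.popleft()
        if s == goal:
            path = []
            while s is not None:  # walk parents back to the start
                path.append(s)
                s = parent[s]
            return path[::-1]
        for di, dj in moves:
            nxt = (s[0] + di, s[1] + dj)
            if 0 <= nxt[0] < n and 0 <= nxt[1] < n and nxt not in parent:
                parent[nxt] = s
                frontier.append(nxt)
    return None  # goal unreachable
```

BFS visits each state at most once and returns a single start-to-goal path, whereas value iteration sweeps all states repeatedly but yields values (and hence a policy) everywhere, which is the trade-off the objectives ask you to compare.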

March 10, 2026 · 3 min · 443 words · codefrydev

Chapter 62: Intrinsic Motivation

Learning objectives

- Design an intrinsic reward based on state visitation count: bonus = \(1/\sqrt{\text{count}}\) (or similar), so rarely visited states are more attractive.
- Implement an agent that uses total reward = extrinsic + intrinsic and compare exploration behavior (e.g. coverage of the state space) with an agent that uses only extrinsic reward.
- Relate to curiosity and exploration in game AI and robot navigation.

Concept and real-world RL

Intrinsic motivation gives the agent a bonus for visiting novel or surprising states, so it explores even when extrinsic reward is sparse. A count-based bonus \(1/\sqrt{N(s)}\) (inverse square root of visit count) encourages visiting states that have been seen fewer times. In game AI and robot navigation, this can help discover the goal; in recommendation, novelty bonuses encourage diversity. The combination extrinsic + intrinsic balances exploitation (reward) and exploration (novelty). ...
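The count-based bonus can be sketched in a few lines; incrementing the count before computing the bonus is one reasonable convention (so the first visit gets bonus 1), and any scaling of the bonus is an illustrative choice.

```python
import math
from collections import defaultdict

counts = defaultdict(int)  # N(s): visit count per state

def total_reward(state, extrinsic):
    """extrinsic + 1/sqrt(N(s)) count-based novelty bonus."""
    counts[state] += 1
    bonus = 1.0 / math.sqrt(counts[state])
    return extrinsic + bonus
```

As \(N(s)\) grows the bonus decays toward 0, so the total reward approaches the extrinsic reward alone: novelty drives early exploration, extrinsic reward dominates later.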

March 10, 2026 · 3 min · 487 words · codefrydev