Chapter 5: Value Functions

Learning objectives
- Define the state-value function \(V^\pi(s)\) as the expected return from state \(s\) under policy \(\pi\).
- Write and solve the Bellman expectation equation for a small MDP.
- Use the matrix form (linear system) when the MDP is finite.

Concept and real-world RL
The state-value function \(V^\pi(s)\) is the expected (discounted) return starting from state \(s\) and following policy \(\pi\). It answers: “How good is it to be in this state if I follow this policy?” In games, \(V(s)\) is like the expected outcome from a board position; in navigation, it is the expected cumulative reward from a location. The Bellman expectation equation expresses \(V^\pi\) in terms of the immediate reward and the value of the next state; for finite MDPs it becomes a linear system \(V = r + \gamma P V\) that we can solve by matrix inversion or iteration. ...
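As a concrete sketch, the linear system \(V = r + \gamma P V\) can be solved directly for a small finite MDP by rearranging it to \((I - \gamma P)V = r\). The three-state chain below (transition matrix under the policy, rewards, absorbing final state) is an invented example, not one from the chapter:

```python
import numpy as np

# Hypothetical 3-state chain under a fixed policy (illustrative values):
# P[i, j] = probability of moving from state i to state j under pi,
# r[i]    = expected immediate reward in state i. State 2 is absorbing
# and pays +1 per step.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
r = np.array([0.0, 0.0, 1.0])
gamma = 0.9

# Bellman expectation equation in matrix form: V = r + gamma * P @ V.
# Rearranged to (I - gamma * P) V = r and solved as a linear system.
V = np.linalg.solve(np.eye(3) - gamma * P, r)
```

For the absorbing state this recovers the geometric series \(V = 1/(1-\gamma) = 10\); the same fixed point is reachable by repeatedly applying \(V \leftarrow r + \gamma P V\).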

March 10, 2026 · 3 min · 620 words · codefrydev

Chapter 7: Dynamic Programming — Policy Evaluation

Learning objectives
- Implement iterative policy evaluation (Bellman expectation updates) for a finite MDP.
- Use a gridworld with terminal states and interpret the resulting value function.
- Decide when to stop iterating (e.g. when the maximum change falls below a threshold).

Concept and real-world RL
Policy evaluation computes \(V^\pi\) for a given policy \(\pi\). Iterative policy evaluation starts from an arbitrary \(V\) (e.g. all zeros) and repeatedly applies the Bellman expectation update: \(V(s) \leftarrow \sum_a \pi(a|s) \sum_{s',r} P(s',r|s,a)\,[r + \gamma V(s')]\). This converges to \(V^\pi\) for finite MDPs. In a gridworld, values spread outward from the terminal states (goal or trap); the result shows “how good” each cell is under the policy. This is the building block for policy iteration (evaluate, then improve the policy). ...
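A minimal sketch of iterative policy evaluation, assuming the classic 4x4 gridworld with terminal corner states, a uniform random policy, and a reward of -1 per step (a standard textbook setup, not necessarily the chapter's exact environment):

```python
import numpy as np

N = 16                 # 4x4 grid, states numbered 0..15 row by row
terminal = {0, 15}     # terminal corners; their value stays 0
gamma = 1.0            # undiscounted episodic task
theta = 1e-6           # stop when the largest update falls below this

def step(s, a):
    """Deterministic move; stepping off the grid leaves the state unchanged."""
    row, col = divmod(s, 4)
    if a == 0:   row = max(row - 1, 0)   # up
    elif a == 1: row = min(row + 1, 3)   # down
    elif a == 2: col = max(col - 1, 0)   # left
    else:        col = min(col + 1, 3)   # right
    return row * 4 + col

V = np.zeros(N)
while True:
    delta = 0.0
    for s in range(N):
        if s in terminal:
            continue
        # Bellman expectation update under the uniform random policy:
        # each of the 4 actions has probability 0.25 and reward -1.
        v = sum(0.25 * (-1.0 + gamma * V[step(s, a)]) for a in range(4))
        delta = max(delta, abs(v - V[s]))
        V[s] = v          # in-place (Gauss-Seidel style) update
    if delta < theta:
        break
```

The in-place variant shown here typically converges faster than keeping two arrays; the values are negative everywhere except the terminals, growing more negative with distance from a goal.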

March 10, 2026 · 4 min · 703 words · codefrydev

Monte Carlo in Code

Learning objectives
- Implement first-visit Monte Carlo policy evaluation in code (returns, averaging).
- Implement Monte Carlo control (estimate Q, improve the policy greedily).
- Implement MC control without exploring starts (e.g. epsilon-greedy behavior).

Monte Carlo policy evaluation in code
Setup: an episodic environment (e.g. blackjack, gridworld) and a fixed policy \(\pi\). Goal: estimate \(V^\pi(s)\).

Algorithm:
1. Run an episode: follow \(\pi\) and collect \((s_0, r_1, s_1, r_2, \ldots, r_T, s_T)\).
2. For each step \(t\), compute the return from \(t\): \(G_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-1} r_T\) (or loop backward from the end of the episode).
3. First-visit: for each state \(s\) that appears in the episode, find the first time \(t\) with \(s_t = s\). Add \(G_t\) to a list (or running sum) for state \(s\) and increment the count for \(s\).
4. After many episodes: \(V(s) = \) (sum of returns from first visits to \(s\)) / (count of first visits to \(s\)).

Code sketch: use a dict returns[s] = [] or (total, count). In each episode, track which states have been seen; on the first visit to \(s\) at step \(t\), append \(G_t\) (or add it to the total and increment the count). After all episodes, \(V(s)\) is the mean of the returns recorded for \(s\). ...
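The first-visit procedure above might be sketched as follows. The environment is an invented five-state random walk (start in the middle, move left or right uniformly, terminate at either end, reward +1 only on reaching the right end), used here purely to make the averaging concrete:

```python
import random
from collections import defaultdict

gamma = 1.0
random.seed(0)

def run_episode():
    """Follow the uniform random policy from state 2; states 0 and 4 are terminal."""
    s = 2
    states, rewards = [], []
    while s not in (0, 4):
        states.append(s)
        s += random.choice((-1, 1))
        rewards.append(1.0 if s == 4 else 0.0)  # +1 only on reaching the right end
    return states, rewards

returns_sum = defaultdict(float)    # running sum of first-visit returns per state
returns_count = defaultdict(int)    # count of first visits per state

for _ in range(20000):
    states, rewards = run_episode()
    # Compute G_t for every step by looping backward from the end.
    G, G_at = 0.0, {}
    for t in range(len(states) - 1, -1, -1):
        G = rewards[t] + gamma * G
        G_at[t] = G
    # First-visit: keep only the earliest occurrence of each state.
    first = {}
    for t, s in enumerate(states):
        if s not in first:
            first[s] = t
    for s, t in first.items():
        returns_sum[s] += G_at[t]
        returns_count[s] += 1

V = {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

For this walk the true values are \(V(s) = s/4\), so the estimates should land near 0.25, 0.5, and 0.75 for states 1, 2, and 3; swapping the sum/count pair for a list of returns gives the `returns[s] = []` variant mentioned in the sketch.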

March 10, 2026 · 3 min · 464 words · codefrydev