Dynamic Programming: Gridworld in Code

Learning objectives

- Implement a 4×4 gridworld environment (states, actions, transitions, rewards) in code.
- Implement iterative policy evaluation and stop when the values converge.
- Implement policy iteration (evaluate, then improve) and optionally value iteration.

Gridworld in code

- States: use a 4×4 grid. States can be (row, col) pairs or flat indices. Terminal states (0,0) and (3,3) have value 0 and are not updated.
- Actions: 0=up, 1=down, 2=left, 3=right. Moving off the grid leaves the agent in place.

...
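The setup above can be sketched in a few lines. This is a minimal sketch, assuming the classic configuration for this grid (equiprobable random policy, reward −1 per step, γ = 1, deterministic moves); the function and variable names are illustrative, not from the post.

```python
import numpy as np

N = 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
TERMINALS = {(0, 0), (3, 3)}

def step(s, a):
    """Deterministic transition: move, staying in place at the border."""
    r, c = s[0] + a[0], s[1] + a[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s

def policy_evaluation(gamma=1.0, theta=1e-6):
    """Iterative policy evaluation for the equiprobable random policy,
    reward -1 per step, sweeping until the largest change is below theta."""
    V = np.zeros((N, N))
    while True:
        delta = 0.0
        for r in range(N):
            for c in range(N):
                s = (r, c)
                if s in TERMINALS:
                    continue  # terminal states keep value 0
                v = sum(0.25 * (-1.0 + gamma * V[step(s, a)]) for a in ACTIONS)
                delta = max(delta, abs(v - V[s]))
                V[s] = v  # in-place sweep; converges to the same fixed point
        if delta < theta:
            return V

V = policy_evaluation()
print(np.round(V, 1))
```

With these assumptions the values converge to the well-known pattern (−14, −18, −20, −22 in the interior, 0 at the terminals). Policy improvement would then pick, in each state, the action whose one-step lookahead value is largest.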

March 10, 2026 · 2 min · 390 words · codefrydev

Monte Carlo in Code

Learning objectives

- Implement first-visit Monte Carlo policy evaluation in code (returns, averaging).
- Implement Monte Carlo control (estimate Q, improve the policy greedily).
- Implement MC control without exploring starts (e.g. with an epsilon-greedy behavior policy).

Monte Carlo policy evaluation in code

- Setup: you have an episodic environment (e.g. blackjack, gridworld) and a fixed policy \(\pi\). Goal: estimate \(V^\pi(s)\).
- Algorithm: run an episode following \(\pi\), collecting \((s_0, r_1, s_1, r_2, \ldots, r_T, s_T)\). For each step \(t\), compute the return from \(t\): \(G_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-1} r_T\) (or loop backward from the end).
- First-visit: for each state \(s\) that appears in the episode, find the first time \(t\) with \(s_t = s\). Add \(G_t\) to a list (or running sum) for state \(s\) and increment the count for \(s\).
- After many episodes: \(V(s) = \) (sum of returns from first visits to \(s\)) / (count of first visits to \(s\)).
- Code sketch: use a dict returns[s] = [] or (total, count). In each episode, track which states have been seen; on the first visit to \(s\) at step \(t\), append \(G_t\) (or add it to the total and increment the count). After all episodes, \(V(s)\) is the mean of the recorded returns for \(s\).

...
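The steps above can be sketched as follows. The environment here is my own toy choice (a five-state random walk with terminals at 0 and 4 and reward 1 only on the right exit), not one from the post; the true values are \(V(k) = k/4\), which makes the estimate easy to check.

```python
import random
from collections import defaultdict

def run_episode(start=2, left=0, right=4):
    """Random walk: step left/right with prob 1/2; reward 1 only when
    the right terminal is reached. Returns [(s_t, r_{t+1}), ...]."""
    s, traj = start, []
    while s not in (left, right):
        s2 = s + random.choice((-1, 1))
        r = 1.0 if s2 == right else 0.0
        traj.append((s, r))
        s = s2
    return traj

def first_visit_mc(num_episodes, gamma=1.0, seed=0):
    """First-visit Monte Carlo policy evaluation: average the return
    observed after the FIRST visit to each state in each episode."""
    random.seed(seed)
    totals = defaultdict(float)   # running sum of first-visit returns
    counts = defaultdict(int)     # number of first visits per state
    for _ in range(num_episodes):
        traj = run_episode()
        # Compute G_t for every t by looping backward from the end.
        G, returns = 0.0, [0.0] * len(traj)
        for t in range(len(traj) - 1, -1, -1):
            G = traj[t][1] + gamma * G
            returns[t] = G
        # Credit only the first visit to each state.
        seen = set()
        for t, (s, _) in enumerate(traj):
            if s not in seen:
                seen.add(s)
                totals[s] += returns[t]
                counts[s] += 1
    return {s: totals[s] / counts[s] for s in counts}

V = first_visit_mc(20000)
```

Storing `(total, count)` instead of a list of returns, as the sketch above does, gives the same mean with constant memory per state.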

March 10, 2026 · 3 min · 464 words · codefrydev

TD, SARSA, and Q-Learning in Code

Learning objectives

- Implement TD(0) prediction in code: update \(V(s)\) after each transition.
- Implement SARSA (on-policy TD control): update \(Q(s,a)\) using the next action chosen by the behavior policy.
- Implement Q-learning (off-policy TD control): update \(Q(s,a)\) using the max over next actions.

TD(0) prediction in code

- Goal: estimate \(V^\pi\) for a fixed policy \(\pi\).
- Update: after each transition \((s, r, s')\):
\[ V(s) \leftarrow V(s) + \alpha \bigl[ r + \gamma V(s') - V(s) \bigr] \]
- Use \(V(s') = 0\) if \(s'\) is terminal.

...
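The TD(0) update can be sketched on the same kind of toy problem. The five-state random walk below (terminals 0 and 4, reward 1 on the right exit, true values \(V(k) = k/4\)) is my own example, not from the post; note the target uses 0 in place of \(V(s')\) when \(s'\) is terminal, exactly as the update rule requires.

```python
import random

def td0_random_walk(num_episodes=10000, alpha=0.05, gamma=1.0, seed=0):
    """TD(0) prediction for the uniform random policy on a 5-state
    random walk (terminals 0 and 4, reward 1 on reaching state 4)."""
    random.seed(seed)
    V = [0.0] * 5                       # terminal entries stay at 0
    for _ in range(num_episodes):
        s = 2
        while s not in (0, 4):
            s2 = s + random.choice((-1, 1))
            r = 1.0 if s2 == 4 else 0.0
            # TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)],
            # with V(s') = 0 when s' is terminal.
            target = r + gamma * (0.0 if s2 in (0, 4) else V[s2])
            V[s] += alpha * (target - V[s])
            s = s2
    return V

V = td0_random_walk()
```

SARSA and Q-learning follow the same pattern with a table `Q[s][a]`: SARSA bootstraps from \(Q(s', a')\) for the action \(a'\) the behavior policy actually takes next, while Q-learning bootstraps from \(\max_{a'} Q(s', a')\) regardless of which action is taken.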

March 10, 2026 · 2 min · 351 words · codefrydev

Where to Get the Code

Learning objectives

- Find the official repository (if any) for curriculum code and solutions.
- Know how to run and extend the exercises locally.

Where to get the code

The curriculum is hosted on GitHub (the “Suggest Changes” edit link on each page points to the repo). Code snippets appear inside the chapter pages (exercises, hints, and worked solutions). You can:

- Copy from the site: type or copy the code from the exercise and solution sections into your own scripts or notebooks. This is the intended way to learn: you implement and run the code locally.
- Clone the repo (if a separate code repo exists): if the project provides a dedicated code repository (e.g. Reinforcement or a code subfolder), clone it and run the examples. Check the home page or the repository README for the exact URL and setup instructions.
- Use your own code: the exercises describe what to implement, so you can write your own from scratch. The worked solutions are there to check your approach.

Setup

To run the code you write (or clone), you need Python and the libraries used in the curriculum. See Setting Up Your Environment and Installing Libraries in the Appendix. Use a virtual or conda environment so dependencies do not conflict with other projects.

...
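The setup advice above might look like this with the built-in `venv` module. This is a minimal sketch assuming Python 3 is on your PATH; the environment name and the commented package list are examples, so check the repository README for the actual requirements.

```shell
# Create an isolated environment for the curriculum code
# (the name "rl-env" and the package list below are examples;
#  see the repository README for the real requirements).
python3 -m venv rl-env
. rl-env/bin/activate              # on Windows: rl-env\Scripts\activate
python -m pip --version            # pip is bundled into the new venv
# python -m pip install numpy matplotlib   # then install what the chapters use
```

A conda environment (`conda create -n rl-env python`) serves the same purpose if you prefer the conda toolchain.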

March 10, 2026 · 2 min · 240 words · codefrydev