15 problems combining Python, probability, and toy RL. Complete before starting Volume 1.
March 19, 2026 · 10 min · 1983 words · codefrydev
These 15 exercises sit at Level 2.5 — after prerequisites (Python, NumPy, math) but before Volume 1. They combine the skills you have learned so far in a context that looks like real RL code. If you can complete all 15, you are ready for the curriculum.
Each exercise has an interactive REPL so you can run code in your browser.
Simulate pulling a single bandit arm 1000 times. The arm gives reward Normal(true_mean=0.5, std=1). Compute the sample mean. It should be close to 0.5.
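One way to sketch this using only the standard library (`random.gauss` stands in for NumPy's normal sampler; the seed is an arbitrary choice for reproducibility):

```python
import random

random.seed(0)  # any seed works; fixed here so the run is reproducible

# Pull the arm 1000 times: reward ~ Normal(mean=0.5, std=1)
rewards = [random.gauss(0.5, 1.0) for _ in range(1000)]
sample_mean = sum(rewards) / len(rewards)
print(f"Sample mean: {sample_mean:.3f}")  # should land close to 0.5
```

With 1000 samples the standard error is about 1/√1000 ≈ 0.03, so the sample mean should sit within a few hundredths of 0.5.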
Write valid_moves(state, n=4) that returns a list of valid action indices (0=up, 1=down, 2=left, 3=right) for an n×n grid. An action is valid if it stays in bounds.
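A minimal sketch, assuming `state` is a `(row, col)` tuple and that "up" decreases the row index (the usual grid-world convention; the exercise does not pin this down):

```python
def valid_moves(state, n=4):
    """Return valid action indices (0=up, 1=down, 2=left, 3=right) on an n x n grid.

    An action is valid if it keeps (row, col) inside [0, n-1] x [0, n-1].
    """
    row, col = state
    moves = []
    if row > 0:
        moves.append(0)  # up
    if row < n - 1:
        moves.append(1)  # down
    if col > 0:
        moves.append(2)  # left
    if col < n - 1:
        moves.append(3)  # right
    return moves

print(valid_moves((0, 0)))  # corner: only down and right -> [1, 3]
```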
Simulate a 1D random walk: 5 positions (0–4). Start at position 2. Each step, go left or right with equal probability. Stop at 0 or 4. Run 2000 episodes. Print what fraction of episodes end at position 4.
Solution
```python
import random

random.seed(42)

def random_walk_episode(start=2):
    pos = start
    while pos not in (0, 4):
        pos += 1 if random.random() < 0.5 else -1
    return pos

results = [random_walk_episode() for _ in range(2000)]
print(f"Fraction ending at 4: {sum(r == 4 for r in results) / len(results):.3f}")
```
Why this matters: Value function estimation uses many such “episodes” to estimate how often good outcomes occur from each state.
An epsilon-greedy policy selects: random action with probability ε, best action (argmax Q) otherwise. Implement eps_greedy_policy(Q_s, epsilon, n_actions) where Q_s is a list indexed by action.
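A sketch of one straightforward implementation (ties in the argmax go to the lowest index, which is one common convention; a random tie-break is equally valid):

```python
import random

def eps_greedy_policy(Q_s, epsilon, n_actions):
    """With probability epsilon pick a uniformly random action; otherwise argmax Q_s."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q_s[a])

random.seed(1)
Q_s = [0.2, 0.8, 0.1]
actions = [eps_greedy_policy(Q_s, 0.1, 3) for _ in range(1000)]
# Arm 1 should be chosen about 0.9 + 0.1/3 ~ 93% of the time
print(f"Fraction choosing best arm: {actions.count(1) / 1000:.2f}")
```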
The true values: solving V(A) = 0.9·V(B) and V(B) = 1 + 0.9·V(A) simultaneously gives V(A) ≈ 4.74 and V(B) ≈ 5.26. Intuitively, a reward of 1 is collected once per two-step A→B→A cycle, and these rewards are summed with discount 0.9 over an infinite horizon.
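You can confirm the closed-form solution numerically by iterating the two Bellman equations to their fixed point (a tiny instance of value iteration):

```python
# Iterate V(A) <- 0.9 * V(B), V(B) <- 1 + 0.9 * V(A) until convergence.
# Each sweep is a 0.9-contraction, so 200 sweeps is far more than enough.
va, vb = 0.0, 0.0
for _ in range(200):
    va, vb = 0.9 * vb, 1 + 0.9 * va
print(f"V(A) = {va:.2f}, V(B) = {vb:.2f}")  # -> V(A) = 4.74, V(B) = 5.26
```

The exact fixed point is V(B) = 1/(1 − 0.81) = 1/0.19 ≈ 5.263 and V(A) = 0.9·V(B) ≈ 4.737, matching the simultaneous-equation solution above.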
Implement a full 2000-step bandit agent: 5 arms with random true means (seed=42). Use epsilon-greedy (ε=0.1) and incremental updates. Report total reward, best arm found, and whether it matches the true best arm.
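One possible sketch. The exercise does not specify how the true means are drawn or the reward noise, so drawing means from Normal(0, 1) and using unit-variance reward noise are assumptions here:

```python
import random

random.seed(42)
n_arms, n_steps, epsilon = 5, 2000, 0.1
true_means = [random.gauss(0, 1) for _ in range(n_arms)]  # assumption: means ~ N(0, 1)

Q = [0.0] * n_arms  # value estimates
N = [0] * n_arms    # pull counts
total_reward = 0.0

for _ in range(n_steps):
    # Epsilon-greedy action selection
    if random.random() < epsilon:
        a = random.randrange(n_arms)
    else:
        a = max(range(n_arms), key=lambda i: Q[i])
    r = random.gauss(true_means[a], 1.0)  # assumption: unit-variance reward noise
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]  # incremental sample-average update
    total_reward += r

best_est = max(range(n_arms), key=lambda i: Q[i])
true_best = max(range(n_arms), key=lambda i: true_means[i])
print(f"Total reward: {total_reward:.1f}")
print(f"Best arm found: {best_est}, true best: {true_best}, match: {best_est == true_best}")
```

With 2000 steps and ε = 0.1, the estimated best arm almost always matches the true best arm, though an unlucky seed can still mislead the sample averages.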
This function is supposed to compute the first-visit MC return for a list of (state, reward) pairs, but there is a bug. Find and fix it.
```python
def first_visit_mc(episode, target_state):
    """Return the return G from first visit to target_state."""
    states = [s for s, r in episode]
    rewards = [r for s, r in episode]
    if target_state not in states:
        return None
    t = states.index(target_state)
    G = sum(rewards[t:])  # bug: should this be rewards[t:] or rewards[t+1:]?
    return G
```
Answer and explanation
The code is actually correct for this MC formulation. G_t = r_t + r_{t+1} + ... includes the reward received at step t. rewards[t:] starts from the reward received when visiting target_state, which is the standard first-visit MC definition.
The common bug students introduce is using rewards[t+1:] (skipping the immediate reward at the visited state). Whether r_t should be included depends on the convention: some formulations use G_t = r_{t+1} + γr_{t+2} + ... (reward after the state). The key is consistency.
Lesson: Always check whether your return formula includes the reward received at the current step or only future rewards. Sutton & Barto use G_t = R_{t+1} + γR_{t+2} + ... (reward after action), so the reward at step t is rewards[t] in 0-indexed code.
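The two conventions are easy to confuse, so here is a toy episode (hypothetical states "A" and "B") where they give different answers:

```python
# Episode as (state, reward) pairs; reward r_t is the reward recorded at step t.
episode = [("A", 0), ("B", 1), ("A", 0), ("B", 1)]
rewards = [r for _, r in episode]

t = [s for s, _ in episode].index("B")  # first visit to B is at t = 1
g_inclusive = sum(rewards[t:])      # includes r_t:      1 + 0 + 1 = 2
g_exclusive = sum(rewards[t + 1:])  # only future rewards:    0 + 1 = 1
print(g_inclusive, g_exclusive)  # -> 2 1
```

Both are legitimate returns; they just answer slightly different questions. Pick one convention and keep your environment's reward bookkeeping consistent with it.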
```python
import random

random.seed(0)

def step(state, action):
    row, col = state
    dr = [-1, 1, 0, 0]
    dc = [0, 0, -1, 1]
    nr = row + dr[action]
    nc = col + dc[action]
    if not (0 <= nr <= 2 and 0 <= nc <= 2):
        return state, -1, False
    if (nr, nc) == (2, 2):
        return (2, 2), 1, True
    return (nr, nc), -1, False

Q = {}
alpha, gamma, epsilon = 0.1, 0.99, 0.1
n_actions = 4

def get_q(s, a):
    return Q.get((s, a), 0.0)

def best_a(s):
    return max(range(n_actions), key=lambda a: get_q(s, a))

# Q-learning training loop
for ep in range(500):
    s = (0, 0)
    done = False
    steps = 0
    while not done and steps < 50:
        a = random.randrange(n_actions) if random.random() < epsilon else best_a(s)
        ns, r, done = step(s, a)
        target = r + (gamma * max(get_q(ns, a2) for a2 in range(n_actions)) if not done else 0)
        Q[(s, a)] = get_q(s, a) + alpha * (target - get_q(s, a))
        s = ns
        steps += 1

# Evaluate the greedy policy
steps_list = []
for _ in range(100):
    s = (0, 0)
    done = False
    steps = 0
    while not done and steps < 50:
        s, _, done = step(s, best_a(s))
        steps += 1
    steps_list.append(steps)

print(f"Mean steps to goal: {sum(steps_list) / len(steps_list):.1f}")
```