Short problems for Volume 1. Aim for under 5 minutes per problem. All solutions are in collapsible sections.
Recall (R) — State definitions and rules
R1. Write the Bellman expectation equation for V^π(s) in words: “The value of state s under policy π is ___.”
Answer
“The value of state s under policy π is the expected immediate reward plus the discounted expected value of the next state, averaged over actions (weighted by the policy) and transitions.”
Formally: V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')].
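Read as code, the equation is just a weighted sum over actions and next states. A minimal sketch (the toy MDP below — its names and numbers — is invented for illustration):

```python
def bellman_backup(s, policy, P, R, V, gamma):
    """One Bellman expectation backup:
    V_new(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')]."""
    total = 0.0
    for a, pi_a in policy[s].items():          # average over actions
        for s_next, p in P[(s, a)].items():    # average over transitions
            total += pi_a * p * (R[(s, a, s_next)] + gamma * V[s_next])
    return total

# Tiny example: one state "s" with two equally likely actions.
policy = {"s": {"left": 0.5, "right": 0.5}}
P = {("s", "left"): {"t1": 1.0}, ("s", "right"): {"t2": 1.0}}
R = {("s", "left", "t1"): 0.0, ("s", "right", "t2"): 1.0}
V = {"t1": 0.5, "t2": 0.2}
print(bellman_backup("s", policy, P, R, V, 0.9))  # 0.5*(0+0.45) + 0.5*(1+0.18) = 0.815
```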
R2. What are the five components of an MDP? Write the tuple notation.
Answer
An MDP is the tuple (S, A, P, R, γ): the set of states S, the set of actions A, the transition function P(s'|s,a), the reward function R(s,a,s'), and the discount factor γ.
R3. What is the Markov property? Why does RL need it?
Answer
The Markov property: P(S_{t+1}|S_0,…,S_t, A_0,…,A_t) = P(S_{t+1}|S_t, A_t). The future depends only on the current state and action, not the full history.
Why RL needs it: Without the Markov property, the agent would need to remember its entire history to make good decisions. The Markov property allows decision-making based only on the current state — making the problem tractable.
R4. What is the difference between policy evaluation and policy iteration?
Answer
Policy evaluation: compute V^π for a fixed policy π (using Bellman expectation equations iteratively).
Policy iteration: alternate between policy evaluation and policy improvement (act greedily w.r.t. V^π) until the policy stops changing. Converges to the optimal policy π*.
R5. When does value iteration converge, and how do you know?
Answer
Value iteration converges when the maximum change in V across all states in one sweep is less than a threshold ε (e.g. 1e-6): max_s |V_new(s) - V_old(s)| < ε.
The Bellman optimality operator is a contraction mapping (with factor γ), guaranteeing convergence to V*.
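As a concrete check, the sweep-and-threshold loop can be written out; the 2-state MDP from problem C2 below serves as the test case:

```python
# Sweep until the largest per-state change falls below eps, on the
# 2-state MDP from C2 (A -> B, reward 0; B -> A, reward 1).
gamma, eps = 0.9, 1e-6
V = {"A": 0.0, "B": 0.0}
sweeps = 0
while True:
    # Each state has a single deterministic action, so the max over actions is trivial.
    V_new = {"A": 0.0 + gamma * V["B"], "B": 1.0 + gamma * V["A"]}
    delta = max(abs(V_new[s] - V[s]) for s in V)
    V, sweeps = V_new, sweeps + 1
    if delta < eps:
        break
# Fixed point: V(B) = 1/(1 - gamma^2) ~= 5.263, V(A) = gamma * V(B) ~= 4.737
print(sweeps, V)
```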
Compute (C) — Numerical exercises
C1. Compute the discounted return for rewards = [2, 0, -1] and γ = 0.9.
Answer
G = 2 + 0.9 × 0 + 0.9² × (−1) = 2 + 0 − 0.81 = 1.19.
C2. In a 2-state MDP (A and B), both non-terminal: from A, go to B with probability 1, reward = 0. From B, go to A with probability 1, reward = 1. γ = 0.9. Current estimates: V(A) = 0.5, V(B) = 0.4.
Compute one Bellman backup: V_new(A) and V_new(B).
Answer
V_new(A) = 0 + 0.9 × V(B) = 0.9 × 0.4 = 0.36.
V_new(B) = 1 + 0.9 × V(A) = 1 + 0.9 × 0.5 = 1.45.
(Both transitions are deterministic, so each backup is a single term.)
C3. A 3×3 gridworld. State (2,2) is the goal (reward +1). All other states give reward 0. Discount γ=0.99. After many iterations, V*(2,2) = 1/(1-γ) if the agent stays in goal, or 1 if the episode ends. What is V*(2,1) if the agent is one step from the goal and γ=0.99?
Answer
If the episode terminates at the goal: V*(2,1) = 0 + γ × V*(2,2) where V*(goal at terminal) = 1 (one-step reward). So V*(2,1) = 0.99 × 1 = 0.99.
If the agent stays in the goal state with reward +1 per step: V*(goal) = 1/(1-0.99) = 100, and V*(2,1) = 0 + 0.99 × 100 = 99. (Depends on episodic vs continuing formulation — episodic gives the simpler answer.)
C4. Policy iteration step: given V^π(A)=0.3, V^π(B)=0.8, and transitions: from state S, action “go to A” gives R=0, action “go to B” gives R=0.1. γ=0.9. What is the greedy action?
Answer
Q(S, go-to-A) = 0 + 0.9 × 0.3 = 0.27. Q(S, go-to-B) = 0.1 + 0.9 × 0.8 = 0.1 + 0.72 = 0.82.
Greedy action: go to B (Q=0.82 > Q=0.27).
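The same computation as a quick programmatic check:

```python
# Q-values from C4, recomputed in code as a sanity check.
gamma = 0.9
V = {"A": 0.3, "B": 0.8}
R = {"go-to-A": 0.0, "go-to-B": 0.1}
# "go-to-A".split("-")[-1] -> "A", i.e. the destination state.
Q = {a: R[a] + gamma * V[a.split("-")[-1]] for a in R}
greedy = max(Q, key=Q.get)
print(Q, greedy)  # {'go-to-A': 0.27, 'go-to-B': 0.82} 'go-to-B'
```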
C5. One step of value iteration for a 2-action state: actions up and down, both deterministic.
- up → reward 0, next state has V=0.5
- down → reward 1, next state has V=0.2
- γ=0.9
Compute V*(s).
Answer
V*(s) = max(0 + 0.9 × 0.5, 1 + 0.9 × 0.2) = max(0.45, 1.18) = 1.18. The greedy action is down.
Code (K) — Implementation
K1. Implement discounted_return(rewards, gamma) without using NumPy.
Solution
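One way to write it (a sketch; the original solution block did not survive extraction):

```python
def discounted_return(rewards, gamma):
    """Sum_t gamma^t * r_t, accumulated right-to-left (Horner's rule)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([2, 0, -1], 0.9))  # matches C1: 1.19
```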
K2. Implement one sweep of iterative policy evaluation for a linear chain of 3 states (A, B, C) where C is terminal. Policy: always go right. Reward: 0 for A→B, 0 for B→C, +1 reaching C. γ=0.9.
Solution
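A possible solution (the original block is missing; this reconstructs it from the problem statement, using a synchronous sweep):

```python
# One sweep of iterative policy evaluation on the chain A -> B -> C (terminal),
# under the deterministic "always go right" policy.
gamma = 0.9
V = {"A": 0.0, "B": 0.0, "C": 0.0}                 # C is terminal; V(C) stays 0
transitions = {"A": ("B", 0.0), "B": ("C", 1.0)}   # state -> (next state, reward)

def sweep(V):
    V_new = dict(V)
    for s, (s_next, r) in transitions.items():
        V_new[s] = r + gamma * V[s_next]           # single action, single transition
    return V_new

V = sweep(V)
print(V)  # after one sweep from V=0: V(A)=0.0, V(B)=1.0
```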
Debug (D) — Find and fix the bug
D1. This value iteration code has a bug. Find and fix it.
| |
Answer
D2. Find the bug in this epsilon-greedy implementation:
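The original snippet was lost in extraction; below is a plausible reconstruction, consistent with the bug described in the answer:

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    # BUG: the comparison is inverted — this branch fires with
    # probability 1 - epsilon, so the agent explores ~90% of the time.
    if random.random() > epsilon:
        return random.randrange(len(Q))   # random exploration
    return Q.index(max(Q))                # greedy exploitation
```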
Answer
With ε = 0.1, random.random() > epsilon is True about 90% of the time, so the condition is inverted: the agent takes a random action ~90% of the time and acts greedily only ~10% of the time. Fix: use if random.random() < epsilon: for exploration, and otherwise return the greedy action Q.index(max(Q)).
Challenge (X)
X1. Implement full policy iteration on a 3×3 gridworld: start at (0,0), goal at (2,2) with reward +1, -1 per step, walls give -1. Use a tabular representation. Report the optimal policy (one arrow per cell) and number of iterations until convergence.
Hint
- Initialize V=0, π=random for all states.
- Policy evaluation: iterate Bellman expectation until convergence.
- Policy improvement: for each state, set π(s) = argmax_a Q(s,a) using current V.
- Repeat until π doesn’t change. Expected result: arrows pointing toward (2,2) along the optimal path.
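The steps above can be sketched compactly. Assumptions (the problem leaves them open): γ = 0.9, deterministic moves, and "walls give -1" means bumping a wall keeps the agent in place at the usual -1 step cost.

```python
gamma = 0.9                      # assumption: X1 does not fix the discount
N, GOAL = 3, (2, 2)
ACTIONS = {"^": (-1, 0), "v": (1, 0), "<": (0, -1), ">": (0, 1)}
states = [(r, c) for r in range(N) for c in range(N)]

def step(s, a):
    """Deterministic move; bumping a wall keeps the agent in place.
    Reward: +1 for entering the terminal goal, -1 otherwise."""
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    s2 = (r, c) if 0 <= r < N and 0 <= c < N else s
    return s2, (1.0 if s2 == GOAL else -1.0)

def policy_iteration():
    V = {s: 0.0 for s in states}
    pi = {s: ">" for s in states if s != GOAL}     # arbitrary initial policy
    for it in range(1, 100):
        while True:                                # policy evaluation
            delta = 0.0
            for s in pi:
                s2, r = step(s, pi[s])
                v = r + (0.0 if s2 == GOAL else gamma * V[s2])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < 1e-8:
                break
        stable = True                              # policy improvement
        for s in pi:
            def q(a):
                s2, r = step(s, a)
                return r + (0.0 if s2 == GOAL else gamma * V[s2])
            best = max(ACTIONS, key=q)
            if q(best) > q(pi[s]) + 1e-12:         # strict improvement only
                pi[s], stable = best, False
        if stable:
            return pi, it

pi, iters = policy_iteration()
for r in range(N):                                 # one arrow per cell, G = goal
    print(" ".join(pi.get((r, c), "G") for c in range(N)))
print("iterations:", iters)
```

Ties between equally good actions are broken by keeping the current action, so the printed arrows may differ from another correct solution on off-path cells.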