Open drills notebook (interactive)

Short problems for Volume 1. Aim for under 5 minutes per problem. All solutions are in collapsible sections.


Recall (R) — State definitions and rules

R1. Write the Bellman expectation equation for V^π(s) in words: “The value of state s under policy π is ___.”

Answer

“The value of state s under policy π is the expected immediate reward plus the discounted expected value of the next state, averaged over actions (weighted by the policy) and transitions.”

Formally: V^π(s) = Σ_a π(a|s) Σ_{s’} P(s’|s,a) [R(s,a,s’) + γ V^π(s’)].
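In code, one backup of this equation looks like the following (a minimal sketch for a hypothetical two-action state; all names and numbers below are made up for illustration):

```python
# One Bellman expectation backup for a tiny, hypothetical MDP.
# policy[s][a] = pi(a|s); trans[s][a] = list of (prob, reward, next_state).
def bellman_expectation_backup(s, V, policy, trans, gamma=0.9):
    return sum(
        pi_a * sum(p * (r + gamma * V[s2]) for p, r, s2 in trans[s][a])
        for a, pi_a in policy[s].items()
    )

V = {"A": 0.5, "B": 0.4}
policy = {"A": {"stay": 0.5, "go": 0.5}}
trans = {"A": {"stay": [(1.0, 0.0, "A")], "go": [(1.0, 1.0, "B")]}}
v_A = bellman_expectation_backup("A", V, policy, trans)
# 0.5*(0 + 0.9*0.5) + 0.5*(1 + 0.9*0.4) = 0.225 + 0.68 = 0.905
```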


R2. What are the five components of an MDP? Write the tuple notation.

Answer
(S, A, P, R, γ): S = state space, A = action space, P = transition probabilities P(s’|s,a), R = reward function R(s,a,s’), γ = discount factor.

R3. What is the Markov property? Why does RL need it?

Answer

The Markov property: P(S_{t+1}|S_0,…,S_t, A_0,…,A_t) = P(S_{t+1}|S_t, A_t). The future depends only on the current state and action, not the full history.

Why RL needs it: Without the Markov property, the agent would need to remember its entire history to make good decisions. The Markov property allows decision-making based only on the current state — making the problem tractable.


R4. What is the difference between policy evaluation and policy iteration?

Answer

Policy evaluation: compute V^π for a fixed policy π (using Bellman expectation equations iteratively).

Policy iteration: alternate between policy evaluation and policy improvement (act greedily w.r.t. V^π) until the policy stops changing. Converges to the optimal policy π*.
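The alternation can be sketched on a tiny, hypothetical deterministic MDP (state names and rewards here are invented for illustration):

```python
# Policy iteration sketch for a deterministic MDP.
# trans[s][a] = (reward, next_state).
def policy_iteration(states, actions, trans, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in states}
    pi = {s: actions[0] for s in states}
    while True:
        # 1) Policy evaluation: iterate the Bellman expectation equation.
        while True:
            delta = 0.0
            for s in states:
                r, s2 = trans[s][pi[s]]
                v = r + gamma * V[s2]
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # 2) Policy improvement: act greedily w.r.t. the current V.
        stable = True
        for s in states:
            best = max(actions, key=lambda a: trans[s][a][0] + gamma * V[trans[s][a][1]])
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:  # policy stopped changing -> optimal
            return pi, V

# Going to B pays +1, going to A pays 0, from either state.
trans = {s: {"to_A": (0.0, "A"), "to_B": (1.0, "B")} for s in "AB"}
pi, V = policy_iteration(["A", "B"], ["to_A", "to_B"], trans)
```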


R5. When does value iteration converge, and how do you know?

Answer

Value iteration converges when the maximum change in V across all states in one sweep is less than a threshold ε (e.g. 1e-6): max_s |V_new(s) - V_old(s)| < ε.

The Bellman optimality operator is a contraction mapping (with factor γ), guaranteeing convergence to V*.
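The stopping test can be written directly (a minimal sketch; the two-state chain below is a made-up example):

```python
# Value iteration with the epsilon stopping test described above.
# trans[state] = list of (reward, next_state), one entry per action.
def value_iteration(trans, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in trans}
    while True:
        # Fresh dict: every update in a sweep uses the old V.
        V_new = {s: max(r + gamma * V[s2] for r, s2 in trans[s]) for s in trans}
        if max(abs(V_new[s] - V[s]) for s in trans) < eps:  # max_s |V_new - V_old|
            return V_new
        V = V_new

# Hypothetical chain: A -> B (reward 0), B loops on itself (reward 1).
V = value_iteration({"A": [(0.0, "B")], "B": [(1.0, "B")]})
```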


Compute (C) — Numerical exercises

C1. Compute the discounted return for rewards = [2, 0, -1] and γ = 0.9.

Try it — edit and run (Shift+Enter)
Answer
G = 2×1 + 0×0.9 + (-1)×0.81 = 2 - 0.81 = 1.19.
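A quick numerical check in plain Python, assuming the usual definition G = Σ_t γ^t r_t:

```python
rewards, gamma = [2, 0, -1], 0.9
G = sum(gamma**t * r for t, r in enumerate(rewards))
# equals 1.19 up to float rounding
```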

C2. In a 2-state MDP (A and B), both non-terminal: from A, go to B with probability 1, reward = 0. From B, go to A with probability 1, reward = 1. γ = 0.9. Current estimates: V(A) = 0.5, V(B) = 0.4.

Compute one Bellman backup: V_new(A) and V_new(B).

Try it — edit and run (Shift+Enter)
Answer
V_new(A) = 0 + 0.9 × 0.4 = 0.36. V_new(B) = 1 + 0.9 × 0.5 = 1.45.
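The same backup, checked in code (states labelled A and B as in the problem):

```python
gamma = 0.9
V = {"A": 0.5, "B": 0.4}
# A -> B with reward 0; B -> A with reward 1 (both deterministic).
V_new = {"A": 0.0 + gamma * V["B"], "B": 1.0 + gamma * V["A"]}
```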

C3. A 3×3 gridworld. State (2,2) is the goal (reward +1); all other states give reward 0, and γ = 0.99. The goal’s value depends on the formulation: V*(2,2) = 1/(1-γ) if the agent stays at the goal collecting +1 every step (continuing), or 1 if the episode ends there (the one-step reward). What is V*(2,1), one step from the goal?

Answer

If the episode terminates at the goal, take V*(2,2) = 1 (the one-step reward): V*(2,1) = 0 + γ × V*(2,2) = 0.99 × 1 = 0.99.

If the agent stays in the goal state with reward +1 per step: V*(goal) = 1/(1-0.99) = 100, and V*(2,1) = 0 + 0.99 × 100 = 99. (Depends on episodic vs continuing formulation — episodic gives the simpler answer.)


C4. Policy iteration step: given V^π(A)=0.3, V^π(B)=0.8, and transitions: from state S, action “go to A” gives R=0, action “go to B” gives R=0.1. γ=0.9. What is the greedy action?

Answer

Q(S, go-to-A) = 0 + 0.9 × 0.3 = 0.27. Q(S, go-to-B) = 0.1 + 0.9 × 0.8 = 0.1 + 0.72 = 0.82.

Greedy action: go to B (Q=0.82 > Q=0.27).
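Checking the greedy choice in code:

```python
gamma = 0.9
V = {"A": 0.3, "B": 0.8}
Q = {"go-to-A": 0.0 + gamma * V["A"], "go-to-B": 0.1 + gamma * V["B"]}
greedy_action = max(Q, key=Q.get)  # action with the largest Q-value
```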


C5. One step of value iteration for a 2-action state: actions up and down, both deterministic.

  • up → reward 0, next state has V=0.5
  • down → reward 1, next state has V=0.2
  • γ=0.9

Compute V*(s).

Try it — edit and run (Shift+Enter)
Answer
Q(up) = 0 + 0.9×0.5 = 0.45. Q(down) = 1 + 0.9×0.2 = 1.18. V*(s) = max(0.45, 1.18) = 1.18 (optimal action: down).
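The max backup, as code:

```python
gamma = 0.9
# (reward + gamma * next-state value) for each action
q = {"up": 0.0 + gamma * 0.5, "down": 1.0 + gamma * 0.2}
v_star = max(q.values())
best = max(q, key=q.get)
```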

Code (K) — Implementation

K1. Implement discounted_return(rewards, gamma) without using NumPy.

Try it — edit and run (Shift+Enter)
Solution
def discounted_return(rewards, gamma=0.9):
    return sum(gamma**t * r for t, r in enumerate(rewards))

K2. Implement one sweep of iterative policy evaluation for a linear chain of 3 states (A, B, C) where C is terminal. Policy: always go right. Rewards: 0 for A→B and 0 for B→C; the +1 for reaching C is folded into a fixed terminal value V(C) = 1. γ=0.9.

Try it — edit and run (Shift+Enter)
Solution
After a couple of sweeps the values converge: V(C) = 1.0 (fixed terminal value), V(B) = 0.9 × 1.0 = 0.9, V(A) = 0.9 × 0.9 = 0.81 — a geometric series in γ = 0.9.
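One way to write the sweep, using the convention that the terminal value V(C) is pinned to 1 (the +1 for reaching the goal; move rewards are 0):

```python
# Iterative policy evaluation on the chain A -> B -> C (C terminal).
gamma = 0.9
V = {"A": 0.0, "B": 0.0, "C": 1.0}  # V(C) fixed at the goal reward
next_state = {"A": "B", "B": "C"}
for _ in range(20):                 # converges after two sweeps here
    for s in ("A", "B"):            # policy: always go right
        V[s] = 0.0 + gamma * V[next_state[s]]
```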

Debug (D) — Find and fix the bug

D1. This value iteration code has a bug. Find and fix it.

def value_iteration(V, transitions, gamma=0.9, n_iter=100):
    for _ in range(n_iter):
        for state in V:
            # transitions[state] = list of (reward, next_state)
            V[state] = max(r + gamma * V[next_s]
                          for r, next_s in transitions[state])
    return V

Try it — edit and run (Shift+Enter)
Answer
The bug: in-place mutation of V during a sweep means earlier states (already updated) are used when computing later states, giving inconsistent results. The fix: compute all new values using the old V, then replace V.
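One way to write the fix, keeping the original signature (the two-state transitions below are a made-up check):

```python
def value_iteration(V, transitions, gamma=0.9, n_iter=100):
    for _ in range(n_iter):
        # Build the new values entirely from the old V, then swap.
        V = {
            state: max(r + gamma * V[next_s] for r, next_s in transitions[state])
            for state in V
        }
    return V

# Check: B loops on itself with reward 1, so V(B) -> 1/(1-0.9) = 10.
V = value_iteration({"A": 0.0, "B": 0.0},
                    {"A": [(0, "B")], "B": [(1, "B")]})
```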

D2. Find the bug in this epsilon-greedy implementation:

def epsilon_greedy(Q, epsilon=0.1):
    import random
    if random.random() > epsilon:
        return random.randrange(len(Q))
    return Q.index(max(Q))
Answer
The condition is reversed. random.random() > epsilon is True ~90% of the time (with ε=0.1), so the code mostly does random exploration (not greedy). Fix: if random.random() < epsilon: for exploration, else: return Q.index(max(Q)) for exploitation.
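The corrected version:

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    if random.random() < epsilon:      # explore with probability epsilon
        return random.randrange(len(Q))
    return Q.index(max(Q))             # otherwise exploit (greedy)
```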

Challenge (X)

X1. Implement full policy iteration on a 3×3 gridworld: start at (0,0), goal at (2,2) with reward +1, -1 per step; bumping into a wall costs -1 and leaves the agent in place. Use a tabular representation. Report the optimal policy (one arrow per cell) and the number of policy-improvement iterations until convergence.

Try it — edit and run (Shift+Enter)
Hint
  1. Initialize V=0, π=random for all states.
  2. Policy evaluation: iterate Bellman expectation until convergence.
  3. Policy improvement: for each state, set π(s) = argmax_a Q(s,a) using current V.
  4. Repeat until π doesn’t change. Expected result: arrows pointing toward (2,2) along the optimal path.
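One possible sketch of the steps above (the step-cost conventions — -1 per move, stay-in-place on wall bumps, termination at the goal — are my reading of the problem):

```python
# Tabular policy iteration on the 3x3 grid with goal at (2,2).
GOAL = (2, 2)
ACTIONS = {"↑": (-1, 0), "↓": (1, 0), "←": (0, -1), "→": (0, 1)}

def step(s, a):
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    if not (0 <= r < 3 and 0 <= c < 3):   # bump into wall: stay, pay -1
        return s, -1.0
    s2 = (r, c)
    return s2, (1.0 if s2 == GOAL else -1.0)

def policy_iteration(gamma=0.99, tol=1e-8):
    states = [(r, c) for r in range(3) for c in range(3) if (r, c) != GOAL]
    V = {s: 0.0 for s in states}
    V[GOAL] = 0.0                          # terminal: no future reward
    pi = {s: "→" for s in states}
    n_improvements = 0
    while True:
        # Policy evaluation: sweep until values stop changing.
        while True:
            delta = 0.0
            for s in states:
                s2, r = step(s, pi[s])
                v = r + gamma * V[s2]
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Policy improvement: act greedily w.r.t. V.
        stable = True
        for s in states:
            best = max(ACTIONS, key=lambda a: step(s, a)[1] + gamma * V[step(s, a)[0]])
            if best != pi[s]:
                pi[s], stable = best, False
        n_improvements += 1
        if stable:
            return pi, V, n_improvements

pi, V, n = policy_iteration()
```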