Volume 2 Drills — Tabular Model-Free Methods | Reinforcement Learning Curriculum

Short problems for Volume 2. Aim for under 5 minutes per problem. All solutions are in collapsible sections.

Recall (R) — State definitions and rules

R1. Write the TD(0) update rule from memory. What are its three inputs?

Answer

TD(0) update:

V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]

Three inputs: (1) the current state S_t, (2) the reward R_{t+1}, (3) the next state S_{t+1}.

The term in brackets is the TD error δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t). It measures how wrong the current estimate was.

R2. What is the difference between on-policy and off-policy learning? Give one algorithm of each type.

Answer

On-policy: the policy being evaluated/improved is the same policy used to generate experience. Example: SARSA — updates Q using the action actually taken by the current policy.

Off-policy: the policy being evaluated is different from the policy generating experience. The agent can learn about a target policy while following a different behavior policy. Example: Q-learning — updates Q using the greedy action (max), regardless of what action was actually taken.

R3. What is the difference between first-visit and every-visit Monte Carlo?

Answer

Both estimate V(s) by averaging returns, but they differ in which visits count:

First-visit MC: for each episode, only the first visit to state s contributes a return to the average.
Every-visit MC: every visit to state s in an episode contributes a return.

Both converge to V^π as the number of episodes → ∞. First-visit MC has unbiased estimates; every-visit MC has lower variance when states are visited many times per episode.

R4. Why can Q-learning be riskier than SARSA in real-world (non-simulated) environments?

Answer

Q-learning is optimistic — it learns the value of the greedy policy (max Q) even if the agent actually follows a more exploratory behavior policy. During training, the agent can stray into dangerous states while exploring, but Q-learning’s updates assume it will behave optimally from those states.

SARSA learns the value of the policy it actually follows (including exploration). Near cliffs or dangerous states, SARSA will learn to avoid risky actions because it accounts for the exploratory moves that could lead to disaster. This is the classic “Cliff Walking” result: SARSA learns a safer path, Q-learning learns a shorter but risky path near the cliff edge.

R5. What do n-step returns interpolate between? Write the 1-step and ∞-step special cases.

Answer

n-step returns interpolate between TD(0) (1-step bootstrap) and Monte Carlo (full-episode return):

1-step (n=1): G_t^{(1)} = R_{t+1} + γ V(S_{t+1}) — pure bootstrapping (TD target).
∞-step (n=∞): G_t^{(∞)} = R_{t+1} + γ R_{t+2} + … = actual return — pure Monte Carlo (no bootstrapping).
n-step: G_t^{(n)} = R_{t+1} + γ R_{t+2} + … + γ^{n-1} R_{t+n} + γ^n V(S_{t+n}).

TD(λ) takes a weighted average of all n-step returns using geometric weights (λ^{n-1}).

Compute (C) — Numerical exercises

C1. Apply a TD(0) update by hand: V(A) = 0.3, reward r = 0, V(B) = 0.5, α = 0.1, γ = 0.9. The agent transitions A → B. What is the new V(A)?

Try it — edit and run (Shift+Enter)

Answer

TD error δ = 0 + 0.9 × 0.5 − 0.3 = 0.45 − 0.3 = 0.15.

V_new(A) = 0.3 + 0.1 × 0.15 = 0.315.

C2. For the same transition (A → B, r=0, γ=0.9), compare the SARSA and Q-learning update targets. Current Q-values: Q(A, left)=0.4, Q(A, right)=0.6, Q(B, left)=0.7, Q(B, right)=0.5. The agent was in A, took action “right”, landed in B, and chose action “left” next.

Try it — edit and run (Shift+Enter)

Answer

SARSA target = 0 + 0.9 × Q(B, left) = 0 + 0.9 × 0.7 = 0.63.

Q-learning target = 0 + 0.9 × max(Q(B, ·)) = 0 + 0.9 × 0.7 = 0.63.

They agree here because “left” is the greedy action in B. They differ when the next chosen action is not greedy (e.g. during ε-greedy exploration).

C3. Compute the 2-step return G_t^{(2)} for rewards [R_{t+1}=0, R_{t+2}=1] and V(S_{t+2})=0.4. Use γ=0.9.

Try it — edit and run (Shift+Enter)

Answer

G^{(2)} = 0 + 0.9 × 1 + 0.81 × 0.4 = 0.9 + 0.324 = 1.224.

C4. An agent has k=4 actions and uses ε-greedy with ε=0.1. What is the probability of selecting the greedy action?

Try it — edit and run (Shift+Enter)

Answer

P(greedy) = (1 − ε) + ε/k = 0.9 + 0.1/4 = 0.9 + 0.025 = 0.925.

Each non-greedy action has probability ε/k = 0.025.

C5. Compute the incremental mean. Start with Q = 0. Observe rewards [1.2, 0.8, 1.0]. Update using the incremental update rule Q ← Q + (1/n)(R − Q). What is the final Q?

Try it — edit and run (Shift+Enter)

Answer

Step 1: Q = 0 + (1/1)(1.2 − 0) = 1.2. Step 2: Q = 1.2 + (1/2)(0.8 − 1.2) = 1.2 − 0.2 = 1.0. Step 3: Q = 1.0 + (1/3)(1.0 − 1.0) = 1.0.

Final Q = 1.0 (equals the arithmetic mean of [1.2, 0.8, 1.0]).

Code (K) — Implementation

K1. Implement the td0_update(V, s, r, s_next, alpha, gamma) function.

Try it — edit and run (Shift+Enter)

Solution

1
2
3
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    td_error = r + gamma * V[s_next] - V[s]
    return V[s] + alpha * td_error

V(A): 0.3 + 0.1 × (0 + 0.9×0.5 − 0.3) = 0.315. V(B): 0.5 + 0.1 × (1 + 0.9×0.0 − 0.5) = 0.545.

K2. Implement epsilon-greedy action selection given a Q-value list (one value per action).

Try it — edit and run (Shift+Enter)

Solution

1
2
3
4
def epsilon_greedy(Q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(Q_values))
    return Q_values.index(max(Q_values))

Debug (D) — Find and fix the bug

D1. This SARSA implementation has a bug. Find and fix it.

1
2
3
4
5
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # SARSA: on-policy TD control
    td_target = r + gamma * max(Q[s_next].values())  # Bug!
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q

Answer

The bug: max(Q[s_next].values()) uses the greedy next action — that is the Q-learning target, not SARSA. SARSA must use the actual next action a_next (the one the policy will take).

Fix:

1
td_target = r + gamma * Q[s_next][a_next]

Using max makes it Q-learning (off-policy). SARSA is on-policy: it updates using the next action that was actually sampled from the current policy.

D2. Find the off-by-one error in this episode-end handling:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
def run_episode(env, Q, alpha=0.1, gamma=0.9, epsilon=0.1):
    s = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q[s], epsilon)
        s_next, r, done = env.step(a)
        # TD update — applied even at terminal step
        td_target = r + gamma * max(Q[s_next].values())  # Bug!
        Q[s][a] += alpha * (td_target - Q[s][a])
        s = s_next

Answer

The bug: when done=True, s_next is a terminal state and Q[s_next] should be 0 (no future value). But the code bootstraps off max(Q[s_next].values()) anyway, which is wrong — it uses whatever stale values happen to be in the Q-table for the terminal state.

Fix: guard with if done:

1
td_target = r if done else r + gamma * max(Q[s_next].values())

Equivalently, initialize terminal state Q-values to 0 and never update them, or use V(terminal) = 0 explicitly.

Challenge (X)

X1. Implement Q-learning on a 3×3 gridworld: start at (0,0), goal at (2,2) with reward +1. Each step costs −0.01. Use ε=0.1, α=0.1, γ=0.99. Train for 5000 episodes and print the learned greedy policy (arrows) and total steps per episode (smoothed).

Try it — edit and run (Shift+Enter)

Hint

Initialize Q[(r,c)] = [0.0, 0.0, 0.0, 0.0] for all states.
Each episode: start at (0,0), loop until done or max 200 steps.
Choose action ε-greedy from Q[s]. Take step. Update Q[s][a] using Q-learning target.
After training, for each state print ACTION_SYMBOLS[argmax(Q[s])].

Expected policy: arrows pointing toward (2,2) via the shortest path.

Recall (R) — State definitions and rules#

Compute (C) — Numerical exercises#

Code (K) — Implementation#

Debug (D) — Find and fix the bug#

Challenge (X)#

Recall (R) — State definitions and rules

Compute (C) — Numerical exercises

Code (K) — Implementation

Debug (D) — Find and fix the bug

Challenge (X)