This page covers the tabular methods you need for the preliminary assessment: policy iteration and value iteration, the difference between Monte Carlo and TD, on-policy vs off-policy learning, and the Q-learning update rule.
Why this matters for RL
When the state and action spaces are small enough, we can store one value per state (or state-action) and update them from experience or from the model. Dynamic programming does this when we know the model; Monte Carlo and TD do it from samples. Q-learning is the canonical off-policy TD method and is the basis of many deep RL algorithms (e.g. DQN). You need to know how these methods differ and how to write the Q-learning update.
Learning objectives
Name two DP methods for solving MDPs; explain how MC and TD differ in when they update and what they use as targets; distinguish on-policy from off-policy learning and give an example of each; write the Q-learning update.
Core concepts
- Policy iteration / Value iteration: DP methods that assume we know transition probabilities. Policy iteration alternates policy evaluation (compute \(V^\pi\)) and policy improvement. Value iteration updates values using the Bellman optimality equation until convergence.
- Monte Carlo: Uses full returns (sum of rewards until end of episode) to update value estimates. Updates only after an episode ends.
- TD (Temporal Difference): Updates using a bootstrapped target: the current reward plus the discounted estimate of the next state's value (e.g. \(r + \gamma V(s')\)). Can update every step; doesn't need the full return.
- On-policy: Learns about the policy used to generate the data (e.g. SARSA). Off-policy: Learns about a target policy while following a different behavior policy (e.g. Q-learning).
- Q-learning: \(Q(s,a) \leftarrow Q(s,a) + \alpha \bigl[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\bigr]\). Uses the max over next-state actions (target policy is greedy) while the behavior policy can be exploratory (e.g. ε-greedy).
Illustration (MC vs TD): Monte Carlo updates only at episode end using the full return; TD updates every step using a bootstrapped target. (Figure: average reward per episode over 100 episodes for one-step TD vs MC on the same task.)
Worked problems (with explanations)
1. Two DP methods (Q16)
Q: Name two dynamic programming methods for solving MDPs when the model (transition probabilities) is known.
Policy Iteration and Value Iteration. Both assume we know \(p(s',r|s,a)\). Policy iteration explicitly represents and improves a policy; value iteration works only with values and derives the policy at the end. In model-free RL we don't have \(p\), so we use MC or TD instead, which learn from experience.
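To make the DP side concrete, here is a minimal value-iteration sketch on a toy two-state MDP. The transition table, state names, and action names below are illustrative assumptions, not part of the question:

```python
# Value iteration: V(s) <- max_a sum_{s'} p(s'|s,a) [r + gamma V(s')]
# P[s][a] is a list of (prob, next_state, reward) triples -- an assumed toy model.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
gamma = 0.9
V = {s: 0.0 for s in P}

for _ in range(1000):  # sweep until values stop changing
    delta = 0.0
    for s in P:
        # Bellman optimality backup for state s
        v_new = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s]
        )
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < 1e-8:
        break

# Derive the greedy policy from the converged values
policy = {
    s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
    for s in P
}
```

Note how the policy is only extracted at the end, whereas policy iteration would carry an explicit policy through every round of evaluation and improvement.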
2. Monte Carlo vs TD (Q17)
Q: What is the key difference between Monte Carlo and Temporal Difference (TD) learning in terms of updating value estimates?
Monte Carlo waits until the end of an episode to compute the full return \(G_t = r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t} r_T\) and then updates the value estimate using \(G_t\) as the target, so the update uses the actual return from that trajectory. TD uses bootstrapping: it updates using the immediate reward plus the current estimate of the value of the next state, e.g. target \(r_t + \gamma V(s_{t+1})\), without waiting for the episode to finish, so the target mixes one real reward with an estimate. MC is unbiased (given the trajectory) but has high variance and must wait for the episode to end. TD has lower variance (only one step of randomness) but is biased because it uses an estimate (\(V(s_{t+1})\)) instead of the true return. TD can also learn from incomplete episodes and from continuing tasks. In practice, TD methods (including Q-learning) are very common because they update every step and often learn faster.
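The two targets can be placed side by side on a single recorded episode. All the numbers below (states, rewards, value estimates) are made up for illustration:

```python
gamma = 1.0
# One finished episode: states visited and the reward received at each step (toy data).
states  = [0, 1, 2]           # the episode terminates after the last step
rewards = [1.0, 0.0, 2.0]
V = {0: 0.0, 1: 0.5, 2: 1.0}  # current value estimates (arbitrary)

# Monte Carlo target for s_0: the full return G_0 = r_0 + g*r_1 + g^2*r_2
mc_target = sum(gamma**k * r for k, r in enumerate(rewards))

# TD(0) target for s_0: one real reward plus a bootstrapped estimate of V(s_1)
td_target = rewards[0] + gamma * V[states[1]]
```

The MC target (3.0) is the actual return of this trajectory and is only available at episode end; the TD target (1.5) is available after one step but leans on the current, possibly wrong, estimate \(V(s_1)\).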
3. On-policy vs off-policy (Q18)
Q: Explain the difference between on-policy and off-policy learning. Give one algorithm example for each.
On-policy methods learn about the same policy that generates the data: the behavior policy and the target policy coincide (e.g. SARSA). Off-policy methods learn about a target policy while following a different behavior policy (e.g. Q-learning, whose target policy is greedy while the behavior policy can be ε-greedy). On-policy methods are often simpler but must balance exploration within the very policy they evaluate. Off-policy methods can reuse data (e.g. replay buffers) and learn an optimal policy while exploring, but they can be less stable (e.g. DQN needs careful use of target networks).
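The on-policy/off-policy distinction shows up directly in the targets. A sketch with assumed toy values (the Q-table, the next-state actions, and all numbers are illustrative):

```python
gamma = 0.9
Q = {("s1", "left"): 1.0, ("s1", "right"): 3.0}  # toy action values at next state s'
r = 0.5

# Suppose the exploratory behavior policy actually took "left" in s'.
a_next = "left"

# SARSA (on-policy): uses the action the behavior policy actually took in s'
sarsa_target = r + gamma * Q[("s1", a_next)]

# Q-learning (off-policy): uses the greedy action in s', regardless of behavior
q_target = r + gamma * max(Q[("s1", a)] for a in ("left", "right"))
```

Here the SARSA target is 1.4 while the Q-learning target is 3.2: Q-learning evaluates the greedy policy even when exploration chose a worse action.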
4. Q-learning update rule (Q19)
Q: Write the Q-learning update rule for a transition \((s, a, r, s')\).
\(Q(s,a) \leftarrow Q(s,a) + \alpha \bigl[r + \gamma \max_{a'} Q(s', a') - Q(s,a)\bigr]\). Here \(\alpha\) is the step size (learning rate), \(\gamma\) is the discount factor, and \(r + \gamma \max_{a'} Q(s', a')\) is the TD target. The term in brackets is the TD error: how much the target differs from the current estimate. Given an observed transition \((s, a, r, s')\), the target for \(Q(s,a)\) is “immediate reward plus discounted value of the best we can do from \(s'\)”, and we move the estimate \(Q(s,a)\) a fraction \(\alpha\) of the way toward that target. Over many such updates (with sufficient exploration), \(Q\) converges to the optimal action-value function \(Q^*\).
Code example: Q-learning update
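A minimal tabular sketch of the update (the dictionary-based Q-table and the function and argument names are assumptions for illustration):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One Q-learning step on a tabular Q stored as a dict keyed by (state, action)."""
    # TD target: immediate reward plus discounted value of the best action from s'
    target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
    # Move Q(s,a) a fraction alpha of the way toward the target (bracket is the TD error)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```

Repeatedly calling this on observed transitions, while the behavior policy stays exploratory (e.g. ε-greedy), is the whole tabular algorithm.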
Explanation
We compute the TD target as \(r + \gamma \max_{a'} Q(s',a')\); here \(actions\) is the set of possible actions so we can take the max over \(Q(s',a')\). Then we perform the update: new estimate = old estimate + \(\alpha\) × (target − old estimate). This is one step of Q-learning. In practice, \(Q\) might be a table (tabular) or a neural network (DQN); the same formula applies, with the network outputting \(Q(s,a)\) and the update done via gradient descent on the squared TD error.
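The connection to DQN in the last sentence can be checked by hand: a gradient step on the squared TD error \(\tfrac{1}{2}(\text{target} - Q)^2\) with step size \(\alpha\), treating the target as a constant, reproduces the tabular update exactly, since the gradient with respect to \(Q(s,a)\) is \(-(\text{target} - Q(s,a))\). A quick sketch with assumed numbers:

```python
alpha, gamma = 0.5, 0.9
q_old, r, max_q_next = 1.0, 1.0, 2.0
target = r + gamma * max_q_next  # TD target, held fixed during the step

# Tabular Q-learning step
q_tabular = q_old + alpha * (target - q_old)

# Gradient step on L = 0.5 * (target - q)^2:  dL/dq = -(target - q)
grad = -(target - q_old)
q_grad = q_old - alpha * grad

# Both give the same new estimate
```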
Professor’s hints
- Q-learning is off-policy because the target uses \(\max_{a'} Q(s',a')\) (greedy) while the behavior that produced \((s,a,r,s')\) might be ε-greedy or something else entirely.
- SARSA uses \(Q(s', a')\) where \(a'\) is the action actually taken in \(s'\); that makes it on-policy.
- In tabular settings, “convergence” of Q-learning to \(Q^*\) requires that all state-action pairs are visited infinitely often (e.g. with ε-greedy exploration).
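The ε-greedy exploration mentioned in the hints can be sketched as follows (the function name and dict-based Q-table are illustrative assumptions):

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    # With probability eps explore uniformly at random; otherwise act greedily w.r.t. Q.
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```

With \(\varepsilon > 0\) every action keeps a nonzero probability in every state, which is how the "visited infinitely often" condition is usually arranged in practice.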
Common pitfalls
- Using the wrong target: Q-learning uses \(\max_{a'} Q(s',a')\); SARSA uses \(Q(s', a')\) for the action \(a'\) that was taken. Mixing them changes the algorithm and what it converges to.
- Forgetting the discount: The target must be \(r + \gamma \max_{a'} Q(s',a')\), not \(r + \max_{a'} Q(s',a')\).
- Confusing “tabular” with “model-free”: Tabular means we store one value per state (or state-action). We can do tabular MC or tabular TD; both are model-free. DP is typically tabular but requires a model.