Take this checkpoint after completing Chapters 1–5 (RL framework, bandits, MDPs, reward hypothesis, value functions). All 5 should feel manageable — if any are unclear, re-read the relevant chapter before continuing.
Q1. Name the five components of an MDP. Write the tuple.
Answer
An MDP is the 5-tuple (S, A, P, R, γ): the state space S, the action space A, the transition function P(s'|s,a), the reward function R(s,a,s'), and the discount factor γ.
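As a concrete aid for Q1, the 5-tuple can be written down as a named structure. This is only an illustrative sketch; the toy states, action, and numbers are made up, not from the text.

```python
# The MDP 5-tuple (S, A, P, R, gamma) as a named structure.
from collections import namedtuple

MDP = namedtuple("MDP", ["states", "actions", "transitions", "rewards", "gamma"])

# A toy instance: two states, one action, made-up dynamics.
toy = MDP(
    states={"s0", "s1"},
    actions={"move"},
    transitions={("s0", "move"): {"s1": 1.0}, ("s1", "move"): {"s1": 1.0}},
    rewards={("s0", "move", "s1"): 1.0, ("s1", "move", "s1"): 0.0},
    gamma=0.9,
)
print(toy.gamma)  # 0.9
```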
Q2. In a gridworld, the agent is at (2,1) and moves right to (2,2) which is the goal. What is: (a) the state before the action, (b) the action, (c) the next state, (d) the reward (assuming +1 at goal, -1 per step)?
Answer
(a) The state before the action is (2,1). (b) The action is "right". (c) The next state is (2,2). (d) Reading the reward spec additively, this transition earns the −1 step cost plus the +1 goal reward, for a net reward of 0.
Q3. The discount factor γ = 0.9. A reward of +1 arrives after 3 steps. What is its present value?
Answer
0.9³ × 1 = 0.729. In general, a reward arriving k steps in the future is discounted by γ^k.
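The discounting arithmetic in Q3 can be checked in a couple of lines (`gamma`, `r`, and `k` are just the quiz's numbers):

```python
# Present value of a reward r arriving k steps in the future: r * gamma**k.
gamma = 0.9
r, k = 1.0, 3
present_value = r * gamma ** k
print(round(present_value, 3))  # 0.729
```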
Q4. V^π(s) is defined as “the expected discounted return from state s, following policy π.” Write the Bellman expectation equation for V^π(s) in words (no need for full notation).
Answer
V^π(s) = the expected immediate reward (averaged over actions and transitions) plus γ times the expected value of the next state, averaged over the same actions and transitions under policy π.
Formally: V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')].
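The equation above can be turned directly into iterative policy evaluation. The sketch below runs repeated Bellman expectation backups on a made-up two-state MDP; the transition table, rewards, and policy are illustrative assumptions, not from the text.

```python
# Iterative policy evaluation via the Bellman expectation equation on a tiny,
# made-up 2-state MDP. State 1 is terminal (zero-reward self-loop).
# P[s][a] is a list of (prob, next_state, reward) triples -- all illustrative.
gamma = 0.9
P = {
    0: {0: [(0.8, 1, 1.0), (0.2, 0, -1.0)]},  # action usually reaches the goal
    1: {0: [(1.0, 1, 0.0)]},                   # terminal self-loop, no reward
}
pi = {0: {0: 1.0}, 1: {0: 1.0}}  # deterministic one-action policy

def bellman_backup(V, s):
    """V^pi(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + gamma * V(s')]."""
    return sum(
        pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
        for a in P[s]
    )

V = {s: 0.0 for s in P}
for _ in range(100):  # sweep until (numerically) converged
    V = {s: bellman_backup(V, s) for s in P}
print(round(V[0], 4))
```

For this toy MDP the fixed point solves V(0) = 0.8·(1 + 0) + 0.2·(−1 + 0.9·V(0)), giving V(0) = 0.6/0.82 ≈ 0.7317.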
Q5. What is the difference between V(s) and Q(s,a)?
Answer
- V(s): value of state s — expected return starting from s and following policy π.
- Q(s,a): value of taking action a in state s — expected return after taking action a in state s, then following policy π.
Relationship: V(s) = Σ_a π(a|s) Q(s,a) (V averages Q over actions under the policy).
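The V-from-Q relationship is easy to verify numerically. The policy probabilities and Q-values below are hypothetical numbers for a single state, not from the text:

```python
# Checking V(s) = sum_a pi(a|s) * Q(s,a) at one state s with made-up numbers.
pi = {"left": 0.25, "right": 0.75}   # hypothetical policy at state s
Q = {"left": 2.0, "right": 4.0}      # hypothetical action values at s
V = sum(pi[a] * Q[a] for a in pi)
print(V)  # 0.25*2.0 + 0.75*4.0 = 3.5
```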
All 5 correct? Continue to Chapter 6 (Bellman Equations). Stuck on 2 or more? Re-read Chapters 3–5.