Phase 6 Assessment: RL Foundations

Use this quiz after completing Volume 1 and Volume 2 (or the Phase 6 mini-project). If you can answer at least 12 of 15 correctly, you are ready for Phase 7 and Volume 3.

1. RL framework

Q: Name the four main components of an RL system (agent, environment, and two more). What is a state?

Answer

Agent, environment, action, reward. State: a representation of the current situation the agent uses to choose actions.

2. Return

Q: For rewards [0, 0, 1] and \(\gamma = 0.9\), compute the discounted return \(G_0\) from step 0.

Answer

Step 1: Discounts are \(\gamma^0=1, \gamma^1=0.9, \gamma^2=0.81\). Step 2: \(G_0 = r_0 + \gamma r_1 + \gamma^2 r_2 = 0 + 0.9\cdot 0 + 0.81\cdot 1 = 0.81\). In RL: The return from step 0 is the sum of discounted future rewards; RL algorithms aim to maximize its expected value.
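
The same calculation can be sketched in a few lines of Python. The function name `discounted_return` is illustrative and not part of the course material; it assumes a finite list of rewards.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = sum_k gamma^k * r_k for a finite reward sequence."""
    g = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


g0 = discounted_return([0, 0, 1], gamma=0.9)
print(round(g0, 2))  # 0.81
```

Working backwards avoids computing powers of \(\gamma\) explicitly and gives the return from every intermediate step along the way.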

3. Markov property

Q: What is the Markov property? Why is it important for planning?

Answer

The future depends only on the current state and action, not on earlier history. It allows us to plan using only the current state (no need to remember the full history).

4. Bellman equation

Q: Write the Bellman expectation equation for \(V^\pi(s)\) in one line (in terms of \(\pi\), \(P\), \(r\), \(\gamma\), \(V^\pi\)).

Answer

\(V^\pi(s) = \sum_a \pi(a|s) \sum_{s',r} P(s',r|s,a) [r + \gamma V^\pi(s')]\).
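
As a rough illustration, the right-hand side of this equation can be computed for a single state as below. The dictionary layout (`policy[s][a]` for \(\pi(a|s)\), `P[s][a]` as a list of `(prob, next_state, reward)` tuples) and the two-state MDP are assumptions made for this sketch only.

```python
def bellman_backup(s, V, policy, P, gamma):
    """Right-hand side of the Bellman expectation equation for state s.

    policy[s][a] is pi(a|s); P[s][a] is a list of (prob, next_state, reward).
    """
    total = 0.0
    for a, pi_a in policy[s].items():
        for prob, s_next, r in P[s][a]:
            total += pi_a * prob * (r + gamma * V[s_next])
    return total


# Hypothetical MDP: from "A", action "go" reaches terminal "T" with reward 1,
# action "stay" returns to "A" with reward 0.
P = {"A": {"go": [(1.0, "T", 1.0)], "stay": [(1.0, "A", 0.0)]}}
policy = {"A": {"go": 0.5, "stay": 0.5}}
V = {"A": 0.0, "T": 0.0}

value = bellman_backup("A", V, policy, P, gamma=0.9)
print(value)  # 0.5
```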

5. Discount factor

Q: What happens when \(\gamma = 0\)? When \(\gamma = 1\) in a continuing task (no terminal state)?

Answer

\(\gamma=0\): the agent is myopic (only the immediate reward matters). \(\gamma=1\): all future rewards are weighted equally; in a continuing task the return can be infinite unless we use average reward or another formulation.
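
A small numerical sketch makes the three regimes concrete. It assumes a constant reward of 1 at every step, which is an illustrative choice and not taken from the quiz; `partial_return` is a hypothetical helper name.

```python
def partial_return(gamma, steps, reward=1.0):
    """Discounted sum of a constant reward over the first `steps` steps."""
    return sum(reward * gamma**k for k in range(steps))


print(partial_return(0.0, 1000))  # 1.0: only the immediate reward counts
print(partial_return(0.9, 1000))  # close to 1 / (1 - 0.9) = 10
print(partial_return(1.0, 1000))  # 1000.0: grows without bound as steps increase
```

For \(\gamma < 1\) the sum is bounded by \(1/(1-\gamma)\) times the largest reward, which is why discounting keeps returns finite in continuing tasks.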

6. MC vs TD

Q: What is the key difference between Monte Carlo and TD learning in how they update the value estimate?

Answer

MC uses the full return from that state to the end of the episode. TD uses bootstrapping: immediate reward plus the current estimate of the next state’s value (no need to wait for episode end).
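
The two update rules can be sketched side by side. This assumes a tabular value function stored in a dict and a constant step size `alpha`; the function names are illustrative.

```python
def mc_update(V, state, full_return, alpha):
    """Monte Carlo: move V(state) toward the observed full return G."""
    V[state] += alpha * (full_return - V[state])


def td0_update(V, state, reward, next_state, alpha, gamma):
    """TD(0): move V(state) toward r + gamma * V(next_state) (bootstrapping)."""
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])


# MC needs the return observed at the end of the episode (here 0.81).
V_mc = {"s": 0.0}
mc_update(V_mc, "s", full_return=0.81, alpha=0.5)

# TD(0) needs only one transition and the current estimate of the next state.
V_td = {"s": 0.0, "s_next": 1.0}
td0_update(V_td, "s", reward=0.0, next_state="s_next", alpha=0.5, gamma=0.9)

print(V_mc["s"], V_td["s"])
```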

7. SARSA vs Q-learning

Q: Write the Q-learning update for a transition \((s, a, r, s')\). How does the TD target differ from SARSA’s target?

Answer

\(Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]\). Q-learning uses \(\max_{a'} Q(s',a')\); SARSA uses \(Q(s', a')\) where \(a'\) is the action actually taken in \(s'\) (on-policy).
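
A minimal sketch of the two targets, assuming a nested dict `Q[state][action]` and a non-terminal next state (for a terminal next state the target would be just `r`). All names and numbers here are illustrative.

```python
def q_learning_target(Q, r, s_next, gamma):
    """Off-policy target: bootstrap from the best action in s_next."""
    return r + gamma * max(Q[s_next].values())


def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: bootstrap from the action actually taken in s_next."""
    return r + gamma * Q[s_next][a_next]


def td_update(Q, s, a, target, alpha):
    """Shared update rule; only the target differs between the two methods."""
    Q[s][a] += alpha * (target - Q[s][a])


Q = {
    "s": {"left": 0.0, "right": 0.0},
    "s_next": {"left": 1.0, "right": 2.0},
}
# Suppose the behaviour policy explored and took "left" in s_next.
t_q = q_learning_target(Q, r=0.0, s_next="s_next", gamma=0.9)
t_sarsa = sarsa_target(Q, r=0.0, s_next="s_next", a_next="left", gamma=0.9)
print(t_q, t_sarsa)  # 1.8 0.9
```

The two targets coincide whenever the action taken in the next state happens to be the greedy one.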

8. On-policy vs off-policy

Q: Is Q-learning on-policy or off-policy? Is SARSA? Explain in one sentence each.

Answer

Q-learning: off-policy (learns about the greedy policy while following an exploratory policy, e.g. ε-greedy). SARSA: on-policy (learns about the same policy that generates its actions, e.g. ε-greedy).

9. Policy iteration

Q: Name the two steps of policy iteration. What do we do in each?

Answer

Policy evaluation: compute \(V^\pi\) for the current policy (iterate the Bellman expectation backup until convergence). Policy improvement: make the policy greedy w.r.t. the current V (or Q). Repeat until the policy no longer changes.
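
The two steps can be sketched on a tiny example. The three-state chain below (states 0 and 1, terminal state 2, deterministic policies, `P[s][a]` as a list of `(prob, next_state, reward)`) is a made-up MDP for illustration, not one from the course.

```python
GAMMA = 0.9
TERMINAL = 2
# Hypothetical chain: moving "right" from state 1 reaches the terminal state
# with reward 1; every other transition gives reward 0.
P = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
}


def q_value(s, a, V):
    """Expected one-step return of taking a in s, then following V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def evaluate(policy, tol=1e-10):
    """Policy evaluation: sweep the Bellman expectation backup until stable."""
    V = {0: 0.0, 1: 0.0, TERMINAL: 0.0}
    while True:
        delta = 0.0
        for s in P:
            v_new = q_value(s, policy[s], V)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


def improve(V):
    """Policy improvement: act greedily with respect to V."""
    return {s: max(P[s], key=lambda a: q_value(s, a, V)) for s in P}


def policy_iteration():
    policy = {s: "left" for s in P}  # arbitrary starting policy
    while True:
        V = evaluate(policy)
        new_policy = improve(V)
        if new_policy == policy:  # stable policy: stop
            return policy, V
        policy = new_policy


pi_star, V_star = policy_iteration()
print(pi_star, V_star[0], V_star[1])
```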

10. Value iteration

Q: How does value iteration differ from policy iteration? What update do we do in value iteration?

Answer

Value iteration does not maintain an explicit policy; it iterates \(V_{k+1}(s) = \max_a \sum_{s',r} P(s',r|s,a)[r + \gamma V_k(s')]\) until convergence, then derives the greedy policy from \(V\).
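
A minimal sketch of this update, using synchronous sweeps so that each \(V_{k+1}\) is computed only from \(V_k\). The three-state chain and the `(prob, next_state, reward)` layout are assumptions made for this example.

```python
def backup(P, s, a, V, gamma):
    """Expected one-step return of taking a in s, then following V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, terminal, gamma, tol=1e-10):
    V = {s: 0.0 for s in P}
    V[terminal] = 0.0  # terminal value stays at 0
    while True:
        V_next = dict(V)
        for s in P:
            # Bellman optimality backup: max over actions, no explicit policy.
            V_next[s] = max(backup(P, s, a, V, gamma) for a in P[s])
        delta = max(abs(V_next[s] - V[s]) for s in V)
        V = V_next
        if delta < tol:
            break
    # Only now derive the greedy policy from the converged values.
    greedy = {s: max(P[s], key=lambda a: backup(P, s, a, V, gamma)) for s in P}
    return V, greedy


# Hypothetical chain: "right" from state 1 reaches terminal state 2 with reward 1.
P = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
}
V, greedy = value_iteration(P, terminal=2, gamma=0.9)
print(V[0], V[1], greedy)
```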

11. First-visit MC

Q: In first-visit MC prediction, how many returns do we use per state per episode? What about every-visit MC?

Answer

First-visit: at most one return per state per episode (the return from the first time we visit that state). Every-visit: we use the return from every time we visit that state in the episode.
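
The difference shows up as soon as a state repeats within an episode. The sketch below collects the returns each variant would use; the episode, the discount of 0.5, and the function name `episode_returns` are illustrative assumptions.

```python
def episode_returns(states, rewards, gamma, first_visit):
    """Collect the returns used for each state in one episode.

    states[t] is the state at step t; rewards[t] is the reward received
    after leaving it.
    """
    # Compute G_t for every step by working backwards through the episode.
    G = [0.0] * len(states)
    g = 0.0
    for t in reversed(range(len(states))):
        g = rewards[t] + gamma * g
        G[t] = g

    used = {}
    for t, s in enumerate(states):
        if first_visit and s in used:
            continue  # first-visit: ignore later visits to the same state
        used.setdefault(s, []).append(G[t])
    return used


states = ["A", "B", "A"]  # state "A" is visited twice
rewards = [0.0, 0.0, 1.0]

first = episode_returns(states, rewards, gamma=0.5, first_visit=True)
every = episode_returns(states, rewards, gamma=0.5, first_visit=False)
print(first)  # {'A': [0.25], 'B': [0.5]}
print(every)  # {'A': [0.25, 1.0], 'B': [0.5]}
```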

12. Function approximation

Q: Why is function approximation needed for large or continuous state spaces?

Answer

Tabular methods store one value per state (or state-action); the number of states can be huge or infinite, so we cannot store or visit them all. Function approximation uses a parameterized function (e.g. linear or neural network) so a fixed number of parameters represent values for all states and generalize from seen to unseen states.
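
As a rough sketch of the idea, the example below fits a linear value function with two parameters to a scalar state. The feature map, the training targets (taken from the made-up function \(V(s) = 2s\)), and the step size are all assumptions for illustration.

```python
def features(s):
    """Hypothetical feature map for a scalar state: a bias term and the state."""
    return [1.0, s]


def predict(w, s):
    """Linear value estimate: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, features(s)))


def sgd_update(w, s, target, alpha):
    """One gradient step on the squared error (target - V(s))^2 / 2."""
    error = target - predict(w, s)
    return [wi + alpha * error * xi for wi, xi in zip(w, features(s))]


w = [0.0, 0.0]  # two parameters, regardless of how many states exist
# Train on two states only, with targets from the made-up function V(s) = 2s.
for _ in range(2000):
    w = sgd_update(w, 0.0, 0.0, alpha=0.1)
    w = sgd_update(w, 1.0, 2.0, alpha=0.1)

# s = 0.5 was never seen in training; the estimate comes from generalization.
print(predict(w, 0.5))
```

The parameter count stays at two even though the state is continuous, and the estimate at the unseen state lands near 1.0 because the linear form interpolates between the training states.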

13. Exploration

Q: Give one example of an exploration strategy used in tabular RL. Why is exploration necessary?

Answer

ε-greedy: with probability ε take a random action, otherwise take the greedy action (the one with the highest current Q-value). Exploration is necessary so we try all actions and learn their values; otherwise we might stick to a suboptimal action forever.
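
A minimal sketch of ε-greedy action selection, assuming the Q-values for the current state are held in a dict keyed by action. The function name and example values are illustrative.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from a dict {action: Q-value}."""
    if rng.random() < epsilon:
        # Explore: uniform random action (may also pick the greedy one).
        return rng.choice(list(q_values))
    # Exploit: action with the highest current estimate.
    return max(q_values, key=q_values.get)


q_values = {"left": 0.2, "right": 0.7}
print(epsilon_greedy(q_values, epsilon=0.0))  # right (always greedy)
print(epsilon_greedy(q_values, epsilon=0.1))  # usually right, sometimes left
```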

14. Dyna-Q

Q: In Dyna-Q, what is the “model”? How does planning with the model help sample efficiency?

Answer

The model is a learned representation of the environment (e.g. a mapping (s, a) → (s', r)). We can simulate transitions from the model and perform Q-updates on them without taking real environment steps, so we get more learning per real step.
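
The effect on sample efficiency can be sketched with a single real transition. This assumes a deterministic environment, a dict-based Q-table and model, and hypothetical names throughout; it is a simplified sketch of the idea, not a full agent.

```python
import random


def q_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """One Q-learning update; unseen (state, action) pairs default to 0."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)


def dyna_q_step(Q, model, transition, actions, alpha, gamma, n_planning, rng):
    s, a, r, s_next = transition
    # 1. Direct RL: learn from the real transition.
    q_update(Q, s, a, r, s_next, actions, alpha, gamma)
    # 2. Model learning: remember what (s, a) led to (deterministic model).
    model[(s, a)] = (r, s_next)
    # 3. Planning: replay simulated transitions drawn from the model.
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = rng.choice(list(model.items()))
        q_update(Q, ps, pa, pr, ps_next, actions, alpha, gamma)


actions = ["go", "stay"]
real_step = ("A", "go", 1.0, "T")  # one real environment step

Q_plain, Q_dyna = {}, {}
dyna_q_step(Q_plain, {}, real_step, actions, 0.5, 0.9, n_planning=0, rng=random.Random(0))
dyna_q_step(Q_dyna, {}, real_step, actions, 0.5, 0.9, n_planning=5, rng=random.Random(0))
print(Q_plain[("A", "go")], Q_dyna[("A", "go")])  # 0.5 0.984375
```

With the same single real step, the planning updates move the estimate much closer to the true value of 1.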

15. Scaling

Q: For a 10×10 grid with 4 actions, how many entries does a tabular Q-table have? Why is this a problem for a 100×100 grid?

Answer

Step 1: Q-table size = states × actions = 100 × 4 = 400 entries. Step 2: For 100×100: 10,000 × 4 = 40,000 entries (still feasible to store, but 100 times larger, and every entry must be visited often enough to learn its value). Why it’s a problem: the table grows with the number of states, and for continuous or very large discrete spaces there are infinitely many or huge numbers of states; we cannot store or visit them all, so we need function approximation (e.g. a neural net) with a fixed number of parameters.
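
The arithmetic is simple enough to check directly. The helper name `q_table_entries` is illustrative, and it assumes one state per grid cell.

```python
def q_table_entries(width, height, n_actions):
    """Number of Q-values in a tabular Q-function for a grid world."""
    return width * height * n_actions


print(q_table_entries(10, 10, 4))    # 400
print(q_table_entries(100, 100, 4))  # 40000
```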

Next step: If you passed, go to Phase 7 (Deep RL) and Volume 3.
