Use this self-check after completing Volumes 6–10 (Phase 8). These questions test conceptual understanding and the ability to connect ideas across topics.
1. Model-based vs model-free
Q: A robot has a perfect model of its environment. Should it use model-based or model-free RL, and why?
Answer
Model-based. With a perfect model there is no model bias, so the agent can plan (e.g. via rollouts or tree search) rather than learning purely from trial and error. This is far more sample-efficient: planning against the model replaces expensive real-world interaction.
2. Compounding error
Q: Why do multi-step model rollouts suffer from compounding error, and how does MBPO address this?
Answer
Each predicted step is fed back into the model as input for the next prediction, so one-step model errors accumulate over the rollout horizon. MBPO addresses this with short branched rollouts: it starts rollouts from real states sampled from the replay buffer and keeps the horizon k small, bounding the accumulated model error.
3. MCTS phases
Q: Name and briefly describe the four phases of Monte Carlo Tree Search.
Answer
- Selection: traverse the tree from root to a leaf using UCB (or similar) to balance exploration/exploitation.
- Expansion: add one or more child nodes to the selected leaf.
- Simulation (rollout): from the new node, run a (random or heuristic) policy to a terminal state.
- Backpropagation: update visit counts and value estimates along the path from the new node back to the root.
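The selection and backpropagation phases above can be sketched in a few lines. `Node`, `ucb_score`, and `select` are illustrative names for this sketch, not from any particular MCTS library:

```python
import math

class Node:
    """Minimal MCTS node; children keyed by action (illustrative structure)."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}   # action -> Node
        self.visits = 0
        self.value_sum = 0.0

def ucb_score(parent, child, c=1.4):
    # UCB1: exploitation (mean value) + exploration (visit-count bonus)
    if child.visits == 0:
        return float("inf")
    exploit = child.value_sum / child.visits
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return exploit + explore

def select(node):
    # Selection phase: descend from the root, always taking the child
    # with the highest UCB score, until reaching a leaf.
    while node.children:
        node = max(node.children.values(), key=lambda ch: ucb_score(node, ch))
    return node

def backpropagate(node, value):
    # Backpropagation phase: update visit counts and value sums
    # along the path from the evaluated node back to the root.
    while node is not None:
        node.visits += 1
        node.value_sum += value
        node = node.parent
```

Unvisited children get an infinite UCB score, so each child is tried at least once before the mean-value term matters.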
4. Hard exploration
Q: Standard epsilon-greedy fails on Montezuma’s Revenge (sparse reward). Why? Name one method designed for hard exploration.
Answer
Epsilon-greedy explores with undirected random actions, so under sparse rewards the agent almost never stumbles onto the first reward and receives no learning signal at all. Methods designed for hard exploration include intrinsic-motivation bonuses such as RND (Random Network Distillation), count-based exploration bonuses, and Go-Explore.
5. Offline RL distribution shift
Q: Why is distribution shift dangerous in offline RL? What does CQL do to address it?
Answer
The learned Q-function is only reliable on actions covered by the dataset. If the policy queries out-of-distribution actions, Q-value errors (typically overestimates) go uncorrected — no new data can be collected offline — and the policy learns to exploit them. CQL (Conservative Q-Learning) adds a regulariser that pushes Q-values down on out-of-distribution actions and up on dataset actions, yielding a conservative (lower-bound) value estimate.
6. Behavioral cloning vs DAgger
Q: What is the key difference between behavioral cloning and DAgger? When does behavioral cloning fail?
Answer
Behavioral cloning trains on a fixed dataset of expert (state, action) pairs — supervised learning. It fails due to distribution shift: the learned policy makes small errors that put it in states not seen in training, where it makes larger errors, leading to compounding failures.
DAgger fixes this by iteratively running the current policy, querying the expert for correct actions in the newly visited states, and adding these to the training set. This reduces distribution shift by training on states the learner actually visits.
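The DAgger loop described above can be sketched as follows, assuming a toy `env_reset`/`env_step` environment, an `expert_action` oracle, and a supervised `train` routine — all hypothetical placeholders, not a real library API:

```python
def dagger(env_reset, env_step, expert_action, train, policy,
           iters=5, horizon=20):
    """Minimal DAgger sketch with hypothetical helper functions.

    Rolls out the *current* policy, labels visited states with the
    expert's action, aggregates, and retrains - so the training set
    covers states the learner actually reaches.
    """
    dataset = []
    for _ in range(iters):
        s = env_reset()
        for _ in range(horizon):
            a = policy(s)                          # act with current policy
            dataset.append((s, expert_action(s)))  # label with the expert
            s = env_step(s, a)
        policy = train(dataset)                    # supervised retraining
    return policy
```

The key line is that the action executed (`policy(s)`) differs from the action stored (`expert_action(s)`): behavioral cloning would both execute and record the expert's trajectory instead.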
7. Multi-agent credit assignment
Q: In CTDE (centralized training, decentralized execution), what does the critic have access to during training that the actors don’t have during execution?
Answer
During training the centralized critic conditions on the global state and on all agents' observations and actions. At execution each actor must act from its own local observation only.
8. RLHF pipeline
Q: Describe the three-step RLHF pipeline used to align LLMs.
Answer
- Supervised fine-tuning (SFT): fine-tune the base LLM on high-quality demonstration data.
- Reward model training: collect human preference comparisons (pairs of responses, human picks better one). Train a reward model to predict human preferences.
- RL fine-tuning: use PPO (or similar) to optimize the LLM’s policy against the reward model, with a KL penalty to prevent the policy from diverging too far from the SFT model.
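The reward-model step above typically uses a Bradley-Terry pairwise loss on preference comparisons. A minimal sketch of that loss (function name illustrative):

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss for reward-model training (sketch).

    -log sigmoid(r_chosen - r_rejected): minimised when the reward model
    scores the human-preferred response above the rejected one.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At a zero margin the loss is log 2; it shrinks toward zero as the reward model separates the preferred response from the rejected one.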
9. DPO vs PPO-RLHF
Q: How does DPO avoid training a separate reward model?
Answer
DPO uses the closed-form solution of the KL-regularised RLHF objective to express the reward implicitly as the β-scaled log-probability ratio between the policy and the reference (SFT) model. Substituting this into the Bradley-Terry preference model collapses reward learning plus RL into a single classification-style loss on preference pairs, so the policy itself acts as the reward model.
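A sketch of the per-pair DPO loss, assuming log-probabilities of the chosen and rejected responses under both the policy and the frozen reference model (argument names illustrative):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss (sketch).

    The implicit reward of a response is the beta-scaled log-ratio of
    policy to reference probabilities, so no separate reward model is
    trained: the loss is -log sigmoid of the implicit reward margin.
    """
    margin = beta * ((logp_chosen - ref_chosen)
                     - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Compare with a PPO-RLHF reward-model loss: the shape is the same Bradley-Terry objective, but the "rewards" here are computed directly from the policy's own log-probabilities.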
10. Safe RL constraint
Q: In safe RL, what is a constraint violation? Give a concrete example in robotics.
Answer
A constraint violation occurs when the policy exceeds a safety constraint, typically formalised as expected cumulative cost above a threshold in a constrained MDP. Robotics example: a manipulator exceeding its joint torque limits, or a mobile robot entering a forbidden zone around a human.
11. Sim-to-real gap
Q: What is the sim-to-real gap, and what are two techniques for reducing it?
Answer
The sim-to-real gap is the discrepancy between a simulator’s physics/dynamics and the real world. Policies trained in simulation may fail on real hardware due to this gap.
Techniques: (1) Domain randomisation — vary simulator parameters (friction, mass, noise) during training so the policy is robust to variation. (2) System identification — calibrate the simulator to match real hardware measurements.
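Domain randomisation amounts to resampling simulator parameters at the start of each training episode. A sketch with illustrative parameter names and ranges, not tuned for any real robot:

```python
import random

def randomized_sim_params(rng=random):
    """Domain randomisation sketch: sample simulator parameters per episode.

    Training across these variations forces the policy to be robust to
    dynamics it will meet on real hardware. Ranges are illustrative only.
    """
    return {
        "friction": rng.uniform(0.5, 1.5),      # contact friction scale
        "mass_scale": rng.uniform(0.8, 1.2),    # link-mass multiplier
        "obs_noise_std": rng.uniform(0.0, 0.05) # sensor noise level
    }
```

In a training loop these parameters would be passed to the simulator's reset before each episode; system identification would instead fit them once to measurements from the real robot.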
12. QMIX factorisation
Q: QMIX factorises the joint Q-function Q_tot as a monotone function of individual Q_i values. What constraint does monotonicity enforce, and why is it useful?
Answer
Monotonicity enforces ∂Q_tot/∂Q_i ≥ 0 for every agent i, implemented by constraining the mixing network's weights to be non-negative. It is useful because it guarantees that the joint argmax of Q_tot coincides with the per-agent argmaxes of the individual Q_i (the IGM property), so decentralized greedy execution is consistent with the centrally trained joint value.
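The monotonicity constraint can be illustrated with a linear mixer whose weights are non-negative. QMIX itself generates state-dependent weights with a hypernetwork; this is only a sketch of the constraint, with illustrative names:

```python
def q_tot(q_values, weights):
    """Monotone mixing sketch: with weights >= 0, dQ_tot/dQ_i >= 0,
    so raising any agent's Q_i can never lower Q_tot. Hence each agent's
    greedy action w.r.t. its own Q_i is also greedy w.r.t. Q_tot."""
    assert all(w >= 0 for w in weights), "monotonicity requires w_i >= 0"
    return sum(w * q for w, q in zip(weights, q_values))
```

With a negative weight, an agent improving its own Q_i could decrease Q_tot, breaking the consistency between decentralized argmaxes and the joint argmax.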
13. Entropy in SAC
Q: What does the entropy term in SAC’s objective encourage? What happens when temperature α → 0?
Answer
The entropy term rewards stochastic policies, encouraging exploration and preventing premature collapse to a deterministic policy. As α → 0 the entropy bonus vanishes, the objective reduces to the standard expected return, and SAC behaves like a conventional (near-deterministic) off-policy actor-critic.
14. Reward hacking
Q: Give one example of reward hacking in a real or hypothetical RL deployment.
Answer
In the CoastRunners boat-racing game, an agent rewarded for score learned to circle endlessly through respawning bonus targets instead of finishing the race — maximising the proxy reward while failing the intended task. (A hypothetical variant: a cleaning robot rewarded for "no visible dirt" that covers its own camera.)
15. Meta-learning
Q: In MAML, what does “meta-learning” mean, and what is the inner loop vs outer loop?
Answer
MAML learns an initialisation θ such that the model can quickly adapt to new tasks with a small number of gradient steps.
Inner loop: for each task, take k gradient steps from θ to get task-specific θ’. This is fast adaptation.
Outer loop: update θ so that the inner-loop adaptation achieves good performance across all tasks. The meta-objective is to find the best initialisation, not the best parameters for any one task.
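The inner/outer loop structure above can be sketched on scalar quadratic tasks, using the first-order (FOMAML) approximation of the meta-gradient. Everything here is illustrative, not a faithful MAML implementation:

```python
def maml_scalar(tasks, theta=0.0, inner_lr=0.1, meta_lr=0.05, steps=200):
    """First-order MAML sketch on scalar quadratic tasks.

    Each task is a target t with loss (theta - t)^2. The inner loop
    adapts theta toward t with one gradient step; the outer loop moves
    the initialisation so one-step adaptation works well across tasks.
    """
    for _ in range(steps):
        meta_grad = 0.0
        for t in tasks:
            # Inner loop: one gradient step on this task's loss
            # (d/dtheta of (theta - t)^2 is 2 * (theta - t)).
            adapted = theta - inner_lr * 2.0 * (theta - t)
            # First-order approximation: gradient of the post-adaptation
            # loss, evaluated at the adapted parameters.
            meta_grad += 2.0 * (adapted - t)
        theta -= meta_lr * meta_grad / len(tasks)  # outer-loop update
    return theta
```

For symmetric targets such as {-1, +1}, the learned initialisation converges to the point from which one inner step reaches either target best (here, 0) — the "best starting point", not the best parameters for any single task.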
Score: 12–15: Strong Phase 8 understanding. 9–11: Review the volumes for the questions you missed. Below 9: Return to the volumes covering the missed topics before moving on to research.