Learning objectives
- Implement a planner using breadth-first search (BFS) for a gridworld with known deterministic dynamics.
- Recover the optimal policy (path to goal) and compare with dynamic programming (value iteration) in terms of computation and result.
- Relate BFS to shortest-path planning in robot navigation.
Concept and real-world RL
When the model is known and deterministic, we can plan without learning: BFS finds the shortest path from start to goal; value iteration computes optimal values for all states. In robot navigation (grid or graph), BFS is used for pathfinding; DP is used when we need values everywhere (e.g. for reward shaping). Both assume the model is correct; in RL we often learn the model or the value function from data.
Where you see this in practice: Game AI pathfinding (A*, BFS); industrial control with known dynamics.
Illustration (BFS vs DP): Planning with BFS finds the shortest path; DP (value iteration) yields the same optimal policy. The chart below shows the number of steps to converge for a small grid (BFS: one run; DP: sweeps).
Exercise: For a simple gridworld with known deterministic dynamics, implement a planner using breadth-first search to find the optimal policy. Compare with dynamic programming.
Professor’s hints
- BFS: from start, expand neighbors (up/down/left/right); track visited; stop when goal is reached. Backtrack to get the action sequence.
- DP: run value iteration until convergence; then extract policy \(\pi(s) = \arg\max_a \sum_{s’} P(s’|s,a)[r + \gamma V(s’)]\).
- Compare: BFS gives one path from start; DP gives policy for every state. For a single start–goal pair, BFS is often cheaper.
Common pitfalls
- Ties: If multiple actions are optimal, BFS may return one path; DP policy may break ties arbitrarily. Be consistent.
- Discount in BFS: BFS finds shortest path (minimum steps); if rewards are -1 per step, that matches. For general rewards, use a cost-sensitive search or DP.
Worked solution (warm-up: planning with a model)
Key idea: With a known or learned model we can plan: e.g. value iteration, policy iteration, or search (BFS/DFS) from the current state. BFS with -1 per step finds the shortest path to the goal. The model gives \(P(s’,r|s,a)\); we use it to compute values or to simulate rollouts without taking real env steps.
Extra practice
- Warm-up: When is BFS preferable to value iteration for finding an optimal policy from a single start state?
- Coding: Implement BFS and value iteration on a 5×5 grid with goal at (4,4). Compare the policy from BFS (from (0,0)) with the policy from DP at (0,0). Do they match?
- Challenge: Add a small probability (e.g. 0.1) of a random transition. Use value iteration only (BFS no longer exact). Compare the optimal policy with the deterministic case.