Chapter 17: Planning and Learning with Tabular Methods
Learning objectives

- Implement a simple model: store \((s,a) \rightarrow (r, s')\) from experience.
- Implement Dyna-Q: after each real environment step, perform \(k\) extra Q-updates using random \((s,a)\) pairs drawn from the model (simulated experience).
- Compare sample efficiency: Dyna-Q (planning + learning) vs. Q-learning (learning only).

Concept and real-world RL

Model-based methods use a learned or given model of the environment (its transition and reward functions). Dyna-Q learns a tabular model from real experience: each time you observe \((s,a,r,s')\), store it. Then, in addition to updating \(Q(s,a)\) from the real transition, you replay random \((s,a)\) pairs from the model, look up the stored \((r,s')\), and apply a Q-learning update. This yields more learning per real environment step (planning). In real applications, learned models are used in model-based RL (e.g., world models, MuZero) to reduce sample complexity; the key idea is reusing past experience for extra updates. ...
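The loop described above can be sketched in a few dozen lines. The following is a minimal illustration of tabular Dyna-Q on a toy deterministic corridor; the environment (`step`), the constants (`N`, `alpha`, `gamma`, `eps`, `k`), and the helper names are illustrative assumptions, not taken from the chapter:

```python
import random
from collections import defaultdict

random.seed(0)

# Toy deterministic corridor: states 0..5, actions 0 = left, 1 = right.
# Reward 1 on reaching the terminal state N-1, else 0. (Illustrative
# environment, not from the chapter.)
N = 6

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
    r = 1.0 if s2 == N - 1 else 0.0
    return r, s2, s2 == N - 1

Q = defaultdict(float)   # Q[(s, a)], zero-initialized
model = {}               # model[(s, a)] = (r, s2, done): the learned model
alpha, gamma, eps, k = 0.5, 0.95, 0.1, 20

def q_update(s, a, r, s2, done):
    """Standard Q-learning update, used for both real and simulated steps."""
    target = r if done else r + gamma * max(Q[(s2, b)] for b in (0, 1))
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def greedy(s):
    # Random tie-breaking so untrained behavior is not stuck on one action.
    return max((0, 1), key=lambda b: (Q[(s, b)], random.random()))

for episode in range(30):
    s, done = 0, False
    for t in range(1000):  # step cap, just as a safety net
        a = random.choice((0, 1)) if random.random() < eps else greedy(s)
        r, s2, done = step(s, a)
        q_update(s, a, r, s2, done)    # direct RL from the real transition
        model[(s, a)] = (r, s2, done)  # model learning: store what happened
        for _ in range(k):             # planning: k simulated updates
            ps, pa = random.choice(list(model))
            pr, ps2, pdone = model[(ps, pa)]
            q_update(ps, pa, pr, ps2, pdone)
        s = s2
        if done:
            break
```

Because the model is deterministic here, storing the last observed \((r, s')\) per \((s,a)\) is exact; with stochastic dynamics you would instead store counts or sampled outcomes. Setting `k = 0` recovers plain Q-learning, which is a convenient way to run the sample-efficiency comparison from the learning objectives.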