Chapter 69: RL² (Reinforcement Learning as an RNN)

Learning objectives
Implement RL²: an RNN policy whose input at each step is the (state, action, reward, done) tuple from the previous step together with the current state, and whose output is the action.
Explain how the RNN hidden state can encode the "learning algorithm", or a belief about the task, from the history of experience.
Train the RNN on multiple POMDP tasks (or tasks with different dynamics or rewards) so that it learns to adapt its behavior from that history.
Evaluate the trained policy on new POMDP tasks and compare it with a non-recurrent policy.
Relate RL² to dialogue (context-dependent responses) and game AI (adapting to different levels or opponents).
Concept and real-world RL ...
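The first objective above can be made concrete with a small sketch. The code below is illustrative only and is not taken from the chapter: `build_rl2_input` and `TinyRNNPolicy` are hypothetical names, the weights are random and untrained, and a plain tanh RNN cell in NumPy stands in for whatever recurrent cell (GRU, LSTM) a real implementation would use. It shows how the previous transition and the current state can be packed into one input vector, and how the hidden state is carried from step to step.

```python
import numpy as np


def build_rl2_input(state, prev_state, prev_action, prev_reward, prev_done, n_actions):
    """Pack the previous transition and the current state into one vector (assumed layout)."""
    action_one_hot = np.zeros(n_actions)
    action_one_hot[prev_action] = 1.0
    return np.concatenate(
        [prev_state, action_one_hot, [prev_reward], [float(prev_done)], state]
    )


class TinyRNNPolicy:
    """Hypothetical tanh-RNN policy with random, untrained weights."""

    def __init__(self, input_dim, hidden_dim, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0.0, 0.1, (hidden_dim, input_dim))
        self.W_h = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        self.W_out = rng.normal(0.0, 0.1, (n_actions, hidden_dim))
        self.hidden_dim = hidden_dim

    def initial_hidden(self):
        # Called once per task (trial); the hidden state is then kept across steps.
        return np.zeros(self.hidden_dim)

    def step(self, x, h):
        # The new hidden state summarizes the history seen so far.
        h_new = np.tanh(self.W_in @ x + self.W_h @ h)
        logits = self.W_out @ h_new
        exp_logits = np.exp(logits - logits.max())  # numerically stable softmax
        return exp_logits / exp_logits.sum(), h_new


# Usage: one policy step on made-up data.
state_dim, n_actions = 3, 2
policy = TinyRNNPolicy(input_dim=2 * state_dim + n_actions + 2, hidden_dim=8, n_actions=n_actions)
h = policy.initial_hidden()
x = build_rl2_input(
    state=np.ones(state_dim),
    prev_state=np.zeros(state_dim),
    prev_action=1,
    prev_reward=0.5,
    prev_done=False,
    n_actions=n_actions,
)
probs, h = policy.step(x, h)
action = int(np.random.default_rng(0).choice(n_actions, p=probs))
```

Because `h` is passed back into `step` on every call, any adaptation to the current task has to live in the hidden state rather than in the weights, which is the sense in which the hidden state can act as the "learning algorithm".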

March 10, 2026 · 4 min · 707 words · codefrydev