Chapter 77: Generative Adversarial Imitation Learning (GAIL)
Learning objectives

- Implement GAIL: train a discriminator D(s, a) to distinguish expert state-action pairs from those generated by the current policy, and use the discriminator output (e.g. log D(s, a)) as the reward signal for a policy-gradient method.
- Train the policy to maximize the discriminator-derived reward (i.e. to fool the discriminator) while the discriminator learns to tell expert from agent.
- Test on a simple task (e.g. CartPole or a MuJoCo environment) and compare imitation quality with behavioral cloning.
- Explain the connection to GANs: the policy plays the role of the generator, and the discriminator provides the learning signal.
- Relate GAIL to robot navigation and game AI, where expert demonstrations are available and the goal is to match the expert's state-action distribution without hand-designed rewards.

Concept and real-world RL ...
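The core discriminator-as-reward loop can be sketched with a toy example. This is a minimal NumPy sketch, not a full GAIL implementation: a logistic-regression discriminator stands in for a neural network, the "expert" and "agent" data are synthetic 1-D state-action pairs invented here for illustration, and the policy-gradient update that would consume `gail_reward` is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(s, a):
    # Hand-crafted (s, a, s*a, bias) features for the toy discriminator.
    return np.array([s, a, s * a, 1.0])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Synthetic "expert": picks a = 1 when s > 0, else a = 0.
states = rng.uniform(-1, 1, size=500)
expert_actions = (states > 0).astype(float)
# Untrained "agent": acts uniformly at random.
agent_actions = rng.integers(0, 2, size=500).astype(float)

X_exp = np.stack([features(s, a) for s, a in zip(states, expert_actions)])
X_agt = np.stack([features(s, a) for s, a in zip(states, agent_actions)])

# Logistic-regression discriminator: D(s, a) = P(expert | s, a),
# trained with binary cross-entropy (expert label 1, agent label 0).
w = np.zeros(4)
lr = 0.5
for _ in range(200):
    d_exp = sigmoid(X_exp @ w)  # should approach 1
    d_agt = sigmoid(X_agt @ w)  # should approach 0
    grad = X_exp.T @ (d_exp - 1) / len(X_exp) + X_agt.T @ d_agt / len(X_agt)
    w -= lr * grad

# GAIL-style reward for the policy: r(s, a) = log D(s, a).
# The reward is high exactly when the discriminator mistakes the
# agent's (s, a) pair for an expert one.
def gail_reward(s, a, eps=1e-8):
    return np.log(sigmoid(features(s, a) @ w) + eps)
```

In full GAIL the two updates are interleaved: a policy-gradient step (e.g. TRPO or PPO) on `gail_reward` alternates with a discriminator step on fresh agent rollouts, mirroring the generator/discriminator alternation in GAN training.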