RLHF
Fit a Bradley-Terry reward model from pairwise comparisons, then train the policy with PPO.
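For reference, the standard Bradley-Terry formulation used for reward modeling can be written as below. The symbols are notation assumed here for illustration, not taken from this page: r_θ is the learned reward, x a prompt, y_w and y_l the preferred and rejected responses, σ the logistic function, and D the dataset of comparisons.

```latex
% Probability that y_w is preferred over y_l under Bradley-Terry
P(y_w \succ y_l \mid x) = \sigma\bigl(r_\theta(x, y_w) - r_\theta(x, y_l)\bigr),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}

% Negative log-likelihood minimized when fitting the reward model
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \Bigl[\log \sigma\bigl(r_\theta(x, y_w) - r_\theta(x, y_l)\bigr)\Bigr]
```

Only the difference of rewards enters the likelihood, so the reward is identified up to an additive constant.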
Simulated preference data; Bradley-Terry reward model; PPO fine-tuning.
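The first two stages of this pipeline can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the course's implementation: items are represented by random feature vectors, the "true" reward is a hypothetical linear function `true_w`, and the reward model is also linear and fitted by plain gradient descent. The function names `simulate_preferences` and `fit_reward_model` are invented for this example, and the PPO stage is not shown.

```python
import numpy as np


def simulate_preferences(features, true_w, n_pairs, rng):
    """Sample random pairs and label a winner using Bradley-Terry probabilities.

    Assumption: the hidden 'true' reward is linear, r(x) = features @ true_w.
    """
    n_items = features.shape[0]
    i = rng.integers(0, n_items, size=n_pairs)
    j = rng.integers(0, n_items, size=n_pairs)
    true_r = features @ true_w
    # P(i beats j) = sigmoid(r_i - r_j)
    p_i_wins = 1.0 / (1.0 + np.exp(-(true_r[i] - true_r[j])))
    i_wins = rng.random(n_pairs) < p_i_wins
    winners = np.where(i_wins, i, j)
    losers = np.where(i_wins, j, i)
    return winners, losers


def fit_reward_model(features, winners, losers, lr=0.5, steps=500):
    """Fit a linear reward by minimizing the mean of -log sigmoid(r_winner - r_loser)."""
    w = np.zeros(features.shape[1])
    diff = features[winners] - features[losers]  # shape (n_pairs, n_features)
    for _ in range(steps):
        margin = diff @ w
        sig = 1.0 / (1.0 + np.exp(-margin))
        # Gradient of -log sigmoid(margin) with respect to w is -(1 - sig) * diff
        grad = -((1.0 - sig)[:, None] * diff).mean(axis=0)
        w -= lr * grad
    return w


rng = np.random.default_rng(0)
features = rng.normal(size=(200, 4))        # hypothetical item features
true_w = np.array([1.0, -2.0, 0.5, 0.0])    # hypothetical ground-truth reward weights
winners, losers = simulate_preferences(features, true_w, 5000, rng)
w_hat = fit_reward_model(features, winners, losers)

# Direction of the learned reward compared with the simulated ground truth
cosine = w_hat @ true_w / (np.linalg.norm(w_hat) * np.linalg.norm(true_w))
print("learned weights:", np.round(w_hat, 2))
print("cosine similarity to true weights:", round(float(cosine), 3))
```

In a full RLHF setup the linear model would typically be replaced by a neural reward model, and its output would serve as the reward signal during PPO fine-tuning, often combined with a KL penalty toward the starting policy.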
Review Volume 7 (Exploration, ICM, RND, Go-Explore, Meta-RL) and preview Volume 8 (Offline RL, Imitation Learning, RLHF).
Review Volume 8 (Offline RL, Imitation Learning, IRL, RLHF) and preview Volume 9 (Multi-Agent RL — cooperation, competition, game theory).
Review Volume 9 (Multi-Agent RL, game theory, QMIX, MAPPO) and preview Volume 10 (Real-World RL — safety, alignment, LLMs, deployment).