Chapter 94: RL in Recommender Systems
Learning objectives

- Build a toy recommender: 100 items and a user model with changing preferences (e.g. a latent state that drifts or has context-dependent taste).
- Define the state (e.g. user history, current context), the action (which item to show), and the reward (e.g. click, watch time, or an engagement score).
- Train an agent with a policy gradient method (e.g. REINFORCE or PPO) to maximize long-term engagement (e.g. cumulative clicks or cumulative reward over a session).
- Compare with a baseline (e.g. random, or greedy with respect to the current preference) and report engagement over episodes.
- Relate the formulation to the recommendation anchor (state = user context, action = item, return = long-term satisfaction).

Concept and real-world RL ...
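The exercise above can be sketched end to end with NumPy alone. This is a minimal, illustrative implementation, not a prescribed one: the latent taste dimension, drift rate, session length, and the linear softmax policy (logits = item features · W · user state) are all assumptions made here for concreteness. The agent is trained with REINFORCE using returns-to-go and a mean-return baseline; a random policy and a greedy policy serve as the comparison baselines.

```python
import numpy as np

rng = np.random.default_rng(0)

N_ITEMS = 100      # action space: which item to show
D = 8              # latent taste dimension (assumed for illustration)
SESSION_LEN = 20   # steps per episode (one user session)

# Fixed item feature vectors (assumed known to the agent).
item_feats = rng.normal(size=(N_ITEMS, D))

class ToyUser:
    """User whose latent taste drifts each step (non-stationary preferences)."""
    def reset(self):
        self.taste = rng.normal(size=D)
        return self.taste.copy()            # state: current user context

    def step(self, item):
        # Click probability from taste-item affinity (logistic model).
        p_click = 1.0 / (1.0 + np.exp(-item_feats[item] @ self.taste))
        reward = float(rng.random() < p_click)   # reward: click (0/1)
        self.taste += 0.1 * rng.normal(size=D)   # preference drift
        return self.taste.copy(), reward

def softmax(z):
    z = z - z.max()                          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def run_episode(W, env, mode="sample"):
    """Roll out one session; mode is 'sample', 'greedy', or 'random'."""
    s = env.reset()
    states, actions, rewards = [], [], []
    for _ in range(SESSION_LEN):
        probs = softmax(item_feats @ (W @ s))
        if mode == "random":
            a = int(rng.integers(N_ITEMS))
        elif mode == "greedy":
            a = int(np.argmax(probs))        # greedy w.r.t. current preference
        else:
            a = int(rng.choice(N_ITEMS, p=probs))
        s2, r = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s2
    return states, actions, rewards

def train(episodes=500, lr=0.05, gamma=0.99):
    """REINFORCE with returns-to-go and a mean-return baseline."""
    env = ToyUser()
    W = np.zeros((D, D))                     # linear policy parameters
    engagement = []                          # clicks per episode
    for _ in range(episodes):
        states, actions, rewards = run_episode(W, env)
        G, returns = 0.0, []                 # discounted returns-to-go
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        returns = returns[::-1]
        baseline = np.mean(returns)
        for s, a, G in zip(states, actions, returns):
            probs = softmax(item_feats @ (W @ s))
            # grad of log softmax policy: (f_a - E_pi[f]) outer s
            grad = np.outer(item_feats[a] - probs @ item_feats, s)
            W += lr * (G - baseline) * grad
        engagement.append(sum(rewards))
    return W, engagement
```

A usage example: train, then compare mean clicks per session of the learned policy against the random and greedy baselines over a handful of evaluation episodes, e.g. `np.mean([sum(run_episode(W, ToyUser(), mode=m)[2]) for _ in range(50)])` for each mode `m`.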