Chapter 94: RL in Recommender Systems

Learning objectives

Build a toy recommender: 100 items, a user model with changing preferences (e.g. latent state that drifts or has context-dependent taste).
Define state (e.g. user history, current context), action (which item to show), and reward (e.g. click, watch time, or engagement score).
Train an agent with a policy gradient method (e.g. REINFORCE or PPO) to maximize long-term engagement (e.g. cumulative clicks or cumulative reward over a session).
Compare with a baseline (e.g. random or greedy to current preference) and report engagement over episodes.
Relate the formulation to the recommendation anchor (state = user context, action = item, return = long-term satisfaction).

Concept and real-world RL

RL in recommender systems treats recommendation as a sequential decision problem: at each step we observe state (user history, context) and choose an action (which item to show); we get a reward (click, watch time, purchase) and the user state may change. The goal is to maximize long-term engagement (discounted sum of rewards), not just immediate reward. A toy setting might have 100 items, a user with a latent preference that can drift or depend on context, and a simple reward (e.g. 1 if click, 0 otherwise). In recommendation, this connects to bandits and MDPs; policy gradient methods can optimize for delayed effects (e.g. diversity, exploration).

Where you see this in practice: RL for recommendation (YouTube, etc.); bandits and MDPs for sequential recommendation; long-term engagement optimization.

Illustration (recommender engagement): An RL agent maximizes long-term clicks; cumulative reward per episode (e.g. 20 steps) improves as the policy learns user preferences. The chart below shows reward per episode.

Exercise: Build a toy recommender with 100 items and a user model that has changing preferences. Train an agent to maximize long-term user engagement (e.g., cumulative clicks). Use a policy gradient method.

Professor’s hints

User model: e.g. latent vector u_t that drifts: u_{t+1} = 0.9u_t + 0.1noise, or u depends on last K items (context). Probability of click for item i: P(click | u, i) = σ(u^T v_i) where v_i is item embedding. So “changing preferences” = u changes over time.
State: Last L recommended items and outcomes (clicks), or summary; or the agent does not see u (partial obs) and only sees history. Action = which of 100 items to show. Reward = 1 if click, 0 else (or watch time).
Policy: Output distribution over 100 items (or over a subset for efficiency). Use REINFORCE or PPO; episode = one user session (e.g. T = 20 steps). Return = sum of rewards in the session.
Baseline: Random recommendation, or greedy (recommend item with highest estimated P(click) under current user model if you have access). Compare cumulative reward per episode.

Common pitfalls

Partial observability: The agent may not see the user’s true preference u; state is then history only. The policy must learn to infer or explore.
Cold start: With 100 items and little data, many items get few clicks; use exploration (e.g. entropy bonus, or prior over items).
Reward design: Click is a simple reward; in practice, long-term satisfaction may require diversity or novelty. You can add a small bonus for recommending less-seen items.

Worked solution (warm-up: recommender as RL)

Key idea: Recommendation can be framed as RL: state = user context (history, profile), action = which item(s) to show, reward = click, watch time, or purchase. We want to maximize long-term engagement (return). Bandits are the single-step case; full RL handles delayed feedback (e.g. user returns later). We need a simulator or logged data (offline/batch RL) and a reward proxy (e.g. click) that correlates with true satisfaction.

Extra practice

Warm-up: Why might we want to maximize long-term engagement instead of only the next click?
Coding: Implement the toy recommender: 100 items, user u_t with drift, P(click)=σ(u^T v_i). State = last 5 (item, click) pairs. Policy = MLP that outputs logits over 100 items. Train with REINFORCE for 1000 episodes (each episode = 20 steps). Plot cumulative reward per episode. Compare with random and greedy (argmax P(click) if u were known).
Challenge: Add diversity reward: small bonus for recommending items that are different from recently recommended (e.g. negative similarity to last K items). Does the policy learn to diversify and does long-term reward improve?
Variant: Add a user fatigue model: P(click) decreases by 10% for each consecutive recommendation from the same category. Does the RL policy adapt to avoid fatigue (explore across categories) compared to the greedy baseline?
Debug: The recommender RL agent learns to show the same popular item repeatedly because it has the highest P(click) and the episode is short. The reward is purely click-based with no penalty for repetition. Implement an explicit repetition penalty and describe why short-horizon greedy recommendation is harmful for long-term user satisfaction.
Conceptual: Maximizing engagement metrics (clicks, watch time) can lead to “filter bubbles” and user polarization. Explain why RL-based recommendation systems are particularly susceptible to this failure mode compared to supervised collaborative filtering, and describe one constraint or objective modification that partially addresses it.