Learning objectives

  • Implement VDN: for a cooperative game, define joint Q as the sum of individual Q-values: Q_tot(s, a_1,…,a_n) = Q_1(o_1, a_1) + … + Q_n(o_n, a_n).
  • Train with a joint reward (e.g. team reward): use TD on Q_tot so that the sum of individual Qs approximates the joint return; backprop distributes the gradient to each Q_i.
  • Compare VDN with IQL (each agent trains Q_i on local reward or team reward without factorization) in terms of learning speed and final return.
  • Explain the limitation of VDN: additivity may not hold for all tasks (e.g. when there are strong synergies or redundancies between agents).
  • Relate VDN to game AI (team games) and robot navigation (multi-robot coordination).

Concept and real-world RL

Value Decomposition Networks (VDN) learn a joint Q-value for cooperative multi-agent settings by summing individual Q-values: Q_tot = Σ_i Q_i(o_i, a_i). The joint reward (e.g. team reward) is used to train Q_tot with standard TD; the gradient flows to each Q_i. This provides a credit assignment mechanism (each Q_i gets a share of the joint signal) while keeping execution decentralized (each agent chooses a_i = argmax Q_i(o_i, a_i)). VDN assumes the joint Q can be additively decomposed; when that holds, it often learns faster than IQL. In game AI and robot navigation, VDN is a simple baseline for cooperative MARL.
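
The additive decomposition and the decentralized greedy step can be sketched as follows (a minimal tabular sketch; the table sizes and function names are illustrative assumptions, not from any library):

```python
import numpy as np

# Tabular per-agent Q-values: Q[i][o_i, a_i] (sizes are illustrative).
rng = np.random.default_rng(0)
n_agents, n_obs, n_actions = 2, 3, 4
Q = [rng.normal(size=(n_obs, n_actions)) for _ in range(n_agents)]

def q_tot(obs, actions):
    """Q_tot(o, a) = sum_i Q_i(o_i, a_i)."""
    return sum(Q[i][obs[i], actions[i]] for i in range(n_agents))

def decentralized_greedy(obs):
    """Each agent maximizes its own Q_i; with an additive Q_tot this
    also maximizes the sum (max of sum = sum of per-agent maxes)."""
    return [int(np.argmax(Q[i][obs[i]])) for i in range(n_agents)]

obs = [0, 1]
greedy = decentralized_greedy(obs)
# Brute-force check over the joint action space:
joint_best = max(
    ([a1, a2] for a1 in range(n_actions) for a2 in range(n_actions)),
    key=lambda a: q_tot(obs, a),
)
assert greedy == joint_best
```

The brute-force check confirms why execution stays decentralized: because Q_tot is a sum of terms that each depend on one agent only, per-agent argmax recovers the joint argmax.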

Where you see this in practice: VDN and QMIX (which generalizes VDN); value decomposition in StarCraft and similar benchmarks.

Illustration (VDN vs IQL): VDN sums individual Q-values to form the joint Q; with a joint reward it often converges faster than IQL. A useful diagnostic is to plot mean return over episodes for both methods on the same axes.

Exercise: Implement VDN for a cooperative game: sum individual Q-values to form a joint Q-value. Train with a joint reward. Compare with IQL.
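
If you want a concrete environment to start from, here is one possible meet-up task (an assumption for illustration, not a prescribed benchmark): two agents on a 1-D line receive a team reward of 1 when they occupy the same cell.

```python
import numpy as np

class MeetUp:
    """Two agents on a line of `size` cells; team reward 1 when they meet."""
    LEFT, STAY, RIGHT = 0, 1, 2

    def __init__(self, size=5, seed=0):
        self.size = size
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.pos = list(self.rng.integers(0, self.size, size=2))
        return tuple(self.pos)

    def step(self, actions):
        for i, a in enumerate(actions):
            # Map action {0,1,2} to movement {-1,0,+1}, clipped to the line.
            self.pos[i] = int(np.clip(self.pos[i] + (a - 1), 0, self.size - 1))
        reward = 1.0 if self.pos[0] == self.pos[1] else 0.0  # joint reward only
        done = reward == 1.0
        return tuple(self.pos), reward, done

env = MeetUp()
obs = env.reset()
obs, r, done = env.step([MeetUp.RIGHT, MeetUp.LEFT])
```

Because the reward is purely joint, both VDN and IQL can be trained on exactly the same signal, which keeps the comparison fair.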

Professor’s hints

  • Individual Q_i: Each agent i has a network Q_i(o_i, a_i). Input: observation (and maybe agent id). Output: scalar for each action (or one scalar for continuous action).
  • Joint Q_tot: Q_tot = Q_1(o_1, a_1) + … + Q_n(o_n, a_n). For TD: target = r_joint + γ max_{a’} Q_tot(s’, a’_1, …, a’_n). The max over the joint action is tractable because each agent’s max is independent: max_{a’} Q_tot = Σ_i max_{a’_i} Q_i(o’_i, a’_i). So each agent can do its argmax locally.
  • Training: Sample batch (s, a_1,…,a_n, r, s’). Compute Q_tot = Σ_i Q_i(o_i, a_i). Target = r + γ Σ_i max_{a’_i} Q_i’(o’_i, a’_i). Loss = (Q_tot - target)^2. Backprop updates all Q_i.
  • IQL baseline: Each agent trains Q_i with the same joint reward (each gets r_joint). No factorization; each does TD on Q_i(o_i, a_i) with target r_joint + γ max Q_i(o’_i, a’_i). Compare learning curves.
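
The training step in the hints above can be sketched with tabular Q_i (the learning rate, table shapes, and the use of target copies are illustrative assumptions):

```python
import numpy as np

gamma, lr = 0.99, 0.1
n_agents, n_obs, n_actions = 2, 3, 2
Q = [np.zeros((n_obs, n_actions)) for _ in range(n_agents)]
Q_target = [q.copy() for q in Q]  # periodically synced target copies

def vdn_update(obs, actions, r_joint, next_obs):
    """One VDN TD update on a single transition."""
    q_tot = sum(Q[i][obs[i], actions[i]] for i in range(n_agents))
    # Per-agent max is valid: the max over the joint action of a sum
    # equals the sum of the per-agent maxes.
    target = r_joint + gamma * sum(
        Q_target[i][next_obs[i]].max() for i in range(n_agents)
    )
    td_error = target - q_tot
    # The gradient of 0.5 * (target - Q_tot)^2 w.r.t. each Q_i is the same
    # shared TD error, so the joint signal is distributed to every agent.
    for i in range(n_agents):
        Q[i][obs[i], actions[i]] += lr * td_error
    return td_error

err = vdn_update(obs=[0, 1], actions=[0, 1], r_joint=1.0, next_obs=[1, 2])
```

The IQL baseline replaces the summed quantities with a per-agent TD error on Q_i alone, using the same r_joint.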

Common pitfalls

  • Greedy action selection: For the TD target, use max over actions. With VDN, the max of the sum is the sum of maxes (per agent), so each agent can compute max_{a_i} Q_i(o’_i, a_i) independently. Do not use a joint max over (a_1,…,a_n) with a single network.
  • Exploration: Both VDN and IQL need exploration (e.g. ε-greedy or noise on actions). Use the same exploration schedule when comparing.
  • Same reward: Give the same joint reward to both methods so the comparison is fair.
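
To keep the exploration pitfall in check, one option is a single ε schedule shared by both methods (the schedule values and function names below are illustrative assumptions):

```python
import numpy as np

def epsilon_by_episode(ep, eps_start=1.0, eps_end=0.05, decay_episodes=2000):
    """Linear decay from eps_start to eps_end, then held constant."""
    frac = min(ep / decay_episodes, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def eps_greedy(q_row, eps, rng):
    """Pick a random action with probability eps, else the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

rng = np.random.default_rng(0)
action = eps_greedy(np.array([0.0, 1.0]), eps=0.0, rng=rng)  # eps=0 -> action 1
```

Passing the same schedule (and ideally the same seeds) to both VDN and IQL removes exploration as a confound in the learning-curve comparison.
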

Worked solution (warm-up: QMIX)
Key idea: QMIX learns Q_tot(s, u) as a monotonic combination of per-agent Q_i(o_i, u_i): Q_tot = f(Q_1, …, Q_n) with ∂Q_tot/∂Q_i ≥ 0. So a greedy joint action is computed by each agent choosing greedily w.r.t. its own Q_i (decentralized execution). The mixing network is trained so that Q_tot fits the true joint return; monotonicity ensures consistency between individual greedy choices and the joint greedy action.
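
A monotonic mixer can be sketched numerically. This is an illustrative simplification, not the paper's full architecture (QMIX uses hypernetworks with biases and an ELU; the shapes and names here are assumptions). The key mechanism is passing state-conditioned weights through abs() so that ∂Q_tot/∂Q_i ≥ 0:

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, state_dim, hidden = 2, 4, 8
W1 = rng.normal(size=(state_dim, n_agents * hidden))
W2 = rng.normal(size=(state_dim, hidden))

def mix(q_values, state):
    """Monotonic mixing of per-agent Q-values into Q_tot."""
    w1 = np.abs(state @ W1).reshape(n_agents, hidden)  # non-negative weights
    h = np.maximum(q_values @ w1, 0.0)                 # monotone activation
    w2 = np.abs(state @ W2)                            # non-negative weights
    return float(h @ w2)

# Monotonicity check: raising any Q_i never lowers Q_tot.
state, q = np.ones(state_dim), np.array([0.2, -0.1])
for i in range(n_agents):
    q_up = q.copy()
    q_up[i] += 0.5
    assert mix(q_up, state) >= mix(q, state)
```

Non-negative weights plus monotone activations guarantee the monotonicity constraint for any state, which is exactly what makes per-agent greedy actions consistent with the joint greedy action.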

Extra practice

  1. Warm-up: Why does summing Q_i to get Q_tot allow each agent to choose its action independently at execution time (a_i = argmax_{a_i} Q_i(o_i, a_i))?
  2. Coding: Implement VDN and IQL on a 2-agent cooperative task (e.g. meet-up or a simple grid with joint goal). Train both for 5k episodes. Plot mean return vs episodes for both. Does VDN converge faster?
  3. Challenge: Design a small cooperative task where the optimal joint Q is not additive (e.g. strong synergy: reward only when both agents take a specific joint action). Implement VDN anyway and see how much it underperforms; then try QMIX (next chapter) and compare.
  4. Variant: Change from 2 to 4 agents in the meet-up task. Does VDN’s additive decomposition scale, or does the gap between VDN and true joint Q grow with more agents and more complex coordination requirements?
  5. Debug: A VDN implementation converges but team return is lower than a centralized controller baseline. Logging shows Q_tot is computed as Q_tot = Q_1 + Q_2, but during the Bellman target computation the individual targets are clipped, causing the sum to not satisfy the additive decomposition. Describe how to compute the correct VDN Bellman target without per-agent clipping.
  6. Conceptual: VDN decomposes the joint value function additively. State the IGM (Individual-Global-Max) property and explain why it is necessary for decentralized execution. Does VDN satisfy IGM? What class of cooperative tasks can it represent exactly?
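
For the challenge above (non-additive joint values), the representation gap can be measured directly in a one-shot game. The sketch below (payoff chosen to match the challenge description; all names are illustrative) computes the best additive fit q1(a1) + q2(a2) by least squares and shows a nonzero residual:

```python
import numpy as np

# Reward 1 only for the joint action (0, 0): a synergy VDN cannot represent.
payoff = np.array([[1.0, 0.0],
                   [0.0, 0.0]])

# Best additive fit q1(a1) + q2(a2) via least squares.
# Design matrix rows: one-hot(a1) concatenated with one-hot(a2).
A, y = [], []
for a1 in range(2):
    for a2 in range(2):
        row = np.zeros(4)
        row[a1] = 1.0
        row[2 + a2] = 1.0
        A.append(row)
        y.append(payoff[a1, a2])
theta, *_ = np.linalg.lstsq(np.array(A), np.array(y), rcond=None)
fit = np.array([[theta[a1] + theta[2 + a2] for a2 in range(2)]
                for a1 in range(2)])
residual = np.max(np.abs(fit - payoff))
print(residual)  # > 0: no additive decomposition reproduces this payoff
```

A strictly positive residual proves the additive family cannot represent the payoff exactly; a monotonic mixer like QMIX widens the representable class, which is the point of the comparison asked for in the challenge.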