Learning objectives
- Implement QMIX: a mixing network that takes agent Q-values (Q_1,…,Q_n) and the global state s and outputs joint Q_tot, with monotonicity constraint ∂Q_tot/∂Q_i ≥ 0 so that argmax over joint action decomposes to per-agent argmax.
- Enforce monotonicity by generating mixing weights with hypernetworks that take s and output positive weights (e.g. absolute value of network outputs).
- Train with TD on Q_tot using the joint reward; backprop through the mixing network to update both mix weights and individual Q_i.
- Test on a cooperative task and compare with VDN and IQL.
- Relate QMIX to game AI (StarCraft, team coordination) and robot navigation (multi-robot).
Concept and real-world RL
QMIX generalizes VDN by using a mixing network to combine individual Q-values into Q_tot instead of a simple sum. The key constraint is monotonicity: ∂Q_tot/∂Q_i ≥ 0 for each i, so that the joint greedy action is (argmax_{a_1} Q_1, …, argmax_{a_n} Q_n)—no need to search over joint action space. The mixing weights are produced by hypernetworks that take the global state s, so the combination can depend on s (e.g. different roles in different states). In game AI (e.g. StarCraft) and robot navigation, QMIX has been used for cooperative tasks where the additive assumption of VDN is too restrictive.
Where you see this in practice: QMIX and variants (e.g. QTRAN); StarCraft Multi-Agent Challenge; monotonic value decomposition.
Illustration (QMIX monotonicity): QMIX mixes agent Q-values with a monotonic mixing network so global argmax equals per-agent argmax. The chart below shows joint return on a cooperative task over training.
Exercise: Implement QMIX: use a mixing network that takes agent Q-values and the global state to produce a joint Q-value, with monotonicity constraints enforced by positive weights (via hypernetworks). Test on a cooperative task.
Professor’s hints
- Individual Q_i: Same as VDN: each agent has Q_i(o_i, a_i). For the chosen actions (a_1,…,a_n), we have scalars Q_1,…,Q_n.
- Mixing network: Input: (Q_1,…,Q_n) and s. First layer: W_1(s) * [Q_1,…,Q_n] + b_1(s), where W_1(s) is from a hypernetwork (MLP that takes s and outputs a matrix). Use absolute value or softplus on hypernetwork outputs so weights are positive. Deeper mixing: repeat with positive weights. Output: scalar Q_tot. Monotonicity: each layer has non-negative weights, so ∂Q_tot/∂Q_i ≥ 0.
- Hypernetwork: Small MLP: s → vec(W) and s → b. Reshape vec(W) to matrix; apply abs() so W ≥ 0.
- TD training: Same as VDN: target = r + γ max Q_tot(s’, ·); max is done per-agent (greedy) thanks to monotonicity. Loss = (Q_tot - target)^2.
- Use a small cooperative task (e.g. 2–3 agents, grid or particle env) first to debug.
Common pitfalls
- Monotonicity broken: If any weight can be negative, monotonicity fails. Use abs() or softplus on all hypernetwork outputs that become weights. For biases, no constraint.
- Gradient flow: Ensure gradients flow through both the mixing network and the individual Q_i. The mixing network parameters and Q_i parameters are all updated.
- State dependency: The hypernetwork must take the full state (or a sufficient statistic) so that Q_tot can vary with s in a non-additive way.
Worked solution (warm-up: QMIX vs VDN)
Key idea: VDN: \(Q_{tot} = \sum_i Q_i\) (additive). QMIX: \(Q_{tot} = f(Q_1, \ldots, Q_n; s)\) with monotonic \(f\) so that argmax of \(Q_{tot}\) is the concatenation of per-agent argmaxes. The mixing weights can depend on state (e.g. hypernetwork), so we can represent non-additive returns. This allows better credit assignment when the value of one agent’s action depends on the state or others.
Extra practice
- Warm-up: Why does monotonicity (∂Q_tot/∂Q_i ≥ 0) guarantee that the joint greedy action is the concatenation of each agent’s greedy action?
- Coding: Implement QMIX with one hidden layer in the mixing network and hypernetwork that outputs positive weights. Test on the same cooperative task as VDN. Plot Q_tot loss and mean return vs steps. Does QMIX outperform VDN?
- Challenge: Implement QTRAN or a non-monotonic mixing network (no positivity constraint). Compare with QMIX: can it represent more general Q_tot? What is the cost at execution (do you need to search over joint actions)?