This page covers value functions and the Bellman equation you need for the preliminary assessment: state-value \(V^\pi(s)\), action-value \(Q^\pi(s,a)\), and the Bellman expectation equation for \(V^\pi\). Back to Preliminary.
Why this matters for RL
Value functions are the expected return from a state (or state-action pair) under a policy. They are the main object we estimate in value-based methods (e.g. TD, Q-learning) and appear in actor-critic as the critic. The Bellman equation is the recursive identity that connects the value at one state to immediate reward and values at successor states; it is the basis of dynamic programming and TD learning.
Learning objectives
Define \(V^\pi(s)\) and \(Q^\pi(s,a)\); write the Bellman expectation equation for \(V^\pi(s)\); interpret it in terms of one-step reward and discounted continuation.
Core concepts
- State-value: \(V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]\) — expected return starting from state \(s\) and following policy \(\pi\).
- Action-value: \(Q^\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]\) — expected return from \(s\), taking action \(a\), then following \(\pi\).
- Bellman equation for \(V^\pi\): The value at \(s\) equals the expected immediate reward plus the discounted expected value at the next state:
\(V^\pi(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\bigl[r + \gamma V^\pi(s')\bigr]\).
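As a concrete sketch, the right-hand side of this equation can be computed as a one-step backup over arrays. The code below is a minimal illustration, assuming the expected-reward form \(r(s,a)\) (equivalent to the four-argument \(p(s',r|s,a)\) after averaging out the reward); the arrays encode the tiny two-state MDP used in the worked problems on this page.

```python
import numpy as np

def bellman_backup(pi, p, r, V, gamma):
    """Right-hand side of the Bellman expectation equation for every state:
    V'(s) = sum_a pi(a|s) [ r(s,a) + gamma * sum_s' p(s'|s,a) V(s') ]."""
    q = r + gamma * p @ V          # q[s, a]: expected one-step return from (s, a)
    return (pi * q).sum(axis=1)    # average over actions under pi

# Two-state example: s1 -> s2 (reward 0), s2 -> s2 self-loop (reward 1).
pi = np.array([[1.0], [1.0]])                 # one action per state
p = np.array([[[0.0, 1.0]], [[0.0, 1.0]]])    # p[s, a, s']
r = np.array([[0.0], [1.0]])                  # r[s, a]
V = np.array([9.0, 10.0])                     # the true values for gamma = 0.9
print(bellman_backup(pi, p, r, V, 0.9))       # the fixed point maps to itself
```

Feeding in the true \(V^\pi\) returns the same values back, which is exactly the fixed-point property discussed below.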
Worked problems (with explanations)
1. Define \(V^\pi\) and \(Q^\pi\) (Q14)
Q: Define the state-value function \(V^\pi(s)\) and the action-value function \(Q^\pi(s,a)\).
\(V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]\) — the expected return (sum of discounted rewards) starting from state \(s\) and following policy \(\pi\) thereafter. It answers: “How good is it to be in state \(s\) if I follow \(\pi\)?” \(Q^\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]\) — the expected return starting from state \(s\), taking action \(a\) at the current step, then following \(\pi\) thereafter. It answers: “How good is it to take action \(a\) in state \(s\) and then follow \(\pi\)?” The subscript \(\pi\) means the expectation is over trajectories generated by following \(\pi\). We have \(V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a)\): the state value is the average of action values weighted by the policy. In control, we often learn \(Q\) and then choose the action that maximizes \(Q(s,a)\); in prediction, we estimate \(V^\pi\) or \(Q^\pi\) for a fixed \(\pi\).
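The identity \(V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a)\) is easy to check numerically. The numbers below are hypothetical, chosen only to illustrate the weighted average:

```python
import numpy as np

# V(s) = sum_a pi(a|s) Q(s,a): the state value is the policy-weighted
# average of the action values. One state s with two actions.
pi_s = np.array([0.7, 0.3])   # pi(a|s): policy probabilities
q_s = np.array([4.0, 2.0])    # Q(s, a): action values
v_s = float(pi_s @ q_s)       # 0.7 * 4.0 + 0.3 * 2.0 = 3.4
print(v_s)
```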
2. Bellman expectation equation (Q15)
Q: Write the Bellman expectation equation for \(V^\pi(s)\) in terms of rewards and next-state values.
\(V^\pi(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a) \bigl[r + \gamma V^\pi(s')\bigr]\). In words: the value of \(s\) under \(\pi\) is the expectation (over actions from \(\pi\) and over next state and reward from the environment) of “immediate reward \(r\) plus discounted value of the next state \(V^\pi(s')\).” This is a recursive identity: it expresses \(V^\pi(s)\) in terms of \(V^\pi(s')\) at successor states, so we can solve for \(V^\pi\) by iteration (iterative policy evaluation) or estimate it from experience (TD learning). The inner sum \(\sum_{s',r} p(s',r|s,a)[r + \gamma V^\pi(s')]\) is the expected one-step return from \((s,a)\); the outer sum averages over actions according to \(\pi(a|s)\).
3. Worked Bellman expansion (tiny MDP)
Q: Consider two states: \(s_1\) and \(s_2\). From \(s_1\) we have one action that gives reward 0 and goes to \(s_2\) with probability 1. From \(s_2\) we have one action that gives reward 1 and goes to \(s_2\) with probability 1 (self-loop). Policy \(\pi\) is deterministic. Let \(\gamma = 0.9\). Write the Bellman equations and solve for \(V^\pi(s_1)\) and \(V^\pi(s_2)\).
From \(s_2\): \(V^\pi(s_2) = 1 + 0.9 V^\pi(s_2)\), so \(V^\pi(s_2) = 1/(1-0.9) = 10\). From \(s_1\): \(V^\pi(s_1) = 0 + 0.9 V^\pi(s_2) = 0.9 \times 10 = 9\). In \(s_2\) we get reward 1 and stay in \(s_2\), so the return from \(s_2\) is \(1 + \gamma + \gamma^2 + \cdots = 1/(1-\gamma) = 10\). In \(s_1\) we get 0 and move to \(s_2\), so the return from \(s_1\) is \(0 + \gamma V^\pi(s_2) = 9\). This illustrates how the Bellman equation ties together values at different states; solving the resulting system (here by substitution) is exactly what policy evaluation does.
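Because the Bellman equations here are linear in the unknown values, the substitution above can also be done as a direct linear solve. A minimal sketch for this tiny MDP:

```python
import numpy as np

# The two Bellman equations form a linear system:
#   V(s1) = 0 + 0.9 * V(s2)
#   V(s2) = 1 + 0.9 * V(s2)
# i.e. (I - gamma * P) V = r, solvable in closed form.
gamma = 0.9
P = np.array([[0.0, 1.0],    # s1 -> s2
              [0.0, 1.0]])   # s2 -> s2 (self-loop)
r = np.array([0.0, 1.0])     # reward on leaving each state
V = np.linalg.solve(np.eye(2) - gamma * P, r)
print(V)   # -> [ 9. 10.]
```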
To summarize, the two state values we computed are \(V^\pi(s_1) = 9\) and \(V^\pi(s_2) = 10\).
Math example: structure of the Bellman equation
The Bellman equation says: \(V^\pi(s) = \mathbb{E}_\pi\bigl[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s\bigr]\).
So the value at \(s\) is the expected “one-step reward plus discounted value of the next state.” That is a fixed-point equation: \(V^\pi\) is the function that satisfies it for all \(s\). Dynamic programming (policy iteration, value iteration) and temporal-difference learning both use this structure: DP by solving the system of equations, TD by updating estimates from samples of \(r + \gamma V(S')\).
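To illustrate the TD side, here is a tabular TD(0) sketch on this page's two-state MDP. Because both transitions are deterministic, each update uses the exact sample \(r + \gamma V(s')\); the step size and iteration count are arbitrary choices for the illustration.

```python
# Tabular TD(0) on the two-state MDP: s1 -> s2 (reward 0), s2 -> s2 (reward 1).
# Each update nudges V(s) toward the sampled target r + gamma * V(s').
gamma, alpha = 0.9, 0.05
V = {"s1": 0.0, "s2": 0.0}
for _ in range(4000):
    V["s1"] += alpha * (0.0 + gamma * V["s2"] - V["s1"])  # transition s1 -> s2
    V["s2"] += alpha * (1.0 + gamma * V["s2"] - V["s2"])  # self-loop in s2
print(V)  # approaches the fixed point V(s1) = 9, V(s2) = 10
```

The estimates converge to the same values the linear solve gives, which is the fixed-point property in action: TD stops moving only when the estimates satisfy the Bellman equation.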
Professor’s hints
- \(V^\pi\) is a function of state only; \(Q^\pi\) is a function of state and action. For a deterministic policy, \(V^\pi(s) = Q^\pi(s, \pi(s))\).
- The Bellman equation is a consistency condition: if you have the true \(V^\pi\), then the right-hand side equals the left-hand side. Learning algorithms try to make our estimates satisfy this (e.g. TD target \(r + \gamma \hat{V}(s')\)).
- In continuing tasks, \(\gamma < 1\) is needed so that \(V^\pi(s)\) is finite.
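The first hint, \(V^\pi(s) = Q^\pi(s, \pi(s))\) for a deterministic policy, can be checked directly on the tiny MDP from the worked example; since there is a single action per state, \(\pi(s)\) is forced (the action names below are hypothetical labels).

```python
gamma = 0.9
V = {"s1": 9.0, "s2": 10.0}  # values computed in the worked example

# Q(s, a) is the expected one-step return: r + gamma * V(next state).
Q = {("s1", "go"): 0.0 + gamma * V["s2"],     # = 9.0
     ("s2", "stay"): 1.0 + gamma * V["s2"]}   # = 10.0

# With a deterministic policy, V(s) equals Q at the policy's action.
print(Q[("s1", "go")], Q[("s2", "stay")])  # matches V(s1) = 9, V(s2) = 10
```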
Common pitfalls
- Confusing \(V^\pi\) with \(V^{*}\): \(V^\pi\) is for a fixed policy \(\pi\); \(V^{*}\) is the optimal value function (max over policies). The Bellman optimality equation is different (max over actions).
- Forgetting the discount: The Bellman equation uses \(\gamma V^\pi(s')\), not \(V^\pi(s')\). The discount is inside the expectation.
- Wrong expectation: The expectation is over actions from \(\pi\) and over next state/reward from \(p(s’,r|s,a)\). Don’t drop one of these.