Chapter 5: Value Functions

Learning objectives: define the state-value function \(V^\pi(s)\) as the expected return from state \(s\) under policy \(\pi\); write and solve the Bellman expectation equation for a small MDP; use the matrix form (a linear system) when the MDP is finite.

Concept and real-world RL: the state-value function \(V^\pi(s)\) is the expected (discounted) return starting from state \(s\) and following policy \(\pi\). It answers: “How good is it to be in this state if I follow this policy?” In games, \(V(s)\) is like the expected outcome from a board position; in navigation, it is the expected cumulative reward from a location. The Bellman expectation equation expresses \(V^\pi\) in terms of the immediate reward and the value of the next state; for finite MDPs it becomes the linear system \(V = r + \gamma P V\), which we can solve by matrix inversion or by iteration. ...
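The linear-system form can be checked directly in a few lines of NumPy. This is a minimal sketch: the transition matrix and rewards below are made-up numbers for a hypothetical 3-state MDP under a fixed policy, not taken from the chapter.

```python
import numpy as np

# Hypothetical 3-state MDP under a fixed policy pi:
# P[i, j] = probability of moving from state i to state j under pi,
# r[i]    = expected immediate reward in state i under pi.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],   # absorbing state with zero reward
])
r = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# Bellman expectation in matrix form: V = r + gamma * P V
# => (I - gamma * P) V = r, solvable directly for a finite MDP.
V = np.linalg.solve(np.eye(3) - gamma * P, r)

# The solution satisfies the Bellman identity exactly.
assert np.allclose(V, r + gamma * P @ V)
```

Solving the system once is exact but costs a matrix factorization; for large state spaces, iterative evaluation (repeatedly applying \(V \leftarrow r + \gamma P V\)) is the usual alternative.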

March 10, 2026 · 3 min · 620 words · codefrydev

Chapter 6: The Bellman Equations

Learning objectives: derive the Bellman optimality equation for \(Q^*(s,a)\) from the definition of the optimal action value; contrast the optimality equation (max over actions) with the expectation equation (average over actions under \(\pi\)); explain why the optimality equations are nonlinear and how algorithms (e.g. value iteration) handle them.

Concept and real-world RL: the optimal action-value function \(Q^*(s,a)\) is the expected return from state \(s\), taking action \(a\), then acting optimally. The Bellman optimality equation states that \(Q^*(s,a)\) equals the expected immediate reward plus \(\gamma\) times the maximum over next-state action values (not an average under a policy). This “max” makes the system nonlinear: the optimal policy is greedy with respect to \(Q^*\), and \(Q^*\) is the fixed point of this equation. Value iteration and Q-learning are built on this; in practice, we approximate \(Q^*\) with tables or function approximators. ...
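A minimal sketch of how value iteration handles the nonlinearity: because of the max we cannot solve a linear system, so we iterate the optimality operator to its fixed point. The 2-state, 2-action MDP below uses made-up numbers for illustration, not figures from the chapter.

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'] = transition probability,
# R[s, a] = expected immediate reward. Numbers are illustrative.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions under action 0
    [[0.1, 0.9], [0.8, 0.2]],   # transitions under action 1
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Q-value iteration: repeatedly apply the Bellman optimality operator
#   Q(s,a) <- R(s,a) + gamma * sum_{s'} P(s'|s,a) * max_{a'} Q(s',a').
# The max makes this nonlinear, but the operator is a gamma-contraction,
# so iteration converges to the unique fixed point Q*.
Q = np.zeros((2, 2))
for _ in range(500):
    V = Q.max(axis=1)            # greedy value of each state
    Q = R + gamma * (P @ V).T    # (P @ V) has shape (action, state)

# At convergence, Q is a fixed point of the optimality operator.
assert np.allclose(Q, R + gamma * (P @ Q.max(axis=1)).T)
```

The greedy policy is then read off as `Q.argmax(axis=1)`; Q-learning approximates the same fixed point from sampled transitions instead of a known model.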

March 10, 2026 · 3 min · 589 words · codefrydev

Chapter 27: Dueling DQN

Learning objectives: implement the dueling architecture: a shared backbone, then a value stream \(V(s)\) and an advantage stream \(A(s,a)\), combined as \(Q(s,a) = V(s) + \big(A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a')\big)\); understand why separating \(V\) and \(A\) can help when the value of the state is similar across actions (e.g. in safe states); compare learning speed and final performance with standard DQN on CartPole.

Concept and real-world RL: in many states, the value of being in that state is similar regardless of the action (e.g. when no danger is nearby). The dueling architecture represents \(Q(s,a) = V(s) + A(s,a)\), but for identifiability we use \(Q(s,a) = V(s) + \big(A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a')\big)\). The network learns \(V(s)\) and \(A(s,a)\) in separate heads after a shared feature layer. This can speed up learning when the advantage (the difference between actions) is small in many states. The architecture is used in Rainbow and other modern DQN variants. ...
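The aggregation step is easy to sketch in NumPy. This shows only the combining formula, not a full network; the head outputs below are made-up numbers, not values from a trained model.

```python
import numpy as np

def dueling_q(v, adv):
    """Combine value and advantage heads: Q = V + (A - mean_a A).

    v:   per-state values from the value stream, shape (batch,)
    adv: per-action advantages from the advantage stream,
         shape (batch, n_actions)
    """
    return v[:, None] + (adv - adv.mean(axis=1, keepdims=True))

# Made-up head outputs for one state with three actions.
v = np.array([3.0])
adv = np.array([[1.0, 2.0, 3.0]])
q = dueling_q(v, adv)

# Subtracting the mean advantage pins the mean of Q(s, .) to V(s),
# which makes the V/A decomposition identifiable.
assert np.allclose(q, [[2.0, 3.0, 4.0]])
assert np.allclose(q.mean(axis=1), v)
```

Without the mean-subtraction, adding a constant to \(V(s)\) and subtracting it from every \(A(s,a)\) would leave \(Q\) unchanged, so the two heads would not be uniquely determined.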

March 10, 2026 · 3 min · 577 words · codefrydev

Value Functions and Bellman Equation

This page covers the value functions and Bellman equation you need for the preliminary assessment: the state-value \(V^\pi(s)\), the action-value \(Q^\pi(s,a)\), and the Bellman expectation equation for \(V^\pi\).

Why this matters for RL: value functions are the expected return from a state (or state-action pair) under a policy. They are the main object we estimate in value-based methods (e.g. TD, Q-learning) and appear in actor-critic methods as the critic. The Bellman equation is the recursive identity connecting the value at one state to the immediate reward and the values at successor states; it is the basis of dynamic programming and TD learning. ...
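As a concrete illustration of the recursive identity, here is iterative policy evaluation on a made-up 3-state chain; the transitions and rewards are hypothetical, chosen so the fixed point is easy to verify by hand.

```python
import numpy as np

# Hypothetical deterministic chain under a fixed policy:
# state 0 -> state 1 -> state 2 (absorbing); reward 1 in state 1.
P = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],   # absorbing state
])
r = np.array([0.0, 1.0, 0.0])
gamma = 0.5

# Repeatedly apply the Bellman expectation operator V <- r + gamma * P V;
# it is a gamma-contraction, so the sweep converges to V^pi.
V = np.zeros(3)
for _ in range(100):
    V_new = r + gamma * P @ V
    if np.max(np.abs(V_new - V)) < 1e-12:
        break
    V = V_new

# By hand: V(2) = 0, V(1) = 1 + gamma * V(2) = 1, V(0) = gamma * V(1) = 0.5.
assert np.allclose(V, [0.5, 1.0, 0.0])
```

The same recursion, evaluated on sampled transitions rather than the full matrix, is exactly the TD(0) update mentioned above.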

March 10, 2026 · 5 min · 906 words · codefrydev