This page covers function approximation and deep RL concepts you need for the preliminary assessment: why we need FA, the policy gradient update, exploration in DQN, experience replay, and the advantage of actor-critic.
Why this matters for RL
In large or continuous state spaces we cannot store a value per state; we use a parameterized function (e.g. neural network) to approximate values or policies. That leads to policy gradient methods (maximize return) and value-based methods with FA (e.g. DQN). DQN uses experience replay and exploration (e.g. ε-greedy); actor-critic combines a policy (actor) and a value function (critic) for lower-variance policy gradients. You need to understand why FA is necessary and how these pieces fit together.
Learning objectives
Explain why function approximation is needed; write the policy gradient parameter update; name exploration strategies in DQN; explain experience replay and actor-critic.
Core concepts
- Function approximation (FA): Represent \(V(s)\) or \(Q(s,a)\) (or the policy) with a parameterized function (e.g. \(V(s; w)\), \(Q(s,a; \theta)\)). We generalize from seen states to unseen ones and can handle huge or continuous spaces.
- Policy gradient: Maximize expected return \(J(\theta)\) by gradient ascent: \(\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)\). The gradient is given by the policy gradient theorem (involving \(\nabla_\theta \log \pi(a|s;\theta)\) and returns or advantages).
- Exploration in DQN: ε-greedy (with probability ε take a random action) and noisy networks (learnable noise in weights) are common.
- Experience replay: Store transitions \((s,a,r,s')\) in a buffer and sample random minibatches to train the Q-network. Breaks correlation between consecutive updates and reuses data.
- Actor-critic: The actor is the policy; the critic is a value function (e.g. \(V(s)\) or \(A(s,a)\)). The critic reduces the variance of policy gradient estimates (e.g. by using a baseline or advantage), leading to faster and more stable learning than plain REINFORCE.
Illustration (learning with FA): With function approximation (e.g. a neural network), episode return typically improves over training. [Figure: a typical learning curve, e.g. DQN or policy gradient on a simple environment.]
Worked problems (with explanations)
1. Why function approximation (Q20)
Q: Why is function approximation necessary in RL for large or continuous state spaces?
Tabular methods store one number per state (or per state-action pair). When the state space is huge (e.g. \(10^{20}\) states) or continuous (e.g. \(\mathbb{R}^n\)), we cannot store or visit every state. Function approximation uses a parameterized function (e.g. a neural network with a fixed number of parameters) to approximate \(V(s)\) or \(Q(s,a)\) for any \(s\) (and \(a\)). So we generalize from the states we have seen to unseen states; the number of parameters is much smaller than the number of states. That makes learning feasible in large or continuous spaces. In deep RL, the “function” is usually a neural network. We don’t learn a separate value for each state; we learn weights that map state (and possibly action) to a value. That’s why we can apply RL to images, high-dimensional sensors, and continuous control.
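The parameter-counting argument can be sketched concretely. A minimal illustration (the feature map and weights here are invented for the example, not part of the question): a linear approximator with just two parameters assigns a value to every real-valued state, tabular storage not required.

```python
def phi(s):
    """Hypothetical feature map: a bias term plus the raw state."""
    return [1.0, float(s)]

def V(s, w):
    """Linear value-function approximation: V(s; w) = w . phi(s)."""
    return sum(w_i * f_i for w_i, f_i in zip(w, phi(s)))

# Two parameters cover every real-valued state -- no table needed.
w = [0.5, 2.0]
value = V(3.0, w)   # 0.5 * 1.0 + 2.0 * 3.0 = 6.5
```

A neural network plays the same role as `phi` and `w` here, just with a learned, nonlinear feature map.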
2. Policy gradient update (Q21)
Q: In supervised learning, you minimize a loss function \(L(\theta)\) using gradient descent: \(\theta \leftarrow \theta - \alpha \nabla_\theta L\). What is the analogous update in policy gradient methods?
In policy gradient we maximize the expected return \(J(\theta)\), so we use gradient ascent:
\(\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)\). Here \(J(\theta)\) is the expected return (e.g. average reward per episode or expected discounted return), and \(\nabla_\theta J\) is given by the policy gradient theorem (it involves \(\nabla_\theta \log \pi(a|s;\theta)\) and the return or advantage). So we add a multiple of the gradient instead of subtracting, because we want to increase return, not decrease a loss. In supervised learning we minimize loss (e.g. cross-entropy); in policy gradient we maximize performance. So the sign is opposite: plus for policy gradient, minus for loss minimization. The gradient \(\nabla_\theta J\) tells us how to change \(\theta\) to get more return; we take a step in that direction.
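One ascent step can be sketched for a two-action softmax policy (a hedged sketch: the return value and step size are made up for the example, and the softmax log-gradient identity \(\partial \log \pi(a)/\partial \theta_k = \mathbb{1}[k=a] - \pi(k)\) is standard). After the update, the probability of the well-rewarded action increases.

```python
import math

def softmax_probs(theta):
    """Policy pi(a; theta): softmax over action preferences."""
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(theta, action, G, alpha=0.1):
    """Gradient ASCENT: theta <- theta + alpha * G * grad log pi(action; theta).
    For softmax, d log pi(a)/d theta_k = 1{k == a} - pi(k)."""
    probs = softmax_probs(theta)
    return [t + alpha * G * ((1.0 if k == action else 0.0) - probs[k])
            for k, t in enumerate(theta)]

theta = [0.0, 0.0]                                 # uniform policy: [0.5, 0.5]
theta = reinforce_step(theta, action=1, G=5.0)     # action 1 earned a high return
# theta is now [-0.25, 0.25]; pi(1) has increased above 0.5
```

Note the plus sign in the update: with a loss we would subtract, but here we climb the return surface.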
3. Exploration in Deep RL (Q22)
Q: Name two common exploration strategies used in Deep Q-Networks.
Exploration is needed so we don’t get stuck with a suboptimal policy. ε-greedy is simple and widely used; noisy networks provide state-dependent exploration and can be more sample-efficient. Other options include UCB-style bonuses, intrinsic motivation, and entropy regularization in actor-critic.
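In practice, ε is usually annealed over training so the agent explores heavily at first and mostly exploits later. A minimal schedule sketch (the constants are illustrative, not from the question):

```python
def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start down to eps_end
    over the first decay_steps environment steps, then hold."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

# Early training: almost always random; late training: mostly greedy.
epsilon_at(0)        # 1.0
epsilon_at(5_000)    # 0.525
epsilon_at(50_000)   # 0.05 (floor)
```

Keeping a small floor (here 0.05) preserves some exploration even late in training.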
4. Experience replay (Q23)
Q: Why is experience replay used in DQN? What problem does it solve?
Experience replay stores many transitions \((s, a, r, s')\) in a buffer and samples random minibatches from this buffer to train the Q-network. It addresses two issues: consecutive transitions are strongly correlated (which destabilizes gradient updates), and each transition would otherwise be used only once (which wastes data). Without replay, DQN would update on a stream of correlated data and could diverge or learn slowly; replay makes the training distribution more stationary and diverse, and reuses past experience. The trade-off is that we learn from off-policy data (old transitions from past policies), which Q-learning already supports because it is off-policy.
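The buffer-and-minibatch mechanism can be sketched as follows (a minimal sketch; the capacity, batch size, and toy transitions are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next) transitions."""

    def __init__(self, capacity=100_000):
        # Oldest transitions drop off automatically once full.
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # A uniformly random minibatch breaks the correlation
        # between consecutive environment steps.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=1_000)
for t in range(50):                 # toy transitions from a rollout
    buf.push(t, 0, 0.0, t + 1)
batch = buf.sample(8)               # 8 decorrelated transitions for one update
```

Each sampled minibatch feeds one gradient step on the Q-network; a single transition can be sampled many times over training.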
5. Actor-critic advantage (Q24)
Q: What is the advantage of using an actor-critic method over pure policy gradient (REINFORCE)?
REINFORCE (pure policy gradient) uses the full return \(G_t\) from the current time step as the scale for \(\nabla_\theta \log \pi(a_t|s_t;\theta)\). The variance of \(G_t\) can be very high (returns vary a lot across trajectories), so learning is slow and unstable. Actor-critic methods use a critic (a value function \(V(s)\) or advantage \(A(s,a)\)) to replace or reduce the return in the gradient estimate. For example, we might use \(A_t = G_t - V(s_t)\) (advantage = return minus baseline) or a TD-based advantage. The critic reduces variance because it subtracts a baseline (so we only reinforce better than average), and/or because it uses lower-variance estimates (e.g. TD) instead of full returns. That typically leads to faster and more stable learning than REINFORCE. The actor is the policy we improve; the critic tells us “how good was that action?” Using the critic, we don’t need to wait for the full return and we reduce variance, so we can update every step and learn more efficiently. Methods like A2C, A3C, PPO, and SAC are actor-critic style.
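The TD-based advantage mentioned above can be sketched in one line (γ = 0.99 and the sample numbers are arbitrary choices for the example):

```python
def td_advantage(r, v_s, v_s_next, gamma=0.99):
    """One-step TD error as an advantage estimate:
    A(s, a) ~ r + gamma * V(s') - V(s).
    Positive => the action did better than the critic expected."""
    return r + gamma * v_s_next - v_s

# Reward of 1.0 while the state value stayed flat: better than expected.
adv = td_advantage(r=1.0, v_s=2.0, v_s_next=2.0)   # 1.0 + 0.99*2.0 - 2.0 = 0.98
```

This quantity scales \(\nabla_\theta \log \pi(a_t|s_t;\theta)\) in place of the full return \(G_t\), so the actor can be updated every step instead of waiting for the episode to end.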
Code snippet: ε-greedy action (with explanation)
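A minimal sketch of ε-greedy action selection, using `Q_s` and `epsilon` as named in the explanation below:

```python
import random

def epsilon_greedy(Q_s, epsilon):
    """Select an action given Q_s, the vector of Q-values
    for the current state (one entry per action)."""
    if random.random() < epsilon:
        # Explore: ignore Q and pick a uniformly random action.
        return random.randrange(len(Q_s))
    # Exploit: pick the action with the largest Q-value.
    return max(range(len(Q_s)), key=lambda a: Q_s[a])
```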
Explanation
Q_s is the vector of Q-values for the current state (one per action). With probability epsilon we ignore Q and pick a random action; otherwise we pick the action with the largest Q-value. This is the standard exploration strategy for DQN and many tabular methods. The same idea applies when Q comes from a neural network: we pass the state through the network to get Q_s, then apply ε-greedy.
Professor’s hints
- Policy gradient: we maximize \(J\), so update is \(\theta \leftarrow \theta + \alpha \nabla_\theta J\). Don’t confuse with loss minimization.
- Experience replay is a key ingredient of DQN; target networks (separate network for the TD target) are another; both improve stability.
- Actor-critic = policy (actor) + value function (critic). The critic is used to form a baseline or advantage so the actor’s gradient has lower variance.
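The target-network hint can be sketched as follows (a hedged sketch: the dict-backed lookup standing in for the frozen target network, and all numbers, are purely illustrative):

```python
def td_target(r, s_next, q_target, actions, gamma=0.99, done=False):
    """DQN training target: y = r + gamma * max_a Q_target(s', a),
    or just r at a terminal state. q_target is the *frozen* target
    network, updated only periodically, which stabilizes training."""
    if done:
        return r
    return r + gamma * max(q_target(s_next, a) for a in actions)

# Hypothetical frozen target values for illustration.
Q_frozen = {("s1", 0): 1.0, ("s1", 1): 3.0}
y = td_target(r=0.5, s_next="s1",
              q_target=lambda s, a: Q_frozen[(s, a)],
              actions=[0, 1])
# y = 0.5 + 0.99 * 3.0 = 3.47
```

The online network is then regressed toward `y`; because `y` comes from the frozen copy rather than the network being trained, the target does not shift with every gradient step.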
Common pitfalls
- Wrong sign for policy gradient: We add the gradient (ascend), not subtract. Subtracting would minimize return.
- Thinking replay is on-policy: Replay uses old data from past policies; Q-learning is off-policy so that’s fine. For on-policy methods (e.g. many actor-critic variants), we usually don’t use replay or use it carefully (e.g. short buffers).
- Confusing “actor” with “behavior policy”: The actor is the policy we are improving. In actor-critic we typically use the same policy to collect data (on-policy) or combine with off-policy corrections.