RL bugs are uniquely hard to catch because a wrong implementation often still trains — it just learns more slowly or converges to a suboptimal policy. This guide covers the most common bugs, how to detect them, and how to fix them.
The Golden Rule: Test on a Trivial Environment First
Before running DQN on Atari, run it on a 2-state MDP you can solve by hand. If your algorithm can’t learn that, it won’t learn anything. This is called a sanity check.
Sanity check targets by algorithm:
- Q-learning / SARSA: converge to correct Q-values on a 3×3 gridworld in <1000 episodes.
- DQN: solve CartPole (average return ≥ 195 over 100 consecutive episodes) in <100k steps.
- PPO: solve CartPole in <50k steps.
- Linear FA: converge on a 5-state random walk in <5000 episodes.
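To make this concrete, here is a minimal sketch of such a sanity check (hypothetical, tabular): a one-step MDP whose optimal Q-values you can compute by hand, so convergence can be asserted directly.

```python
def sanity_check(episodes=200, alpha=0.5):
    # One-step MDP with a single state and two actions: action 1 pays +1,
    # action 0 pays 0, and the episode ends immediately. The optimal
    # Q-values are known by hand: Q*(a=0) = 0.0, Q*(a=1) = 1.0.
    Q = [0.0, 0.0]
    for ep in range(episodes):
        a = ep % 2                  # sweep both actions deterministically
        r = 1.0 if a == 1 else 0.0
        Q[a] += alpha * (r - Q[a])  # terminal transition: the target is just r
    return Q

Q = sanity_check()
assert abs(Q[1] - 1.0) < 1e-6 and abs(Q[0]) < 1e-6  # matches hand-computed values
```

If your algorithm can't pass a check like this, debug here before moving to anything larger.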
Common Bug 1: Wrong Sign on Reward
Symptoms: Agent learns to avoid the goal; training reward goes down over time.
Cause: Reward is negated: -1 where you meant +1, or the reward is computed as goal - current when it should be current - goal.
Example bug:
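A sketch of how the sign flip typically looks (hypothetical gridworld reward function):

```python
def compute_reward(state, goal):
    # BUG: sign flipped -- the agent is penalized for reaching the goal
    return -1.0 if state == goal else 0.0

def compute_reward_fixed(state, goal):
    # Intended: +1 at the goal, 0 elsewhere
    return 1.0 if state == goal else 0.0
```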
Fix: Print a few episode rewards manually. Is the agent rewarded when you expect it to be?
Common Bug 2: Not Using the Done Flag in TD Target
Symptoms: Agent underestimates values near terminal states; learning is slow and noisy.
Cause: When done=True, the TD target should be just r (no bootstrap). Including γ*V(s') for a terminal state adds phantom future value.
Example bug:
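A sketch of the bug and its fix for a TD(0) target (illustrative function names):

```python
def td_target_buggy(r, gamma, v_next, done):
    # BUG: bootstraps from s' even when the episode has ended,
    # adding phantom future value at terminal states
    return r + gamma * v_next

def td_target_fixed(r, gamma, v_next, done):
    # When done, the target is just r: terminal states have no future value
    return r + gamma * v_next * (1.0 - done)

# Terminal transition with r=1: the buggy target leaks in gamma * V(s')
print(td_target_buggy(1.0, 0.5, 4.0, True))   # 3.0
print(td_target_fixed(1.0, 0.5, 4.0, True))   # 1.0
```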
Common Bug 3: Gamma Applied Incorrectly in Return Computation
Symptoms: Computed returns are too high or too low; value estimates don’t match hand-computed targets.
Cause: Common forms of this bug:
- Forgetting to apply the discount: G += r instead of G += gamma**t * r
- Applying γ once instead of cumulatively in a loop
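A correct cumulative-discount implementation, sketched in both forward and backward form (the two are equivalent):

```python
def return_forward(rewards, gamma):
    # G_0 = sum_t gamma**t * r_t, discount applied cumulatively
    return sum(gamma**t * r for t, r in enumerate(rewards))

def return_backward(rewards, gamma):
    # Equivalent backward recursion: G_t = r_t + gamma * G_{t+1}
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G
```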
Common Bug 4: Target Network Updated Every Step
Symptoms: DQN doesn’t converge, loss oscillates wildly.
Cause: The target network should update every N steps (e.g. N=100 or N=1000), not every step. Updating every step makes the target non-stationary (the target changes as fast as the main network), removing the stabilization benefit.
Fix:
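A sketch of the fix, with plain dicts standing in for network parameters (in PyTorch you would copy with load_state_dict; the names here are illustrative):

```python
TARGET_UPDATE_EVERY = 100  # sync period N; common values range from 100 to 10000

def train_loop(total_steps):
    online = {"w": 0.0}
    target = dict(online)
    syncs = 0
    for step in range(1, total_steps + 1):
        online["w"] += 0.01                    # stand-in for a gradient step
        if step % TARGET_UPDATE_EVERY == 0:    # every N steps, NOT every step
            target = dict(online)
            syncs += 1
    return syncs

print(train_loop(1000))  # 10 syncs, not 1000
```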
Common Bug 5: Wrong Q-learning Target (Using Current Action)
Symptoms: Off-policy Q-learning accidentally becomes on-policy (SARSA).
Cause: Using the next action actually taken in the target, instead of the max.
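The two targets side by side, with tabular Q stored as a dict of lists (illustrative):

```python
def q_learning_target(Q, r, s_next, gamma, done):
    # Off-policy: bootstrap from the greedy (max) action in s'
    return r + gamma * max(Q[s_next]) * (1 - done)

def sarsa_target(Q, r, s_next, a_next, gamma, done):
    # On-policy: bootstrap from the action a_next actually taken in s'
    return r + gamma * Q[s_next][a_next] * (1 - done)

Q = {0: [1.0, 3.0]}
print(q_learning_target(Q, 0.0, 0, 0.5, 0))  # 1.5 (uses max = 3.0)
print(sarsa_target(Q, 0.0, 0, 0, 0.5, 0))    # 0.5 (uses Q[0][0] = 1.0)
```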
Logging Strategy
Add these logs to every RL training loop:
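A minimal print-based sketch (swap in your own logger; the field names are illustrative):

```python
def format_log(episode, ep_return, loss, mean_q, epsilon):
    # One line per episode: return, TD loss, mean Q-value, exploration rate
    return (f"ep={episode:5d}  return={ep_return:8.2f}  "
            f"loss={loss:8.4f}  mean_q={mean_q:7.2f}  eps={epsilon:.3f}")

print(format_log(42, 195.0, 0.0213, 18.7, 0.05))
```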
What to watch:
- Return should generally increase over time (not immediately, but trend upward).
- Loss should decrease or stabilize (not grow indefinitely).
- Q-values should be in a reasonable range (not exploding to ±∞).
- Epsilon should decrease if you’re using epsilon decay.
5 Find-the-Bug Exercises
Bug Exercise 1
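An illustrative reconstruction of this exercise as a tabular TD(0) update; find the bug:

```python
# Find the bug:
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    delta = r + gamma * V[s_next] * (1 - done) - V[s]
    V[s] += delta
    return V
```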
Answer
The corrected update is V[state] += alpha * delta.
Bug Exercise 2
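An illustrative reconstruction consistent with the answer below; is this Q-learning or SARSA?

```python
def update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    next_action = max(range(len(Q[s_next])), key=lambda i: Q[s_next][i])  # argmax
    target = r + gamma * Q[s_next][next_action]
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```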
Answer
This is actually correct Q-learning: next_action = argmax Q[s_next] is the greedy action, not the action the behavior policy actually takes next. The confusion comes from the variable name next_action; the code still bootstraps from the max Q-value. A common mistaken version, however, would be:
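For illustration, that mistaken (SARSA-style) variant might look like:

```python
def update_mistaken(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # a_next is the action the epsilon-greedy behavior policy actually takes
    # in s_next -- bootstrapping from it makes this SARSA, not Q-learning
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```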
To make it unambiguously Q-learning: target = r + gamma * max(Q[s_next]).
Bug Exercise 3
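An illustrative reconstruction using the rewards and discount discussed in the answer:

```python
def compute_return(rewards, gamma):
    # Is the reverse iteration a bug?
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

print(round(compute_return([0, 1, 0, 0, 1], gamma=0.9), 4))  # 1.5561
```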
Answer
This is correct. Iterating in reverse and accumulating G = r + γG is equivalent to G_0 = Σ γ^t r_t. With rewards = [0, 1, 0, 0, 1] and γ = 0.9, the backward pass gives G = 1, then 0.9, 0.81, 1.729, and finally 1.5561 = 0.9·1 + 0.9⁴·1, the correct discounted return from step 0. Many students suspect the reverse iteration is a bug; it is actually the standard efficient implementation.
Bug Exercise 4
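Both the snippet and the answer for this exercise are reconstructed here as an illustrative stand-in (the planted bug mirrors Common Bug 4):

```python
# Find the bug:
def dqn_loop(steps):
    online = {"w": 0.0}
    target = dict(online)
    for step in range(steps):
        online["w"] += 0.01    # stand-in for a gradient step on the online net
        target = dict(online)  # copy parameters into the target network
    return online, target
```

The answer, for this stand-in: the target network is copied on every single step, so it always tracks the online network exactly and provides no stabilization; the copy should fire only every N steps (e.g. if step % 1000 == 0).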
Answer
Bug Exercise 5
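An illustrative reconstruction of the exercise (batched target computation for DQN-style updates):

```python
# Find the bug:
def compute_targets(rewards, dones, q_next_max, gamma=0.99):
    return [r + gamma * q for r, q in zip(rewards, q_next_max)]
```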
Answer
The bug: the done flags are never used in the target computation. When d=True (the episode ended), the target should be just r, not r + γ * Q_target(s_next). Fix:
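Sketched with the same illustrative signature:

```python
def compute_targets_fixed(rewards, dones, q_next_max, gamma=0.99):
    # (1 - d) zeroes the bootstrap term on terminal transitions
    return [r + gamma * q * (1.0 - d)
            for r, q, d in zip(rewards, q_next_max, dones)]

print(compute_targets_fixed([1.0, 1.0], [True, False], [4.0, 4.0], gamma=0.5))
# [1.0, 3.0]
```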
This is one of the most common DQN bugs and causes instability near episode boundaries.