By Phase 5, you can implement standard RL algorithms. Reading the original papers lets you go further: understand design choices, see ablations, and extend methods. This guide teaches you to read RL papers efficiently.
Structure of a Typical RL Paper
Most RL papers follow this structure:
| Section | What it contains | How much to read |
|---|---|---|
| Abstract | Summary of problem, method, and main result | Always, first |
| Introduction | Motivation, related work, contributions | Skim on first read |
| Background | MDP formulation, notation, prerequisites | Read if new notation |
| Method | The algorithm (often the core section) | Read carefully |
| Experiments | Environments, baselines, results, ablations | Read for main results |
| Conclusion | Summary and future work | Skim |
| Appendix | Hyperparameters, proofs, extra experiments | Reference as needed |
First read strategy: Abstract → Method → Experiments (main figure) → Introduction. Save appendix for implementation.
How to Read the Math
RL papers use largely consistent notation, though individual symbols vary between authors. Map it to code:
| Paper notation | Code equivalent | Meaning |
|---|---|---|
| s, a, r, s' | state, action, reward, next_state | One transition |
| π(a \| s; θ) | policy_net(state) | Policy output (probabilities) |
| V(s; w) or V_w(s) | value_net(state) | Value function |
| ∇_θ J(θ) | loss.backward(); optimizer.step() | Policy gradient update |
| E_π[…] | mean([... for ep in episodes]) | Expectation under π |
| τ | trajectory | A list of (s,a,r) tuples |
| T | max_steps or episode_length | Horizon |
| δ_t | td_error | Temporal difference error |
| γ | gamma | Discount factor |
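As a small illustration of the table, the sketch below maps γ, a trajectory's rewards, and δ_t to plain Python. The function names `discounted_return` and `td_error` and the toy numbers are assumptions made for this example, not notation taken from a specific paper.

```python
def discounted_return(rewards, gamma):
    """G_0 = sum_t gamma^t * r_t for one trajectory's reward list."""
    g = 0.0
    # Accumulate backwards so each step is g = r_t + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td_error(r, v_s, v_s_next, gamma, done):
    """delta_t = r + gamma * V(s') - V(s); V(s') is dropped at terminal states."""
    bootstrap = 0.0 if done else gamma * v_s_next
    return r + bootstrap - v_s


# Toy trajectory with three rewards (hypothetical values).
rewards = [1.0, 0.0, 2.0]
G = discounted_return(rewards, gamma=0.5)  # 1 + 0.5*0 + 0.25*2 = 1.5
delta = td_error(r=1.0, v_s=2.0, v_s_next=4.0, gamma=0.5, done=False)  # 1 + 2 - 2 = 1.0
```

Writing the return as a backward recursion mirrors how many papers define G_t = r_t + γ G_{t+1}, which makes the equation and the loop easy to compare line by line.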
Paper Walkthrough 1: DQN (Mnih et al., 2015)
Full title: “Human-level control through deep reinforcement learning”
The core idea (method section in one paragraph): Use a neural network Q(s, a; θ) instead of a Q-table. Train with TD learning (Q-learning target y = r + γ max_{a’} Q(s’, a’; θ⁻)) where θ⁻ are the parameters of a target network (updated less frequently). Sample transitions from an experience replay buffer to break correlations.
Key equations to map to code:
- TD target: `y = r + gamma * target_net(s_next).max()` (when not done)
- Loss: `L = mean((y - online_net(s)[a])**2)` (MSE over batch)
- Target network update: `target_net.load_state_dict(online_net.state_dict())` every C steps
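The TD target and loss above can be checked by hand on a tiny batch. The following is a minimal sketch in plain Python, with lists standing in for network outputs; the helper names `dqn_targets` and `dqn_loss` and all numbers are hypothetical and chosen only so the arithmetic is easy to verify.

```python
def dqn_targets(rewards, next_q_target, dones, gamma):
    """y = r + gamma * max_a' Q(s', a'; theta^-), with no bootstrap when done."""
    targets = []
    for r, q_next, done in zip(rewards, next_q_target, dones):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(r + bootstrap)
    return targets


def dqn_loss(targets, q_online, actions):
    """Mean squared error between y and Q(s, a; theta) for the actions taken."""
    errors = [(y - q[a]) ** 2 for y, q, a in zip(targets, q_online, actions)]
    return sum(errors) / len(errors)


# Hypothetical batch of two transitions with two actions each.
rewards = [1.0, 0.0]
dones = [False, True]
next_q_target = [[0.0, 2.0], [5.0, 5.0]]  # stands in for target_net(s_next)
q_online = [[1.0, 0.5], [0.0, 1.0]]       # stands in for online_net(s)
actions = [0, 1]

y = dqn_targets(rewards, next_q_target, dones, gamma=0.5)  # [2.0, 0.0]
loss = dqn_loss(y, q_online, actions)  # ((2 - 1)^2 + (0 - 1)^2) / 2 = 1.0
```

Note how the second transition ignores its next-state values entirely because it is terminal; forgetting that mask is an easy mistake to make when translating the equation.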
Hyperparameters to note (listed in the paper’s Extended Data Table 1): replay size 1M transitions, batch size 32, target update every 10k steps, ε annealed over the first 1M frames, learning rate 0.00025.
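One way to keep these settings next to the code is a small config plus the schedules that depend on it. This is only a sketch: the dictionary keys, the helper names, the linear shape of the ε schedule, and the endpoint values 1.0 and 0.1 are assumptions of this example and should be checked against the paper's hyperparameter table.

```python
DQN_CONFIG = {
    "replay_size": 1_000_000,
    "batch_size": 32,
    "target_update_every": 10_000,
    "epsilon_decay_steps": 1_000_000,
    "learning_rate": 0.00025,
}


def epsilon_at(step, eps_start=1.0, eps_end=0.1,
               decay_steps=DQN_CONFIG["epsilon_decay_steps"]):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def should_sync_target(step, every=DQN_CONFIG["target_update_every"]):
    """True on the steps where the target network would be refreshed."""
    return step > 0 and step % every == 0
```

Calling `epsilon_at(step)` inside the action-selection code and `should_sync_target(step)` inside the training loop keeps both schedules tied to a single step counter.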
Common confusion: The paper uses two networks (online and target), but first-time readers often assume there is only one. Check: the loss gradient flows through online_net only; target_net is treated as a constant when computing y.
Paper Walkthrough 2: PPO (Schulman et al., 2017)
Full title: “Proximal Policy Optimization Algorithms”
The core idea: A clipped surrogate objective prevents large policy updates. Collect on-policy data, compute advantages with GAE, then run multiple epochs of mini-batch gradient ascent on the clipped objective.
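The GAE step mentioned above is a short backward recursion over a rollout. The sketch below assumes a single rollout with no episode boundary inside it; the function name `gae_advantages` and the default values of `gamma` and `lam` are illustrative assumptions, not values prescribed by this guide.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one rollout with no episode end inside.

    rewards[t] and values[t] = V(s_t) cover t = 0..T-1; last_value = V(s_T).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Recursion: A_t = delta_t + gamma * lam * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With `lam=0` this reduces to the one-step TD error, and with `lam=1` it becomes the discounted return minus the value baseline, which is a useful sanity check when comparing against a paper's definition.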
Key equation to map to code:
L_CLIP(θ) = E_t[min(r_t(θ)·A_t, clip(r_t(θ), 1-ε, 1+ε)·A_t)]
Where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) (probability ratio).
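The clipped objective translates almost term by term into code. Here is a minimal sketch in plain Python that takes log-probabilities rather than probabilities, which is a common implementation choice and an assumption of this example; the function name `ppo_clip_objective` and the default `eps=0.2` are likewise illustrative.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # r_t(theta) = pi_new / pi_old
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)


# Ratio 1.5 with A = +1 is clipped to 1.2; ratio 0.5 with A = -1 is clipped to -0.8.
objective = ppo_clip_objective(
    logp_new=[math.log(1.5), math.log(0.5)],
    logp_old=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
```

Because optimizers usually minimize, an implementation would typically use the negative of this value as its loss.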
Reading the experiments: Figure 3 compares PPO with A2C, TRPO, and other baselines on MuJoCo tasks. The key result: PPO matches or exceeds TRPO’s performance with a simpler implementation and better wall-clock time.
What to look for in ablations: Section 6.1 (Table 1) compares surrogate objectives, including a variant with neither clipping nor a KL penalty, whose performance degrades. This validates the design choice.
Paper Walkthrough 3: SAC (Haarnoja et al., 2018)
Full title: “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”
The core idea: Maximize expected return AND expected entropy. Off-policy (uses replay buffer). The entropy term encourages exploration and robustness. Temperature α controls the trade-off.
Key objective: J(π) = Σ_t E[r(s_t, a_t) + α H(π(·|s_t))]
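To see what the entropy term contributes, the objective can be evaluated along one sampled trajectory. The sketch below assumes a discrete action space so that the entropy is a finite sum, which is a simplification made only for this example; `entropy` and `soft_return` are hypothetical helper names.

```python
import math


def entropy(probs):
    """Shannon entropy H(pi(.|s)) = -sum_a p(a) * log p(a), in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def soft_return(rewards, action_probs, alpha):
    """Sum over t of r_t + alpha * H(pi(.|s_t)) along one sampled trajectory."""
    return sum(r + alpha * entropy(p) for r, p in zip(rewards, action_probs))


# Two steps: a uniform policy (entropy log 2) and a deterministic one (entropy 0).
value = soft_return(
    rewards=[1.0, 1.0],
    action_probs=[[0.5, 0.5], [1.0, 0.0]],
    alpha=0.2,
)  # 2.0 + 0.2 * log(2)
```

This is a single-sample estimate of the expectation in J(π); setting `alpha=0` recovers the ordinary undiscounted return.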
Reading strategy: This paper has more math than DQN/PPO. Start with the algorithm box (Algorithm 1) to see the training loop, then read the derivations to understand why each step is correct.
Twin critics trick: SAC uses two Q-networks and takes the minimum to reduce overestimation. Q = min(Q1(s,a), Q2(s,a)). Learn to spot these practical tricks in the algorithm box.
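A sketch of how the minimum over two critics might enter a TD target is shown below. The entropy-bonus form of the target follows a formulation commonly used in SAC implementations and is an assumption here; the exact update differs between versions of the algorithm, so compare it with Algorithm 1 in the paper. The function name and default values are hypothetical.

```python
def soft_td_target(r, done, q1_next, q2_next, logp_next, gamma=0.99, alpha=0.2):
    """y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s')), no bootstrap if done."""
    if done:
        return r
    min_q = min(q1_next, q2_next)  # pessimistic estimate across the two critics
    return r + gamma * (min_q - alpha * logp_next)


# The smaller critic value (2.0) is used: 1 + 0.5 * (2.0 + 0.5 * 1.0) = 2.25
y = soft_td_target(r=1.0, done=False, q1_next=3.0, q2_next=2.0,
                   logp_next=-1.0, gamma=0.5, alpha=0.5)
```

Taking the minimum biases the target downward, which counteracts the tendency of a single learned Q-function to overestimate.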
Tips for Efficient Paper Reading
- Read the algorithm box first. Most RL papers have a pseudocode box (Algorithm 1). Read it before the prose — it’s the clearest statement of the method.
- Ignore proofs on first read. Theorem statements are useful; proofs are for specialists and can be read later.
- Check the hyperparameter table. It is usually in the appendix. Copy these values when implementing, because many reported results depend on specific settings.
- Read related work last. It’s context, not content. Skim for names of methods you haven’t heard of.
- Implement one equation at a time. Map each equation to code before moving to the next. Don’t try to implement the whole paper at once.
- Run the official code. Many major RL papers have public code. When it is available, compare your implementation to theirs.
Paper Reading Checklist
For each paper you read, fill in:
- Problem: What problem does this paper solve?
- Key idea: One sentence — what is the novelty?
- Algorithm: Can I write the update rule in pseudocode?
- Environments: What benchmarks are used?
- Key result: What is the headline number/figure?
- Hyperparameters: Noted from appendix.
- Code: Official code found at ___.
Where to Find RL Papers
- arXiv cs.LG / cs.AI — preprints of most RL papers
- Papers With Code — papers + code + benchmarks
- Semantic Scholar — citation search, “influential papers”
- Conference proceedings: NeurIPS, ICML, ICLR, and ICRA (robotics) are where most RL work is published