How to Read RL Papers

By Phase 5, you can implement standard RL algorithms. Reading the original papers lets you go further: understand design choices, see ablations, and extend methods. This guide teaches you to read RL papers efficiently.

Structure of a Typical RL Paper

Most RL papers follow this structure:

Section	What it contains	How much to read
Abstract	Summary of problem, method, and main result	Always, first
Introduction	Motivation, related work, contributions	Skim on first read
Background	MDP formulation, notation, prerequisites	Read if new notation
Method	The algorithm (often the core section)	Read carefully
Experiments	Environments, baselines, results, ablations	Read for main results
Conclusion	Summary and future work	Skim
Appendix	Hyperparameters, proofs, extra experiments	Reference as needed

First read strategy: Abstract → Method → Experiments (main figure) → Introduction. Save appendix for implementation.

How to Read the Math

RL papers use consistent notation. Map it to code:

Paper notation	Code equivalent	Meaning
s, a, r, s'	`state, action, reward, next_state`	One transition
π(a\|s; θ)	`policy_net(state)`	Policy output (probabilities)
V(s; w) or V_w(s)	`value_net(state)`	Value function
∇_θ J(θ)	`loss.backward(); optimizer.step()`	Policy gradient update
E_π[…]	`mean([... for ep in episodes])`	Expectation under π
τ	`trajectory`	A list of (s,a,r) tuples
T	`max_steps` or `episode_length`	Horizon
δ_t	`td_error`	Temporal difference error
γ	`gamma`	Discount factor

Paper Walkthrough 1: DQN (Mnih et al., 2015)

Full title: “Human-level control through deep reinforcement learning”

The core idea (method section in one paragraph): Use a neural network Q(s, a; θ) instead of a Q-table. Train with TD learning (Q-learning target y = r + γ max_{a’} Q(s’, a’; θ⁻)) where θ⁻ are the parameters of a target network (updated less frequently). Sample transitions from an experience replay buffer to break correlations.

Key equations to map to code:

TD target: y = r + gamma * target_net(s_next).max() (when not done)
Loss: L = mean((y - online_net(s)[a])**2) (MSE over batch)
Target network update: target_net.load_state_dict(online_net.state_dict()) every C steps

Hyperparameters to note (from appendix): replay size 1M, batch 32, target update every 10k steps, ε decay 1M steps, learning rate 0.00025.

Common confusion: The paper uses two networks (online and target) but early readers think they’re using one. Check: the loss gradient flows through online_net only, not target_net.

Paper Walkthrough 2: PPO (Schulman et al., 2017)

Full title: “Proximal Policy Optimization Algorithms”

The core idea: Clipped surrogate objective to prevent large policy updates. Collect on-policy data, compute advantages with GAE, update with multiple epochs of mini-batch gradient ascent subject to the clipping constraint.

Key equation to map to code:

L_CLIP(θ) = E_t[min(r_t(θ)·A_t, clip(r_t(θ), 1-ε, 1+ε)·A_t)]

Where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) (probability ratio).

1
2
3
4
# PPO clipped loss (pseudocode)
ratio = new_probs / old_probs.detach()
clipped_ratio = ratio.clamp(1 - epsilon, 1 + epsilon)
loss = -torch.min(ratio * advantages, clipped_ratio * advantages).mean()

Reading the experiments: Table 1 compares PPO to A2C, TRPO on MuJoCo. The key result: PPO matches TRPO’s performance with simpler implementation and better wall-clock time.

What to look for in ablations: Section 5 shows what happens without clipping (performance degrades). This validates the design choice.

Paper Walkthrough 3: SAC (Haarnoja et al., 2018)

Full title: “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”

The core idea: Maximize expected return AND expected entropy. Off-policy (uses replay buffer). The entropy term encourages exploration and robustness. Temperature α controls the trade-off.

Key objective: J(π) = Σ_t E[r(s_t, a_t) + α H(π(·|s_t))]

Reading strategy: This paper has more math than DQN/PPO. Start with the algorithm box (Algorithm 1) to see the training loop, then read the derivations to understand why each step is correct.

Three critics trick: SAC uses two Q-networks and takes the minimum to reduce overestimation. Q = min(Q1(s,a), Q2(s,a)). Learn to spot these practical tricks in the algorithm box.

Tips for Efficient Paper Reading

Read the algorithm box first. Most RL papers have a pseudocode box (Algorithm 1). Read it before the prose — it’s the clearest statement of the method.
Ignore proofs on first read. Theorem statements are useful; proofs are for specialists and can be read later.
Check the hyperparameter table. Always in the appendix. Copy these when implementing — many papers require specific settings.
Read related work last. It’s context, not content. Skim for names of methods you haven’t heard of.
Implement one equation at a time. Map each equation to code before moving to the next. Don’t try to implement the whole paper at once.
Run the official code. Almost all major RL papers have public code. Compare your implementation to theirs.

Paper Reading Checklist

For each paper you read, fill in:

Problem: What problem does this paper solve?
Key idea: One sentence — what is the novelty?
Algorithm: Can I write the update rule in pseudocode?
Environments: What benchmarks are used?
Key result: What is the headline number/figure?
Hyperparameters: Noted from appendix.
Code: Official code found at ___.

Where to Find RL Papers

arXiv cs.LG / cs.AI — preprints of most RL papers
Papers With Code — papers + code + benchmarks
Semantic Scholar — citation search, “influential papers”
Conference proceedings: NeurIPS, ICML, ICLR, ICRA (robotics) are where most RL work is published

Structure of a Typical RL Paper#

How to Read the Math#

Paper Walkthrough 1: DQN (Mnih et al., 2015)#

Paper Walkthrough 2: PPO (Schulman et al., 2017)#

Paper Walkthrough 3: SAC (Haarnoja et al., 2018)#

Tips for Efficient Paper Reading#

Paper Reading Checklist#

Where to Find RL Papers#