By Phase 5, you can implement standard RL algorithms. Reading the original papers lets you go further: understand design choices, see ablations, and extend methods. This guide teaches you to read RL papers efficiently.


Structure of a Typical RL Paper

Most RL papers follow this structure:

SectionWhat it containsHow much to read
AbstractSummary of problem, method, and main resultAlways, first
IntroductionMotivation, related work, contributionsSkim on first read
BackgroundMDP formulation, notation, prerequisitesRead if new notation
MethodThe algorithm (often the core section)Read carefully
ExperimentsEnvironments, baselines, results, ablationsRead for main results
ConclusionSummary and future workSkim
AppendixHyperparameters, proofs, extra experimentsReference as needed

First read strategy: Abstract → Method → Experiments (main figure) → Introduction. Save appendix for implementation.


How to Read the Math

RL papers use consistent notation. Map it to code:

Paper notationCode equivalentMeaning
s, a, r, s'state, action, reward, next_stateOne transition
π(a|s; θ)policy_net(state)Policy output (probabilities)
V(s; w) or V_w(s)value_net(state)Value function
∇_θ J(θ)loss.backward(); optimizer.step()Policy gradient update
E_π[…]mean([... for ep in episodes])Expectation under π
τtrajectoryA list of (s,a,r) tuples
Tmax_steps or episode_lengthHorizon
δ_ttd_errorTemporal difference error
γgammaDiscount factor

Paper Walkthrough 1: DQN (Mnih et al., 2015)

Full title: “Human-level control through deep reinforcement learning”

The core idea (method section in one paragraph): Use a neural network Q(s, a; θ) instead of a Q-table. Train with TD learning (Q-learning target y = r + γ max_{a’} Q(s’, a’; θ⁻)) where θ⁻ are the parameters of a target network (updated less frequently). Sample transitions from an experience replay buffer to break correlations.

Key equations to map to code:

  1. TD target: y = r + gamma * target_net(s_next).max() (when not done)
  2. Loss: L = mean((y - online_net(s)[a])**2) (MSE over batch)
  3. Target network update: target_net.load_state_dict(online_net.state_dict()) every C steps

Hyperparameters to note (from appendix): replay size 1M, batch 32, target update every 10k steps, ε decay 1M steps, learning rate 0.00025.

Common confusion: The paper uses two networks (online and target) but early readers think they’re using one. Check: the loss gradient flows through online_net only, not target_net.


Paper Walkthrough 2: PPO (Schulman et al., 2017)

Full title: “Proximal Policy Optimization Algorithms”

The core idea: Clipped surrogate objective to prevent large policy updates. Collect on-policy data, compute advantages with GAE, update with multiple epochs of mini-batch gradient ascent subject to the clipping constraint.

Key equation to map to code:

L_CLIP(θ) = E_t[min(r_t(θ)·A_t, clip(r_t(θ), 1-ε, 1+ε)·A_t)]

Where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) (probability ratio).

1
2
3
4
# PPO clipped loss (pseudocode)
ratio = new_probs / old_probs.detach()
clipped_ratio = ratio.clamp(1 - epsilon, 1 + epsilon)
loss = -torch.min(ratio * advantages, clipped_ratio * advantages).mean()

Reading the experiments: Table 1 compares PPO to A2C, TRPO on MuJoCo. The key result: PPO matches TRPO’s performance with simpler implementation and better wall-clock time.

What to look for in ablations: Section 5 shows what happens without clipping (performance degrades). This validates the design choice.


Paper Walkthrough 3: SAC (Haarnoja et al., 2018)

Full title: “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”

The core idea: Maximize expected return AND expected entropy. Off-policy (uses replay buffer). The entropy term encourages exploration and robustness. Temperature α controls the trade-off.

Key objective: J(π) = Σ_t E[r(s_t, a_t) + α H(π(·|s_t))]

Reading strategy: This paper has more math than DQN/PPO. Start with the algorithm box (Algorithm 1) to see the training loop, then read the derivations to understand why each step is correct.

Three critics trick: SAC uses two Q-networks and takes the minimum to reduce overestimation. Q = min(Q1(s,a), Q2(s,a)). Learn to spot these practical tricks in the algorithm box.


Tips for Efficient Paper Reading

  1. Read the algorithm box first. Most RL papers have a pseudocode box (Algorithm 1). Read it before the prose — it’s the clearest statement of the method.
  2. Ignore proofs on first read. Theorem statements are useful; proofs are for specialists and can be read later.
  3. Check the hyperparameter table. Always in the appendix. Copy these when implementing — many papers require specific settings.
  4. Read related work last. It’s context, not content. Skim for names of methods you haven’t heard of.
  5. Implement one equation at a time. Map each equation to code before moving to the next. Don’t try to implement the whole paper at once.
  6. Run the official code. Almost all major RL papers have public code. Compare your implementation to theirs.

Paper Reading Checklist

For each paper you read, fill in:

  • Problem: What problem does this paper solve?
  • Key idea: One sentence — what is the novelty?
  • Algorithm: Can I write the update rule in pseudocode?
  • Environments: What benchmarks are used?
  • Key result: What is the headline number/figure?
  • Hyperparameters: Noted from appendix.
  • Code: Official code found at ___.

Where to Find RL Papers

  • arXiv cs.LG / cs.AI — preprints of most RL papers
  • Papers With Code — papers + code + benchmarks
  • Semantic Scholar — citation search, “influential papers”
  • Conference proceedings: NeurIPS, ICML, ICLR, ICRA (robotics) are where most RL work is published