Chapter 19: Hyperparameter Tuning in Tabular RL

Learning objectives

- Run a grid search over learning rate \(\alpha\) and exploration \(\epsilon\) for Q-learning.
- Aggregate results over multiple trials (e.g. mean reward per episode) and visualize with a heatmap.
- Interpret which hyperparameter combinations work best and why.

Concept and real-world RL

Hyperparameters (e.g. \(\alpha\), \(\epsilon\), \(\gamma\)) strongly affect learning speed and final performance. Grid search tries every combination in a predefined set; it is simple but costly when there are many parameters. In practice, RL tuning often uses grid search for 2–3 key parameters, or Bayesian optimization / bandit-based tuning for larger spaces. Reporting mean and std over multiple seeds is essential because RL is noisy. Heatmaps (e.g. \(\alpha\) vs \(\epsilon\) with color = mean reward) make good and bad regions visible at a glance. ...
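The grid-search-with-seeds loop can be sketched as follows; `run_q_learning` here is a hypothetical stand-in returning mean reward per episode, to be replaced by an actual training run:

```python
import itertools
import random
import statistics

def run_q_learning(alpha, epsilon, seed):
    """Stand-in for a real training run: returns mean reward per episode.
    Replace with your own Q-learning loop (e.g. on FrozenLake)."""
    rng = random.Random(hash((alpha, epsilon, seed)))
    # Pretend (alpha=0.1, epsilon=0.1) is the sweet spot, plus seed noise.
    return 1.0 - abs(alpha - 0.1) - abs(epsilon - 0.1) + rng.gauss(0, 0.01)

alphas = [0.01, 0.1, 0.5]
epsilons = [0.01, 0.1, 0.3]
seeds = range(5)

# One cell per (alpha, epsilon): mean and std over seeds -> heatmap data.
results = {}
for alpha, eps in itertools.product(alphas, epsilons):
    scores = [run_q_learning(alpha, eps, s) for s in seeds]
    results[(alpha, eps)] = (statistics.mean(scores), statistics.stdev(scores))

best = max(results, key=lambda k: results[k][0])
print("best (alpha, epsilon):", best)
```

The `results` dict is exactly the 2-D grid a heatmap plots: rows indexed by \(\alpha\), columns by \(\epsilon\), color by mean reward.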

March 10, 2026 · 3 min · 608 words · codefrydev

Chapter 20: The Limits of Tabular Methods

Learning objectives

- Estimate memory for a tabular Q-table (states × actions × bytes per entry).
- Relate the scale of real problems (e.g. Backgammon, continuous state) to the infeasibility of tables.
- Argue why function approximation (linear, neural) is necessary for large or continuous spaces.

Concept and real-world RL

Tabular methods store one value per state (or state–action pair). When the state space is huge or continuous, this is impossible: Backgammon has on the order of \(10^{20}\) states; a robot with 10 continuous state variables discretized to 100 bins each has \(100^{10}\) cells. ...
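The memory estimate is a one-line multiplication; a small helper (assuming 8-byte float entries) makes the infeasibility concrete:

```python
def q_table_bytes(n_states, n_actions, bytes_per_entry=8):
    """Memory for a dense Q-table: one float per state-action pair."""
    return n_states * n_actions * bytes_per_entry

# A tiny gridworld is trivial:
print(q_table_bytes(16, 4))            # 512 bytes

# 10 continuous variables, 100 bins each -> 100**10 discrete states.
states = 100 ** 10
print(q_table_bytes(states, 4) / 1e18, "exabytes")  # hopeless for a table
```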

March 10, 2026 · 4 min · 645 words · codefrydev

Chapter 21: Linear Function Approximation

Learning objectives

- Represent the action-value function as \(Q(s,a;w) = w^T \phi(s,a)\) with a feature vector \(\phi\).
- Use tile coding (overlapping grid tilings) to produce binary features for continuous state (e.g. MountainCar).
- Implement semi-gradient SARSA: update \(w\) using the TD target with current \(Q\) for the next state.

Concept and real-world RL

Linear function approximation approximates \(Q(s,a) \approx w^T \phi(s,a)\). The weights \(w\) are learned from data; \(\phi(s,a)\) is a fixed or hand-designed feature map. Tile coding partitions the state space into overlapping tilings; each tiling is a grid, and the feature vector has a 1 for each tile that contains the state (and the action), so we get a sparse binary vector. This allows generalization across similar states. Semi-gradient methods use the TD target but treat the next-state value as a constant when taking the gradient (no backprop through the target). Linear FA is the simplest form of value approximation and appears in legacy RL and as a baseline. ...
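A minimal sketch of both pieces, for a 2-D state in \([0,1)^2\) and a single action (a full agent would keep one weight block per action); the tiling sizes are illustrative assumptions:

```python
def tile_indices(x, y, n_tilings=4, bins=8, lo=0.0, hi=1.0):
    """Index of the one active tile in each of n_tilings shifted grids.
    The sparse binary feature vector phi has a 1 at exactly these indices."""
    width = (hi - lo) / bins
    idxs = []
    for t in range(n_tilings):
        offset = t * width / n_tilings          # each tiling shifted slightly
        i = min(int((x - lo + offset) / width), bins - 1)
        j = min(int((y - lo + offset) / width), bins - 1)
        idxs.append(t * bins * bins + i * bins + j)
    return idxs

def q_value(w, active):
    """Q = w^T phi; with binary sparse phi this is a sum of active weights."""
    return sum(w[i] for i in active)

def sarsa_update(w, active, reward, next_active, alpha=0.1, gamma=0.99,
                 done=False):
    """Semi-gradient SARSA: the target is treated as a constant."""
    target = reward + (0.0 if done else gamma * q_value(w, next_active))
    td_error = target - q_value(w, active)
    for i in active:            # gradient of w^T phi is phi: 1 on active tiles
        w[i] += alpha * td_error
```

Because each state activates one tile per tiling, every update touches `n_tilings` weights and generalizes to all states sharing those tiles.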

March 10, 2026 · 3 min · 606 words · codefrydev

Feature Engineering for Reinforcement Learning

Learning objectives

- Choose or design feature vectors \(\phi(s)\) or \(\phi(s,a)\) for linear \(V(s) = w^T \phi(s)\) or \(Q(s,a) = w^T \phi(s,a)\).
- Use tile coding, polynomial features, and normalization appropriately.
- Understand how feature choice affects generalization and learning speed.

Why features matter

In linear function approximation, we approximate \(V(s) \approx w^T \phi(s)\) or \(Q(s,a) \approx w^T \phi(s,a)\). The feature vector \(\phi\) determines what the function can represent. Good features capture the right structure (e.g. similar states get similar values) and keep the dimension manageable so that learning is stable and sample-efficient. ...
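Two of the mentioned building blocks, sketched with plain Python: min-max normalization to \([0,1]\), and polynomial features (all monomials up to a degree, plus a bias term):

```python
import itertools

def normalize(s, lows, highs):
    """Min-max scale each state component to [0, 1]."""
    return [(x - lo) / (hi - lo) for x, lo, hi in zip(s, lows, highs)]

def poly_features(s, degree=2):
    """Bias plus all monomials of the state components up to `degree`.
    For s = (x, y), degree 2: [1, x, y, x^2, xy, y^2]."""
    feats = [1.0]
    for d in range(1, degree + 1):
        for combo in itertools.combinations_with_replacement(range(len(s)), d):
            term = 1.0
            for i in combo:
                term *= s[i]
            feats.append(term)
    return feats
```

Normalizing first matters: without it, higher-degree terms of large raw components dominate the feature vector and destabilize the learning rate.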

March 10, 2026 · 2 min · 400 words · codefrydev

CartPole

Learning objectives

- Understand the CartPole environment: state (cart position, velocity, pole angle, pole angular velocity), actions (left/right), and reward (+1 per step until termination).
- Implement a solution using linear function approximation (e.g. tile coding or simple features) and semi-gradient SARSA or Q-learning.
- Optionally solve with a small neural network (e.g. DQN-style) as in later chapters.

What is CartPole?

CartPole (also called Inverted Pendulum) is a classic control task in OpenAI Gym / Gymnasium. A pole is attached to a cart that moves on a track.

- State (continuous): cart position \(x\), cart velocity \(\dot{x}\), pole angle \(\theta\), pole angular velocity \(\dot{\theta}\).
- Actions (discrete): 0 = push left, 1 = push right.
- Reward: +1 for every step until the episode ends.
- Termination: the pole angle goes outside a range (e.g. \(\pm 12°\)), the cart leaves the track (if bounded), or a max step count is reached (e.g. 500).

The goal is to keep the pole upright as long as possible (maximize total reward = number of steps). ...
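A minimal sketch of the linear-Q side, without the environment: a hand-made feature vector for the 4-D state (the scale constants are rough assumptions based on CartPole's typical bounds), per-action weight vectors, and \(\epsilon\)-greedy selection:

```python
import random

def features(state):
    """Bias plus the 4 state components, roughly scaled to [-1, 1]."""
    x, x_dot, theta, theta_dot = state
    return [1.0, x / 2.4, x_dot / 3.0, theta / 0.21, theta_dot / 3.0]

def q_values(w, state):
    """One linear Q per action: Q(s, a) = w_a . phi(s).
    w is a list of two weight vectors (action 0 = left, 1 = right)."""
    phi = features(state)
    return [sum(wi * fi for wi, fi in zip(w_a, phi)) for w_a in w]

def epsilon_greedy(w, state, epsilon, rng=random):
    if rng.random() < epsilon:
        return rng.randrange(2)
    q = q_values(w, state)
    return max(range(2), key=q.__getitem__)
```

Plugged into a Gymnasium loop, `epsilon_greedy` picks the action each step and the semi-gradient update from the previous chapters adjusts `w[a]` along `features(state)`.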

March 10, 2026 · 3 min · 451 words · codefrydev

Chapter 22: Artificial Neural Networks for RL

Learning objectives

- Build a feedforward neural network that maps state to Q-values (one output per action) in PyTorch.
- Implement the forward pass and an MSE loss between predicted Q-values and targets.
- Understand how this network will be used in DQN (next chapter): TD target and gradient update.

Concept and real-world RL

Neural networks as function approximators let us represent \(Q(s,a)\) (or \(Q(s)\) with one output per action) for high-dimensional or continuous state spaces. The network takes the state (and optionally the action) as input and outputs values; we train it by minimizing TD error (e.g. MSE between predicted Q and target \(r + \gamma \max_{a'} Q(s',a')\)). This is the core of Deep Q-Networks (DQN) and many other deep RL algorithms. In practice, we use MLPs for low-dim state (e.g. CartPole) and CNNs for images (e.g. Atari). ...
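To make the computation concrete without any framework, here is the forward pass and MSE loss that a small PyTorch MLP would perform, written out in plain Python (two layers, ReLU hidden activation; the chapter itself uses `torch.nn`):

```python
def mlp_forward(state, W1, b1, W2, b2):
    """Two-layer MLP: state -> hidden (ReLU) -> one Q-value per action."""
    h = [max(0.0, sum(w * x for w, x in zip(row, state)) + b)
         for row, b in zip(W1, b1)]
    return [sum(w * hj for w, hj in zip(row, h)) + b for row, b in zip(W2, b2)]

def mse(pred, target):
    """Mean squared error between predicted Q-values and TD targets."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
```

In PyTorch the same thing is a stack of `nn.Linear` layers with `nn.MSELoss`, and the gradient step that this sketch omits is handled by autograd and an optimizer.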

March 10, 2026 · 3 min · 555 words · codefrydev

Chapter 23: Deep Q-Networks (DQN)

Learning objectives

- Implement full DQN: Q-network, target network, replay buffer, \(\epsilon\)-greedy, and the TD loss (MSE to target \(r + \gamma \max_{a'} Q_{target}(s',a')\)).
- Update the target network periodically (e.g. every 100 steps) by copying the online Q-network.
- Train on CartPole and plot reward per episode.

Concept and real-world RL

DQN combines a neural network for Q-values with experience replay (store transitions, sample random minibatches to break correlation) and a target network (separate copy of the network used in the TD target, updated periodically, to stabilize learning). The agent acts \(\epsilon\)-greedy, stores \((s,a,r,s',\text{done})\) in the buffer, and repeatedly samples a batch, computes targets using the target network, and updates the online network by minimizing MSE. DQN was the first major deep RL success (Atari) and is still a standard baseline for discrete-action tasks. ...
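The target computation at the heart of the training step can be sketched framework-free; `q_target` is assumed to be any callable mapping a state to a list of Q-values (in the real agent, the target network's forward pass):

```python
def dqn_targets(batch, q_target, gamma=0.99):
    """TD targets y = r + gamma * max_a' Q_target(s', a') for a minibatch
    of (s, a, r, s', done) transitions; no bootstrap on terminal states."""
    targets = []
    for (s, a, r, s_next, done) in batch:
        bootstrap = 0.0 if done else gamma * max(q_target(s_next))
        targets.append(r + bootstrap)
    return targets
```

The online network is then regressed (MSE) toward these targets at the taken actions, while `q_target` stays frozen between periodic copies.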

March 10, 2026 · 3 min · 545 words · codefrydev

Chapter 24: Experience Replay

Learning objectives

- Implement a replay buffer that stores transitions \((s, a, r, s', \text{done})\) with a fixed capacity.
- Use a circular buffer (overwrite oldest when full) and random sampling for minibatches.
- Test the buffer with random data and verify shapes and sampling behavior.

Concept and real-world RL

Experience replay stores past transitions and samples random minibatches for training. It breaks the correlation between consecutive samples (which would cause unstable updates if we trained only on the last transition) and reuses data for sample efficiency. DQN and many off-policy algorithms rely on it. The buffer is usually a circular buffer: when full, new transitions overwrite the oldest. Sampling uniformly at random (or with prioritization in advanced variants) gives unbiased minibatches. In practice, buffer size is a hyperparameter (e.g. 10k–1M); too small limits diversity, too large uses more memory and can slow learning if the policy has changed a lot. ...
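A minimal circular buffer with uniform sampling might look like this (a `collections.deque` with `maxlen` is a common alternative):

```python
import random

class ReplayBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.pos = 0                         # next write index (circular)

    def push(self, s, a, r, s_next, done):
        item = (s, a, r, s_next, done)
        if len(self.data) < self.capacity:
            self.data.append(item)
        else:
            self.data[self.pos] = item       # full: overwrite the oldest
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        """Uniform random minibatch (without replacement)."""
        return random.sample(self.data, batch_size)

    def __len__(self):
        return len(self.data)
```

Training code typically waits until `len(buffer)` exceeds some warm-up threshold before sampling the first minibatch.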

March 10, 2026 · 3 min · 596 words · codefrydev

Chapter 25: Target Networks

Learning objectives

- Implement hard target updates: copy online network parameters to the target network every \(N\) steps.
- Implement soft target updates: \(\theta_{target} \leftarrow (1-\tau)\,\theta_{target} + \tau\,\theta_{online}\) each step (or each update).
- Compare stability of Q-value estimates and learning curves for both update rules.

Concept and real-world RL

The target network in DQN provides a stable TD target: we use \(Q_{target}(s',a')\) instead of \(Q(s',a')\) so that the target does not change every time we update the online network, which would cause moving targets and instability. Hard update: copy the full parameters every \(N\) steps (classic DQN). Soft update: slowly track the online network, \(\theta_{target} \leftarrow (1-\tau)\,\theta_{target} + \tau\,\theta_{online}\) with small \(\tau\) (e.g. 0.001). Soft updates change the target every step but by a small amount, often yielding smoother learning. Both are used in practice (e.g. DDPG uses soft updates). ...
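Both update rules are one-liners when parameters are viewed as flat lists of floats (in PyTorch one would iterate over `state_dict()` entries instead):

```python
def hard_update(target, online):
    """Classic DQN: copy online parameters into the target every N steps."""
    target[:] = online

def soft_update(target, online, tau=0.001):
    """Polyak averaging: theta_target <- (1 - tau)*theta_target + tau*theta_online.
    Small tau means the target moves every step, but only slightly."""
    for i, w in enumerate(online):
        target[i] = (1.0 - tau) * target[i] + tau * w
```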

March 10, 2026 · 3 min · 596 words · codefrydev

Chapter 26: Double DQN (DDQN)

Learning objectives

- Implement Double DQN: use the online network to choose \(a^* = \arg\max_a Q_{online}(s',a)\), then use \(Q_{target}(s', a^*)\) as the TD target (instead of \(\max_a Q_{target}(s',a)\)).
- Understand why this reduces overestimation of Q-values (the max of noisy estimates is biased high).
- Compare average Q-values and reward curves with standard DQN on CartPole.

Concept and real-world RL

Standard DQN uses \(y = r + \gamma \max_{a'} Q_{target}(s',a')\). The max over noisy estimates is biased upward (overestimation), which can hurt learning. Double DQN decouples action selection from evaluation: the online network selects \(a^*\), the target network evaluates \(Q_{target}(s', a^*)\). This reduces overestimation and often improves stability and final performance. It is a small code change and is commonly used in modern DQN variants (e.g. Rainbow). ...
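The "small code change" is confined to the target computation; as a sketch, with `q_online` and `q_target` assumed to be callables mapping a state to a list of Q-values:

```python
def double_dqn_target(r, s_next, done, q_online, q_target, gamma=0.99):
    """Double DQN target: y = r + gamma * Q_target(s', argmax_a Q_online(s', a)).
    Online network selects the action; target network evaluates it."""
    if done:
        return r
    q_on = q_online(s_next)
    a_star = max(range(len(q_on)), key=q_on.__getitem__)
    return r + gamma * q_target(s_next)[a_star]
```

Compare with standard DQN, which would use `max(q_target(s_next))`: when the two networks disagree about which action is best, Double DQN no longer takes the target network's own (possibly inflated) maximum.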

March 10, 2026 · 3 min · 523 words · codefrydev