Python

Concepts used in the curriculum and in Preliminary: Python basics — data structures (list, tuple, dict, set), classes and objects, functions, list comprehensions, loops, and conditionals. RL code is full of trajectories, configs, and custom types (agents, buffers), all built from these basics.

Data structures

Choosing the right structure makes code clearer and often faster. In RL you’ll use all four constantly.

List — ordered, mutable. Use for sequences: trajectory of states, batch of indices, rewards per episode. ...
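As a sketch of how these structures show up in RL code — the `Transition` type and the config keys here are illustrative, not names from the curriculum:

```python
from collections import namedtuple

# One step of experience: a named tuple keeps it immutable and ordered
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

# A trajectory is a list of transitions (ordered, mutable: we append as we go)
trajectory = [
    Transition([0.0, 0.1], 1, 0.0, [0.1, 0.1]),
    Transition([0.1, 0.1], 0, 1.0, [0.1, 0.2]),
]

# A dict works well for configs and per-state statistics
config = {"gamma": 0.9, "epsilon": 0.1, "n_episodes": 100}

# A set tracks visited states (elements must be hashable, so use tuples)
visited = {tuple(t.state) for t in trajectory}

# List comprehension: rewards per step
rewards = [t.reward for t in trajectory]
print(rewards)       # [0.0, 1.0]
print(len(visited))  # 2
```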

March 10, 2026 · 9 min · 1810 words · codefrydev

NumPy

Used in Preliminary: NumPy and throughout the curriculum for state/observation arrays, reward vectors, and batch operations. RL environments return observations as arrays; neural networks consume batches of arrays — NumPy is the standard bridge.

Why NumPy matters for RL

Arrays — States and observations are vectors or matrices; rewards over time are 1D arrays. np.zeros(), np.array(), np.arange() are used constantly.
Indexing and slicing — Extract rows/columns, mask by condition, gather batches. Fancy indexing appears in replay buffers and minibatches.
Broadcasting — Apply operations across shapes without writing loops (e.g. subtract mean from a batch).
Random — np.random for \(\epsilon\)-greedy, environment stochasticity, and reproducible seeds.
Math — np.sum, np.mean, dot products, element-wise ops. No need for Python loops over elements.

Core concepts with examples

Creating arrays

```python
import numpy as np

# Preallocate for states (e.g. 4D state for CartPole)
state = np.zeros(4)
state = np.array([0.1, -0.2, 0.05, 0.0])

# Grid of values (e.g. for value function over 2D grid)
grid = np.zeros((3, 3))
grid[0] = [1, 2, 3]

# Ranges and linspace
steps = np.arange(0, 1000, 1)  # 0, 1, ..., 999
x = np.linspace(0, 1, 11)      # 11 points from 0 to 1
```

Shape, reshape, and batch dimension

```python
arr = np.array([[1, 2], [3, 4], [5, 6]])  # shape (3, 2)
batch = arr.reshape(1, 3, 2)              # (1, 3, 2) for "1 sample"
flat = arr.flatten()                      # (6,)
```

Indexing and slicing

```python
# Slicing: first two rows, all columns
arr[:2, :]

# Last row
arr[-1, :]

# Boolean mask: rows where first column > 2
mask = arr[:, 0] > 2
arr[mask]

# Integer indexing: rows 0 and 2
arr[[0, 2], :]
```

Broadcasting and element-wise ops

```python
# Subtract mean from each column
X = np.random.randn(32, 4)  # 32 samples, 4 features
X_centered = X - X.mean(axis=0)

# Element-wise product (e.g. importance weights)
a = np.array([1.0, 2.0, 0.5])
b = np.array([1.0, 1.0, 2.0])
a * b  # array([1., 2., 1.])
```

Random and seeding

```python
np.random.seed(42)

# Unit Gaussian (for bandit rewards, noise)
samples = np.random.randn(10)

# Uniform [0, 1)
u = np.random.rand(5)

# Random integers in [low, high)
action = np.random.randint(0, 4)  # one of 0, 1, 2, 3
```

Useful reductions

```python
arr = np.array([[1, 2], [3, 4], [5, 6]])
arr.sum()            # 21
arr.sum(axis=0)      # [9, 12]
arr.mean(axis=1)     # [1.5, 3.5, 5.5]
np.max(arr, axis=0)  # [5, 6]
```

Worked examples

Example 1 — Discounted return (Exercise 7). Given rewards = np.array([0.0, 0.0, 1.0]) and gamma = 0.9, compute \(G_0 = r_0 + \gamma r_1 + \gamma^2 r_2\) using NumPy. ...
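Example 1 can be sketched with a discount vector built from np.arange — one possible solution, not the only one:

```python
import numpy as np

rewards = np.array([0.0, 0.0, 1.0])
gamma = 0.9

# Discount factors [gamma^0, gamma^1, gamma^2]
discounts = gamma ** np.arange(len(rewards))

# G_0 = sum_t gamma^t * r_t = 0.9^2 * 1.0 = 0.81
G0 = np.sum(discounts * rewards)
print(G0)
```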

March 10, 2026 · 6 min · 1184 words · codefrydev

Pandas

Useful for logging training metrics (rewards per episode, loss curves), loading small datasets, and analyzing results. Many curriculum exercises ask you to “plot the sum of rewards per episode” — storing those in a DataFrame keeps things tidy and easy to export.

Why Pandas matters for RL

DataFrame — Tabular data: one row per episode or per step, with columns like episode, reward, length, loss. Easy to filter, aggregate, and plot.
Series — 1D data (e.g. reward per episode). Rolling mean, describe(), and plot.
I/O — to_csv, read_csv for saving/loading runs and sharing results.
Grouping and aggregation — Mean reward per run, per algorithm, or per seed.

Core concepts with examples

Building a DataFrame from lists

```python
import pandas as pd

episodes = list(range(10))
rewards = [1, 2, 3, 2, 4, 3, 5, 4, 6, 5]
lengths = [10, 12, 15, 11, 20, 14, 22, 18, 25, 21]

df = pd.DataFrame({
    "episode": episodes,
    "reward": rewards,
    "length": lengths,
})
```

Basic columns and selection

```python
# Add a column (e.g. running mean)
df["reward_smooth"] = df["reward"].rolling(window=3, min_periods=1).mean()

# Select rows where reward > 4
df[df["reward"] > 4]

# Select columns
df[["episode", "reward"]]
```

Describe and summary stats

```python
df["reward"].mean()
df["reward"].std()
df.describe()  # count, mean, std, min, 25%, 50%, 75%, max per column
```

Save and load

```python
df.to_csv("rewards.csv", index=False)
df_loaded = pd.read_csv("rewards.csv")
```

Plotting from a DataFrame

```python
import matplotlib.pyplot as plt

ax = df.plot(x="episode", y="reward", label="reward")
df.plot(x="episode", y="reward_smooth", label="smoothed", ax=ax)
plt.xlabel("Episode")
plt.show()
```

Exercises

Exercise 1. Create a DataFrame with columns episode (0 to 99) and reward, where reward is 10 + episode * 0.1 + noise with noise from np.random.randn(100). Add a column reward_ma5 that is the 5-step moving average of reward. Use rolling(5, min_periods=1).mean(). ...
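Exercise 1 might be sketched like this (the seed is an arbitrary choice; min_periods=1 keeps the first rows defined instead of NaN):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
noise = np.random.randn(100)

df = pd.DataFrame({
    "episode": np.arange(100),
    "reward": 10 + np.arange(100) * 0.1 + noise,
})

# 5-step moving average; min_periods=1 avoids NaN at the start
df["reward_ma5"] = df["reward"].rolling(5, min_periods=1).mean()

print(df.shape)  # (100, 3)
print(df[["episode", "reward", "reward_ma5"]].head())
```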

March 10, 2026 · 4 min · 764 words · codefrydev

Visualization & Plotting for RL

This page ties together when and what to plot in reinforcement learning, how to read common charts, and which tool to use: Matplotlib for Python scripts and notebooks, or Chart.js for interactive web demos and dashboards. Why visualization matters in RL RL training is noisy: a single run can look good or bad by chance. Plots let you see trends (is return going up?), variance (how stable is learning?), and comparisons (which algorithm or hyperparameter is better?). Every curriculum chapter that asks you to “plot the learning curve” is training you to diagnose and communicate results. ...
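As one way to see trend, variance, and comparison in a single chart, here is a minimal Matplotlib sketch that plots the mean return over several synthetic seeds with a ±1 std band — the data and run count are made up for illustration, and the Agg backend is an assumption so the script runs headless:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
episodes = np.arange(200)

# 5 synthetic "runs": same upward trend, different noise per seed
runs = np.stack([0.05 * episodes + rng.normal(0, 2, size=200) for _ in range(5)])

mean = runs.mean(axis=0)
std = runs.std(axis=0)

plt.plot(episodes, mean, label="mean over 5 seeds")
plt.fill_between(episodes, mean - std, mean + std, alpha=0.2, label="±1 std")
plt.xlabel("Episode")
plt.ylabel("Return")
plt.legend()
plt.savefig("learning_curve_band.png", dpi=150)
```

The band makes run-to-run variance visible at a glance, which a single noisy curve hides.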

March 10, 2026 · 5 min · 889 words · codefrydev

Matplotlib

Used in many chapter exercises to plot average reward over time, value functions, policy comparisons, and hyperparameter heatmaps. A clear plot often reveals convergence or instability at a glance.

Why Matplotlib matters for RL

Line plots — Reward vs episode, loss vs step, value vs state. The default plt.plot(x, y).
Multiple curves — Overlay several runs or algorithms; use label and legend().
Subplots — Several panels in one figure (e.g. reward, length, loss).
Heatmaps — Value function over 2D state space; grid search over \(\alpha\) and \(\epsilon\).
Saving — plt.savefig("curve.png", dpi=150) for reports and slides.

Core concepts with examples

Single line plot

```python
import matplotlib.pyplot as plt
import numpy as np

episodes = np.arange(100)
rewards = 0.1 * episodes + 0.5 + np.random.randn(100) * 0.5

plt.figure(figsize=(8, 4))
plt.plot(episodes, rewards, alpha=0.7, label="raw")
plt.xlabel("Episode")
plt.ylabel("Cumulative reward")
plt.title("Learning curve")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```

Smoothed curve (moving average)

```python
window = 10
smooth = np.convolve(rewards, np.ones(window) / window, mode="valid")
x_smooth = np.arange(len(smooth))
plt.plot(episodes, rewards, alpha=0.3, label="raw")
plt.plot(x_smooth, smooth, label=f"MA-{window}")
plt.legend()
plt.show()
```

Subplots: two panels

```python
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(8, 6))
ax1.plot(episodes, rewards)
ax1.set_ylabel("Reward")
ax1.set_title("Reward per episode")
ax2.plot(episodes, np.cumsum(rewards))
ax2.set_ylabel("Cumulative reward")
ax2.set_xlabel("Episode")
plt.tight_layout()
plt.show()
```

Heatmap (e.g. value function or grid search)

```python
# 4x4 value grid
V = np.random.randn(4, 4)
plt.imshow(V, cmap="viridis")
plt.colorbar(label="V(s)")
plt.xlabel("col")
plt.ylabel("row")
plt.title("State value function")
plt.show()
```

Saving

```python
plt.savefig("learning_curve.png", dpi=150, bbox_inches="tight")
plt.close()
```

Exercises

Exercise 1. Plot a line of \(y = x^2\) for \(x\) in \([0, 5]\) with 50 points. Add labels “x” and “y”, a title “y = x²”, and a grid. Save the figure as parabola.png. ...
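A possible solution sketch for Exercise 1 (the Agg backend is an assumption so the script runs headless; drop that line in a notebook):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 50)  # 50 points in [0, 5]
y = x ** 2

plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("y = x²")
plt.grid(True)
plt.savefig("parabola.png", dpi=150)
```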

March 10, 2026 · 4 min · 803 words · codefrydev

PyTorch

Used in Preliminary: PyTorch basics and in the curriculum for DQN, policy gradients, actor-critic, PPO, and SAC. PyTorch’s define-by-run style and clear autograd make it a natural fit for custom RL loss functions.

Why PyTorch matters for RL

Tensors — States, actions, and batches are tensors. torch.tensor(), requires_grad=True, and .to(device) are daily use.
Autograd — Policy gradient and value losses need gradients; backward() and .grad are central.
nn.Module — Q-networks, policy networks, and critics are nn.Module subclasses; parameters are collected for optimizers.
Optimizers — torch.optim.Adam, zero_grad(), loss.backward(), optimizer.step().
Device — Move model and data to GPU with .to(device) for faster training.

Core concepts with examples

Tensors and gradients

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x**2
y.backward()
print(x.grad)  # 4.0
```

Batches and shapes

```python
# Batch of 32 states, 4 features (e.g. CartPole)
states = torch.randn(32, 4)

# Linear layer: 4 -> 64
W = torch.randn(4, 64, requires_grad=True)
out = states @ W  # (32, 64)
```

Simple MLP with nn.Module

```python
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

q = QNetwork()
s = torch.randn(8, 4)  # batch of 8
q_vals = q(s)          # (8, 2)
```

Training step (e.g. MSE loss)

```python
optimizer = torch.optim.Adam(q.parameters(), lr=1e-3)
targets = torch.randn(8, 2)

loss = nn.functional.mse_loss(q_vals, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Device and CPU/GPU

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
q = q.to(device)
states = states.to(device)
```

Worked examples

Example 1 — Autograd (Exercise 1). Create \(x = 3.0\) with requires_grad=True, compute \(y = x^3 + 2x\), call y.backward(), and verify x.grad. ...
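A sketch of Example 1: for \(y = x^3 + 2x\), autograd should recover \(dy/dx = 3x^2 + 2\), which is \(3 \cdot 9 + 2 = 29\) at \(x = 3\):

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x**3 + 2 * x
y.backward()

# Analytic gradient: dy/dx = 3x^2 + 2 = 29 at x = 3
print(x.grad)  # tensor(29.)
```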

March 10, 2026 · 5 min · 1052 words · codefrydev

TensorFlow

Alternative to PyTorch for implementing DQN, policy gradients, and other deep RL algorithms. The Keras API provides layers and optimizers; GradientTape gives full control over custom loss functions (e.g. policy gradient, CQL).

Why TensorFlow matters for RL

Keras API — tf.keras.Sequential, tf.keras.Model, layers (Dense, Conv2D). Quick prototyping of Q-networks and policies.
Gradient tape — tf.GradientTape() records operations so you can compute gradients of any scalar with respect to trainable variables. Essential for policy gradient and custom losses.
Optimizers — tf.keras.optimizers.Adam, apply_gradients.
Device placement — GPU via tf.config when available.

Core concepts with examples

Dense layers and Sequential model

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2),  # Q-values for 2 actions
])
model.build(input_shape=(None, 4))
```

Forward pass and MSE loss

```python
states = tf.random.normal((32, 4))
q_values = model(states)
targets = tf.random.normal((32, 2))
loss = tf.reduce_mean((q_values - targets) ** 2)
```

Training step with GradientTape

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

@tf.function
def train_step(states, targets):
    with tf.GradientTape() as tape:
        q_values = model(states)
        loss = tf.reduce_mean((q_values - targets) ** 2)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

loss_val = train_step(states, targets)
```

Subclassing for custom models

```python
class QNetwork(tf.keras.Model):
    def __init__(self, n_actions=2):
        super().__init__()
        self.d1 = tf.keras.layers.Dense(64, activation="relu")
        self.d2 = tf.keras.layers.Dense(64, activation="relu")
        self.out = tf.keras.layers.Dense(n_actions)

    def call(self, x):
        x = self.d1(x)
        x = self.d2(x)
        return self.out(x)
```

Exercises

Exercise 1. Create a Sequential model with one hidden layer (64 units, ReLU) and output dimension 2. Build it with input_shape=(4,). Call model(tf.random.normal((10, 4))) and print the output shape. Then use model.summary() to inspect parameters. ...

March 10, 2026 · 4 min · 782 words · codefrydev

OpenAI Gym / Gymnasium

The curriculum uses Gym-style environments (e.g. Blackjack, Cliff Walking, CartPole, LunarLander). Gymnasium is the maintained fork of OpenAI Gym. The same API appears in many exercises: reset, step, observation and action spaces.

Why Gym matters for RL

API — env.reset() returns (obs, info); env.step(action) returns (obs, reward, terminated, truncated, info). Episodes run until terminated or truncated.
Spaces — env.observation_space and env.action_space describe shape and type (Discrete, Box). You need them to build networks and to sample random actions.
Wrappers — Record episode stats, normalize observations, stack frames, or limit time steps without changing the base env.
Seeding — Reproducibility via env.reset(seed=42) and env.action_space.seed(42).

Core concepts with examples

Basic loop: reset and step

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    total_reward += reward
env.close()
print("Episode return:", total_reward)
```

Inspecting spaces

```python
print(env.observation_space)  # Box(4,) for CartPole
print(env.action_space)       # Discrete(2)

# Sample actions
action = env.action_space.sample()

# For Box (continuous): low, high, shape
# env.observation_space.low, .high, .shape
```

Multiple episodes

```python
n_episodes = 10
returns = []
for ep in range(n_episodes):
    obs, info = env.reset()
    done = False
    G = 0
    while not done:
        action = env.action_space.sample()
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        G += reward
    returns.append(G)
env.close()
print("Mean return:", sum(returns) / len(returns))
```

Wrappers: record episode stats

```python
from gymnasium.wrappers import RecordEpisodeStatistics

env = gym.make("CartPole-v1")
env = RecordEpisodeStatistics(env)
obs, info = env.reset()
# ... run episode ...
# After the step that ends an episode, info may contain "episode": {"r": ..., "l": ...}
```

Seeding for reproducibility

```python
env.reset(seed=0)
env.action_space.seed(0)
# Same sequence of random actions and (with a deterministic env) same trajectory
```

Exercises

Exercise 1. Create a CartPole-v1 environment. Call reset(seed=42) and then take 10 random actions with action_space.sample(), calling step each time. Print the observation shape and the cumulative reward after 10 steps. Close the env. ...

March 10, 2026 · 5 min · 929 words · codefrydev

Other Libraries

Optional tools you may encounter or use alongside the curriculum: JAX for fast autograd and JIT, Stable-Baselines3 for ready-made algorithms, and Weights & Biases for experiment tracking. There is no need to master these before starting; refer back when an exercise or chapter mentions them.

JAX

What: Autograd and JIT compilation; functional style; used in research (Brax, RLax, many papers).
Concepts: jax.grad, jax.jit, jax.vmap; arrays similar to NumPy. GPU/TPU without explicit device code.
When: Chapters or papers that use JAX-based envs or algorithms.
Docs: jax.readthedocs.io. ...

March 10, 2026 · 4 min · 654 words · codefrydev

Machine Learning and AI Prerequisite Roadmap (pt 1–2)

Learning objectives

See the recommended order of topics before (or alongside) RL: math, programming, and optional supervised learning. Know what this curriculum assumes and where to fill gaps.

Prerequisite roadmap (overview)

Pt 1 — Foundations

Programming: Variables, types, conditionals, loops, functions, basic data structures (lists, dicts). Language: Python. If you have no programming experience, start with the Learning path Phase 0 and Prerequisites: Python.
Probability and statistics: Sample mean, variance, expectation, law of large numbers. Used in bandits, Monte Carlo, and value functions. See Math for RL: Probability.
Linear algebra: Vectors, dot product, matrices, matrix-vector product. Used in value approximation \(V(s) = w^T \phi(s)\) and gradients. See Math for RL: Linear algebra.
Calculus: Derivatives, chain rule, partial derivatives. Used in policy gradients and loss minimization. See Math for RL: Calculus.
NumPy (and optionally Pandas, Matplotlib): Arrays, indexing, random numbers, plotting. See Prerequisites: NumPy, Matplotlib, Pandas.

Pt 2 — Toward deep RL ...
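The linear value approximation \(V(s) = w^T \phi(s)\) mentioned above amounts to a single dot product. A minimal NumPy sketch, with made-up features and weights purely for illustration:

```python
import numpy as np

# Hypothetical 3-dimensional feature vector phi(s) for one state,
# and a learned weight vector w (values chosen arbitrarily)
phi_s = np.array([1.0, 0.5, -0.2])
w = np.array([0.3, 1.0, 2.0])

# V(s) = w^T phi(s) = 0.3*1.0 + 1.0*0.5 + 2.0*(-0.2) ≈ 0.4
V_s = w @ phi_s
print(V_s)
```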

March 10, 2026 · 2 min · 320 words · codefrydev