CartPole

Learning objectives:
- Understand the CartPole environment: state (cart position, cart velocity, pole angle, pole angular velocity), actions (left/right), and reward (+1 per step until termination).
- Implement a solution using linear function approximation (e.g. tile coding or simple features) with semi-gradient SARSA or Q-learning.
- Optionally, solve it with a small neural network (DQN-style), as in later chapters.

What is CartPole? CartPole (also called the inverted pendulum) is a classic control task in OpenAI Gym / Gymnasium. A pole is attached to a cart that moves along a track. The state is continuous: cart position \(x\), cart velocity \(\dot{x}\), pole angle \(\theta\), and pole angular velocity \(\dot{\theta}\). Actions are discrete: 0 = push left, 1 = push right. The reward is +1 for every step until the episode ends. The episode ends when the pole angle leaves a range (e.g. \(\pm 12^\circ\)), the cart leaves the track (if bounded), or a maximum step count is reached (e.g. 500). The goal is therefore to keep the pole upright as long as possible (total reward = number of steps). ...
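The linear-function-approximation approach can be sketched in plain Python. This is a minimal, dependency-free illustration (no Gym): the physics step uses the classic CartPole dynamics, the features are just the raw state plus a bias term (simpler than tile coding), and all hyperparameters are illustrative.

```python
import math, random

# Classic CartPole dynamics constants (half-pole length, Euler integration).
GRAVITY, M_CART, M_POLE, LENGTH, FORCE, TAU = 9.8, 1.0, 0.1, 0.5, 10.0, 0.02
TOTAL_M, PML = M_CART + M_POLE, M_POLE * LENGTH

def step(state, action):
    """One physics step; returns (next_state, reward, done)."""
    x, xd, th, thd = state
    force = FORCE if action == 1 else -FORCE
    cos_t, sin_t = math.cos(th), math.sin(th)
    temp = (force + PML * thd ** 2 * sin_t) / TOTAL_M
    thacc = (GRAVITY * sin_t - cos_t * temp) / (
        LENGTH * (4.0 / 3.0 - M_POLE * cos_t ** 2 / TOTAL_M))
    xacc = temp - PML * thacc * cos_t / TOTAL_M
    x, xd = x + TAU * xd, xd + TAU * xacc
    th, thd = th + TAU * thd, thd + TAU * thacc
    done = abs(x) > 2.4 or abs(th) > 12 * math.pi / 180
    return (x, xd, th, thd), 1.0, done

def features(s):
    return [1.0, *s]          # bias + raw state: the simplest feature choice

def q_value(w, s, a):
    return sum(wi * fi for wi, fi in zip(w[a], features(s)))

def train(episodes=200, alpha=0.01, gamma=0.99, eps=0.1, seed=0):
    rng = random.Random(seed)
    w = [[0.0] * 5 for _ in range(2)]    # one weight vector per action
    lengths = []
    for _ in range(episodes):
        s = tuple(rng.uniform(-0.05, 0.05) for _ in range(4))
        for t in range(500):
            a = rng.randrange(2) if rng.random() < eps else \
                max(range(2), key=lambda a_: q_value(w, s, a_))
            s2, r, done = step(s, a)
            target = r if done else r + gamma * max(q_value(w, s2, b) for b in range(2))
            delta = target - q_value(w, s, a)
            for i, fi in enumerate(features(s)):
                w[a][i] += alpha * delta * fi   # semi-gradient Q-learning update
            s = s2
            if done:
                break
        lengths.append(t + 1)
    return w, lengths

weights, lengths = train(episodes=100)
```

Tile coding would replace `features` with a binary vector over overlapping grid tilings of the 4-D state; the update rule stays the same.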

March 10, 2026 · 3 min · 451 words · codefrydev

Chapter 23: Deep Q-Networks (DQN)

Learning objectives:
- Implement full DQN: Q-network, target network, replay buffer, \(\epsilon\)-greedy action selection, and the TD loss (MSE to the target \(r + \gamma \max_{a'} Q_{\text{target}}(s', a')\)).
- Update the target network periodically (e.g. every 100 steps) by copying the online Q-network.
- Train on CartPole and plot the reward per episode.

Concept and real-world RL: DQN combines a neural network for Q-values with experience replay (store transitions and sample random minibatches to break correlations) and a target network (a separate copy of the network used in the TD target, updated periodically, to stabilize learning). The agent acts \(\epsilon\)-greedily, stores \((s, a, r, s', \text{done})\) in the buffer, and repeatedly samples a batch, computes targets with the target network, and updates the online network by minimizing the MSE. DQN was the first major deep RL success (Atari) and is still a standard baseline for discrete-action tasks. ...
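The moving parts of DQN (replay buffer, target network, \(\epsilon\)-greedy with decay, TD targets) can be sketched without any deep-learning dependency by substituting a linear Q-function for the network. The environment here is a stand-in: a trivial 5-state chain where moving right reaches the goal, chosen only to keep the example self-contained.

```python
import random
from collections import deque

def env_step(s, a):
    """Toy 5-state chain: reach state 4 for reward 1 (illustrative stand-in)."""
    s2 = min(max(s + (1 if a == 1 else -1), 0), 4)
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

def q_values(w, s):
    # Linear "Q-network": per-action weight and bias on the scalar state.
    return [w[a][0] * s + w[a][1] for a in range(2)]

def dqn(steps=2000, alpha=0.01, gamma=0.9, sync_every=100, batch=16, seed=0):
    rng = random.Random(seed)
    w = [[0.0, 0.0] for _ in range(2)]      # online weights
    w_tgt = [row[:] for row in w]           # target-network copy
    buf = deque(maxlen=500)                 # replay buffer
    s = 0
    for t in range(steps):
        eps = max(0.1, 1.0 - t / 1000)      # linearly decayed epsilon
        a = rng.randrange(2) if rng.random() < eps else \
            max(range(2), key=lambda a_: q_values(w, s)[a_])
        s2, r, done = env_step(s, a)
        buf.append((s, a, r, s2, done))
        s = 0 if done else s2
        if len(buf) >= batch:
            for bs, ba, br, bs2, bd in rng.sample(list(buf), batch):
                # TD target uses the *target* network, zeroed at terminal states.
                target = br if bd else br + gamma * max(q_values(w_tgt, bs2))
                err = target - q_values(w, bs)[ba]
                w[ba][0] += alpha * err * bs    # one SGD step on the MSE loss
                w[ba][1] += alpha * err
        if t % sync_every == 0:
            w_tgt = [row[:] for row in w]       # periodic hard target update
    return w

w_final = dqn()
```

A real DQN swaps `q_values` for a small MLP (e.g. two hidden layers in PyTorch) and the manual SGD step for an optimizer, but the control flow is the same.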

March 10, 2026 · 3 min · 545 words · codefrydev

Chapter 33: The REINFORCE Algorithm

Learning objectives:
- Implement REINFORCE (Monte Carlo policy gradient): estimate \(\nabla_\theta J\) using the return \(G_t\) from full episodes.
- Use a neural-network policy with a softmax output for discrete actions (e.g. CartPole).
- Observe and explain the high variance of the gradient estimates when using raw returns \(G_t\) (no baseline).

Concept and real-world RL: REINFORCE is the simplest policy gradient algorithm: run an episode under \(\pi_\theta\), compute the return \(G_t\) at each step, and update \(\theta\) with \(\theta \leftarrow \theta + \alpha \sum_t G_t \nabla_\theta \log \pi_\theta(a_t|s_t)\). It is on-policy and Monte Carlo (it needs full episodes). The variance of \(G_t\) can be large, especially in long episodes, which makes learning slow or unstable. In game AI, REINFORCE is a baseline for more advanced methods (actor-critic, PPO); in robot control, it is rarely used alone because of its poor sample efficiency and high variance. Adding a baseline (e.g. a state-value function) reduces variance without introducing bias. ...
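The update rule can be shown on a deliberately tiny problem: a one-state, one-step "episode" where action 1 pays reward 1 and action 0 pays 0, so the return \(G_t\) is just the immediate reward. The softmax policy, step size, and episode count are all illustrative; the key line is the score-function update \(\theta \leftarrow \theta + \alpha\, G\, \nabla_\theta \log \pi_\theta(a)\).

```python
import math, random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce(episodes=2000, alpha=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]                      # one logit per action
    for _ in range(episodes):
        probs = softmax(theta)
        a = 0 if rng.random() < probs[0] else 1
        G = 1.0 if a == 1 else 0.0          # return of this one-step episode
        # Softmax score function: d/d theta_k log pi(a) = 1[k == a] - pi(k)
        for k in range(2):
            grad = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += alpha * G * grad    # REINFORCE update
    return theta

theta = reinforce()
```

Note the variance issue in miniature: when `a == 0` is sampled, `G == 0` and nothing is learned at all; the update relies entirely on which actions happen to be sampled. Subtracting a baseline (e.g. the running average reward) would turn those zero-return samples into informative negative updates.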

March 10, 2026 · 3 min · 602 words · codefrydev

Chapter 36: Advantage Actor-Critic (A2C)

Learning objectives:
- Implement A2C (Advantage Actor-Critic): the actor is updated with the TD error as the advantage; the critic is updated to minimize the TD error.
- Use the TD error \(r + \gamma V(s') - V(s)\) as the advantage (optionally detaching \(V(s')\) from the graph).
- Run multiple environments synchronously to collect a batch of transitions and update on the batch (this further reduces variance).

Concept and real-world RL: A2C is the synchronous version of A3C: the agent runs \(N\) environments in parallel, collects a batch of transitions, and performs one update from the batch. The advantage is the TD error (or the n-step return minus \(V(s)\)). Synchronous batching makes the updates more stable than fully asynchronous A3C. In game AI and robot control, A2C is a simple and effective baseline; it is often used with a shared feature extractor (one backbone with actor and critic heads) to save parameters and improve learning. ...

March 10, 2026 · 3 min · 566 words · codefrydev

Chapter 52: Learning World Models

Learning objectives:
- Collect random trajectories from CartPole and train a neural network to predict the next state given (state, action).
- Evaluate prediction accuracy over 1-, 5-, and 10-step horizons; observe the compounding error as the horizon grows.
- Relate model error to the limitations of long-horizon model-based rollouts.

Concept and real-world RL: A world model (or dynamics model) predicts \(s_{t+1}\) from \(s_t, a_t\). It can be trained on collected data (e.g. with an MSE loss). Errors compound over multi-step rollouts: a small one-step error becomes large after many steps. In robot navigation, learned models are used for short-horizon planning; in game AI (e.g. Dreamer), models operate in a latent space to reduce dimensionality and keep rollouts under control. Understanding compounding error is key to designing model-based algorithms. ...
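The compounding-error experiment can be sketched in a few lines: fit a linear one-step model to a simple nonlinear system (a frictionless pendulum stands in for CartPole here), then roll both the true dynamics and the model forward and compare. The dynamics, feature choice, and hyperparameters are all illustrative.

```python
import math, random

DT = 0.05   # integration step

def true_step(s):
    """Frictionless pendulum: the (nonlinear) ground-truth dynamics."""
    th, om = s
    return (th + DT * om, om - DT * math.sin(th))

def model_step(W, s):
    """Learned *linear* one-step model: s' ~ W @ [th, om, 1]."""
    th, om = s
    return (W[0][0] * th + W[0][1] * om + W[0][2],
            W[1][0] * th + W[1][1] * om + W[1][2])

def fit_model(n=5000, alpha=0.05, seed=0):
    """SGD on the squared one-step prediction error over random states."""
    rng = random.Random(seed)
    W = [[0.0] * 3 for _ in range(2)]
    for _ in range(n):
        s = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        tgt, pred = true_step(s), model_step(W, s)
        feats = (s[0], s[1], 1.0)
        for i in range(2):
            err = tgt[i] - pred[i]
            for j in range(3):
                W[i][j] += alpha * err * feats[j]
    return W

def rollout_error(W, s0, horizon):
    """Euclidean gap between true and model trajectories after `horizon` steps."""
    s_true = s_model = s0
    for _ in range(horizon):
        s_true, s_model = true_step(s_true), model_step(W, s_model)
    return math.hypot(s_true[0] - s_model[0], s_true[1] - s_model[1])

W = fit_model()
e1 = rollout_error(W, (1.0, 0.0), 1)
e10 = rollout_error(W, (1.0, 0.0), 10)
```

Because a linear model cannot represent \(\sin\theta\) exactly, each step leaves a small residual, and feeding model predictions back into the model accumulates those residuals: `e10` comes out well above `e1`. This is exactly why long-horizon model-based rollouts are unreliable and why methods like Dreamer keep rollouts short or work in learned latent spaces.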

March 10, 2026 · 3 min · 442 words · codefrydev