Learning objectives
- Understand the CartPole environment: state (cart position, velocity, pole angle, pole angular velocity), actions (left/right), and reward (+1 per step until termination).
- Implement a solution using linear function approximation (e.g. tile coding or simple features) and semi-gradient SARSA or Q-learning.
- Optionally solve with a small neural network (e.g. DQN-style) as in later chapters.
What is CartPole?
CartPole (also called the inverted pendulum) is a classic control task available in OpenAI Gym / Gymnasium. A pole is attached by a hinge to a cart that moves along a track. The state is continuous: cart position \(x\), cart velocity \(\dot{x}\), pole angle \(\theta\), and pole angular velocity \(\dot{\theta}\). Actions are discrete: 0 = push left, 1 = push right. The reward is +1 for every step until the episode ends. In CartPole-v1 the episode terminates when the pole angle leaves \(\pm 12°\) or the cart position leaves \(\pm 2.4\), and is truncated after 500 steps. The goal is therefore to keep the pole upright as long as possible (maximize total reward = number of steps).
Why use CartPole?
- Simple but non-trivial: Small state space (4D), 2 actions. Good for testing function approximation (linear or small neural net) and for debugging RL code.
- Standard benchmark: Widely used in tutorials and papers; easy to compare with others.
- Continuous state: You cannot enumerate states; you need function approximation (tile coding, linear features, or neural net).
CartPole in code
Environment: Use gym.make("CartPole-v1") (or the Gymnasium equivalent). In Gymnasium, env.reset() returns (obs, info), where obs is the 4-dim state array; env.step(action) returns obs, reward, terminated, truncated, info. Reward is 1.0 per step; the episode ends when terminated or truncated is True.
Solving with linear FA: Discretize or featurize the state (e.g. tile coding over the 4 dimensions, or hand-crafted features like \([\theta, \dot{\theta}, x, \dot{x}]\) with scaling). Use semi-gradient SARSA or Q-learning with epsilon-greedy. Train for many episodes; plot total reward per episode. A good solution can reach 500 steps (or the environment max) consistently.
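The semi-gradient Q-learning update described above can be sketched in NumPy as follows; it assumes one weight vector per action and a shared feature vector \(x(s)\) (the function names and the toy data are illustrative):

```python
import numpy as np

def q(w, x):
    """Q-values for all actions: w has shape (n_actions, n_features)."""
    return w @ x

def semi_gradient_q_update(w, x, a, r, x_next, done, alpha=0.1, gamma=0.99):
    """One semi-gradient Q-learning step: w[a] += alpha * td_error * x(s)."""
    target = r if done else r + gamma * np.max(q(w, x_next))
    td_error = target - q(w, x)[a]
    w[a] += alpha * td_error * x
    return w

# Tiny illustration with 2 actions and 4 (already scaled) features.
rng = np.random.default_rng(0)
w = np.zeros((2, 4))
x = rng.normal(size=4)
w = semi_gradient_q_update(w, x, a=1, r=1.0, x_next=rng.normal(size=4), done=False)
```

For SARSA, replace the max over next-state Q-values with the Q-value of the action actually taken next.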
Solving with a neural network: Use a small MLP that takes the state as input and outputs one Q-value per action. Train with DQN-style updates (experience replay, target network) as in Chapter 23: DQN. CartPole is small enough that even a simple 2-layer net can solve it quickly.
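The core of the DQN-style update is the bootstrapped target \(y = r + \gamma \max_{a'} Q_{\text{target}}(s', a')\), with no bootstrap at terminal states. A sketch in NumPy, with random weights standing in for a trained target network (all names are illustrative):

```python
import numpy as np

def mlp_q(params, s):
    """Tiny 2-layer MLP: state (4,) -> Q-values (2,). params = (W1, b1, W2, b2)."""
    W1, b1, W2, b2 = params
    h = np.maximum(0.0, s @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2

def dqn_targets(target_params, batch, gamma=0.99):
    """y = r + gamma * max_a' Q_target(s', a'); terminal states get y = r."""
    s2, r, done = batch
    q_next = np.array([mlp_q(target_params, s) for s in s2])
    return r + gamma * (1.0 - done) * q_next.max(axis=1)

rng = np.random.default_rng(1)
params = (rng.normal(size=(4, 16)) * 0.1, np.zeros(16),
          rng.normal(size=(16, 2)) * 0.1, np.zeros(2))
s2 = rng.normal(size=(3, 4))           # batch of next states from the replay buffer
r = np.array([1.0, 1.0, 1.0])
done = np.array([0.0, 0.0, 1.0])       # last transition is terminal
y = dqn_targets(params, (s2, r, done))
```

The online network is then regressed toward y (e.g. by gradient descent on the squared TD error), and the target network's weights are refreshed periodically.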
CartPole code (minimal sketch)
See Chapter 21: Linear Function Approximation for tile coding and semi-gradient methods, Feature Engineering for designing features, and Chapter 23: DQN for a neural net approach.