Chapter 18: Custom Gym Environments (Part 1)

Learning objectives

- Create a custom Gymnasium (or Gym) environment: inherit from gym.Env and implement reset, step, and an optional render.
- Define observation_space and action_space (e.g. Discrete(4) for up/down/left/right).
- Implement a text-based render (e.g. print a grid with the agent and goal).

Concept and real-world RL

Real RL often requires custom environments: simulators for robotics, games, or domain-specific tasks. The Gym API (reset, step, observation_space, action_space) is the standard. Implementing a small maze teaches you how to encode state (e.g. the agent's position), handle boundaries and obstacles, and return (obs, reward, terminated, truncated, info). In practice, you will wrap or write environments for your own problem and reuse the same agents (e.g. Q-learning, DQN) that you trained on standard envs. ...

March 10, 2026 · 3 min · 556 words · codefrydev

CartPole

Learning objectives

- Understand the CartPole environment: state (cart position, velocity, pole angle, pole angular velocity), actions (left/right), and reward (+1 per step until termination).
- Implement a solution with linear function approximation (e.g. tile coding or simple features) and semi-gradient SARSA or Q-learning.
- Optionally solve it with a small neural network (e.g. DQN-style), as in later chapters.

What is CartPole?

CartPole (also called the Inverted Pendulum) is a classic control task in OpenAI Gym / Gymnasium. A pole is attached to a cart that moves along a track. The state is continuous: cart position \(x\), cart velocity \(\dot{x}\), pole angle \(\theta\), and pole angular velocity \(\dot{\theta}\). Actions are discrete: 0 = push left, 1 = push right. Reward: +1 for every step until the episode ends. The episode ends when the pole angle leaves a range (e.g. \(\pm 12°\)), when the cart leaves the track (if bounded), or after a maximum step count (e.g. 500). The goal is therefore to keep the pole upright as long as possible (maximize total reward = number of steps). ...
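To make the linear-function-approximation objective concrete, here is a minimal sketch of the semi-gradient SARSA pieces for a CartPole-like state. The feature layout (raw state plus bias, one block per action), the helper names, and the step sizes are illustrative assumptions, not the chapter's code:

```python
import random


def features(state, action, n_actions=2):
    """One feature block per action: the raw state values plus a bias term."""
    base = list(state) + [1.0]
    x = [0.0] * (len(base) * n_actions)
    x[action * len(base):(action + 1) * len(base)] = base
    return x


def q_value(w, state, action):
    # Linear approximation: Q(s, a) = w . x(s, a)
    return sum(wi * xi for wi, xi in zip(w, features(state, action)))


def epsilon_greedy(w, state, epsilon=0.1, n_actions=2):
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q_value(w, state, a))


def sarsa_update(w, s, a, r, s2, a2, alpha=0.1, gamma=0.99, done=False):
    """Semi-gradient SARSA: w <- w + alpha * (target - Q(s,a)) * grad_w Q(s,a).

    For a linear Q, the gradient is just the feature vector x(s, a).
    """
    target = r if done else r + gamma * q_value(w, s2, a2)
    td_error = target - q_value(w, s, a)
    x = features(s, a)
    return [wi + alpha * td_error * xi for wi, xi in zip(w, x)]


# One update from zero weights: Q(s, 0) becomes positive, Q(s, 1) stays 0.
w = [0.0] * 10
s = (0.0, 0.0, 0.05, 0.0)  # (x, x_dot, theta, theta_dot)
w = sarsa_update(w, s, 0, 1.0, s, 0, done=True)
print(q_value(w, s, 0), q_value(w, s, 1))
```

Plugged into the usual episode loop (act with epsilon_greedy, update after each step), this is enough to improve on random play; tile coding replaces features with a richer encoding but leaves the update rule unchanged.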

March 10, 2026 · 3 min · 451 words · codefrydev

OpenAI Gym / Gymnasium

The curriculum uses Gym-style environments (e.g. Blackjack, Cliff Walking, CartPole, LunarLander). Gymnasium is the maintained fork of OpenAI Gym. The same API appears in many exercises: reset, step, and the observation and action spaces.

Why Gym matters for RL

- API — env.reset() returns (obs, info); env.step(action) returns (obs, reward, terminated, truncated, info). Episodes run until terminated or truncated.
- Spaces — env.observation_space and env.action_space describe shape and type (Discrete, Box). You need them to build networks and to sample random actions.
- Wrappers — record episode stats, normalize observations, stack frames, or limit time steps without changing the base env.
- Seeding — reproducibility via env.reset(seed=42) and env.action_space.seed(42).

Core concepts with examples

Basic loop: reset and step

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    total_reward += reward
env.close()
print("Episode return:", total_reward)
```

Inspecting spaces

```python
print(env.observation_space)  # Box(4,) for CartPole
print(env.action_space)       # Discrete(2)
# Sample actions
action = env.action_space.sample()
# For Box (continuous): low, high, shape
# env.observation_space.low, .high, .shape
```

Multiple episodes

```python
n_episodes = 10
returns = []
for ep in range(n_episodes):
    obs, info = env.reset()
    done = False
    G = 0
    while not done:
        action = env.action_space.sample()
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        G += reward
    returns.append(G)
env.close()
print("Mean return:", sum(returns) / len(returns))
```

Wrappers: record episode stats

```python
from gymnasium.wrappers import RecordEpisodeStatistics

env = gym.make("CartPole-v1")
env = RecordEpisodeStatistics(env)
obs, info = env.reset()
# ... run episode ...
# After the step that ends an episode, info may contain "episode": {"r": ..., "l": ...}
```

Seeding for reproducibility

```python
env.reset(seed=0)
env.action_space.seed(0)
# Same sequence of random actions and (with a deterministic env) the same trajectory
```

Exercises

Exercise 1. Create a CartPole-v1 environment. Call reset(seed=42) and then take 10 random actions with action_space.sample(), calling step each time. Print the observation shape and the cumulative reward after 10 steps. Close the env. ...

March 10, 2026 · 5 min · 929 words · codefrydev