# OpenAI Gym / Gymnasium

The curriculum uses Gym-style environments (e.g. Blackjack, Cliff Walking, CartPole, LunarLander). Gymnasium is the maintained fork of OpenAI Gym. The same API appears in many exercises: reset, step, observation and action spaces.

## Why Gym matters for RL

- **API** — `env.reset()` returns `(obs, info)`; `env.step(action)` returns `(obs, reward, terminated, truncated, info)`. Episodes run until terminated or truncated.
- **Spaces** — `env.observation_space` and `env.action_space` describe shape and type (`Discrete`, `Box`). You need them to build networks and to sample random actions.
- **Wrappers** — Record episode stats, normalize observations, stack frames, or limit time steps without changing the base env.
- **Seeding** — Reproducibility via `env.reset(seed=42)` and `env.action_space.seed(42)`.

## Core concepts with examples

### Basic loop: reset and step

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    total_reward += reward
env.close()
print("Episode return:", total_reward)
```

### Inspecting spaces

```python
print(env.observation_space)  # Box(4,) for CartPole
print(env.action_space)       # Discrete(2)

# Sample actions
action = env.action_space.sample()

# For Box (continuous): low, high, shape
# env.observation_space.low, .high, .shape
```

### Multiple episodes

```python
n_episodes = 10
returns = []
for ep in range(n_episodes):
    obs, info = env.reset()
    done = False
    G = 0
    while not done:
        action = env.action_space.sample()
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        G += reward
    returns.append(G)
env.close()
print("Mean return:", sum(returns) / len(returns))
```

### Wrappers: record episode stats

```python
from gymnasium.wrappers import RecordEpisodeStatistics

env = gym.make("CartPole-v1")
env = RecordEpisodeStatistics(env)
obs, info = env.reset()
# ... run episode ...
# After the step that ends the episode, info may contain
# "episode": {"r": ..., "l": ...}
```

### Seeding for reproducibility

```python
env.reset(seed=0)
env.action_space.seed(0)
# Same sequence of random actions and (with a deterministic env)
# the same trajectory
```

## Exercises

**Exercise 1.** Create a `CartPole-v1` environment. Call `reset(seed=42)` and then take 10 random actions with `action_space.sample()`, calling `step` each time. Print the observation shape and the cumulative reward after 10 steps. Close the env.

...

March 10, 2026 · 5 min · 929 words · codefrydev