The curriculum uses Gym-style environments (e.g. Blackjack, Cliff Walking, CartPole, LunarLander). Gymnasium is the maintained fork of OpenAI Gym. The same API appears in many exercises: reset, step, observation and action spaces.
Why Gym matters for RL
- API — `env.reset()` returns `(obs, info)`; `env.step(action)` returns `(obs, reward, terminated, truncated, info)`. Episodes run until `terminated` or `truncated`.
- Spaces — `env.observation_space` and `env.action_space` describe shape and type (`Discrete`, `Box`). You need them to build networks and to sample random actions.
- Wrappers — Record episode stats, normalize observations, stack frames, or limit time steps without changing the base env.
- Seeding — Reproducibility via `env.reset(seed=42)` and `env.action_space.seed(42)`.
Core concepts with examples
Basic loop: reset and step
Inspecting spaces
Multiple episodes
Wrappers: record episode stats
Seeding for reproducibility
Exercises
Exercise 1. Create a CartPole-v1 environment. Call reset(seed=42) and then take 10 random actions with action_space.sample(), calling step each time. Print the observation shape and the cumulative reward after 10 steps. Close the env.
Exercise 2. Run 100 episodes of CartPole with a random policy (sample action each step). Store the return (sum of rewards) for each episode in a list. Compute and print the mean and standard deviation of returns. Use a fixed seed for reset and action_space so the result is reproducible.
Exercise 3. Inspect the observation and action spaces of “CartPole-v1” and “LunarLander-v2” (or LunarLanderContinuous-v2). Print the type (Discrete/Box), shape, and for Box the low/high bounds. Write a short comment on how you would size the input and output layers of a neural network for each.
Exercise 4. Implement a simple fixed policy for CartPole: if the cart position (obs[0]) is positive, take action 1; else take action 0. Run 20 episodes with this policy and record the return for each. Report the mean return. (This policy is poor; the exercise is just to practice using a non-random policy.)
Exercise 5. Write a function run_episode(env, policy, max_steps=500) that runs one episode: reset, then loop step until terminated, truncated, or max_steps. The policy is a callable policy(obs) -> action. Return the list of (obs, action, reward) for each step and the total return. Test with a random policy and with the fixed policy from Exercise 4.
Exercise 6. Run 50 episodes of CartPole with a random policy. Store the length (number of steps) of each episode. Compute the mean and max length. In RL: Episode length is often reported alongside return; for CartPole, longer is better.
Exercise 7. Create Blackjack (e.g. gym.make("Blackjack-v1")). Run 10 episodes with a random policy (sample from env.action_space). Print the observation shape and the meaning of the first few components (player sum, dealer card, usable ace) from the docs. In RL: Blackjack is used in the curriculum for Monte Carlo prediction.
Exercise 8. (Challenge) Write a wrapper that counts the number of steps per episode and, when the episode ends, prints “Episode finished in N steps, return R”. Use a class that holds env, overrides step to count and check terminated or truncated, and prints on done. In RL: Custom wrappers are used for logging, frame stacking, and reward shaping.
Professor’s hints
- Always set `done = terminated or truncated`; Gymnasium uses both flags. Ignoring `truncated` (e.g. a time limit) can lead to wrong value estimates or infinite loops.
- In RL: Seed both `env.reset(seed=...)` and `env.action_space.seed(...)` so the environment and your random actions are reproducible. Do this once per run or once per episode, depending on what you want to reproduce.
- Use `env.observation_space.shape` and `env.action_space.n` (for Discrete) to size your neural network. For Box, use `env.observation_space.shape[0]` for the state dimension.
- Call `env.close()` when you are done (e.g. after all episodes); some envs use resources that should be released.
Common pitfalls
- Using the old Gym API: old Gym used a single `done` flag and `step` returned 4 values. Gymnasium returns 5: `(obs, reward, terminated, truncated, info)`. Check the library version and docs.
- Assuming `obs` is a NumPy array: it usually is, but some envs return dicts or other types. Check `type(obs)` and `obs.shape` before passing it to a network.
- Forgetting to handle truncation: if you only check `terminated`, time-limited episodes may never "end" in your loop logic. Always use `done = terminated or truncated`.
- Not seeding: without seeds, you cannot reproduce results or debug. Seed at the start of training and (if you want identical episodes) per episode.
Docs: gymnasium.farama.org. Used in Chapters 11–12 (Blackjack), 13–16 (Cliff Walking), 23+ (CartPole, etc.).