The curriculum uses Gym-style environments (e.g. Blackjack, Cliff Walking, CartPole, LunarLander). Gymnasium is the maintained fork of OpenAI Gym. The same API appears in many exercises: reset, step, observation and action spaces.
Why Gym matters for RL
- API — `env.reset()` returns `(obs, info)`; `env.step(action)` returns `(obs, reward, terminated, truncated, info)`. Episodes run until `terminated` or `truncated`.
- Spaces — `env.observation_space` and `env.action_space` describe shape and type (`Discrete`, `Box`). You need them to build networks and to sample random actions.
- Wrappers — Record episode stats, normalize observations, stack frames, or limit time steps without changing the base env.
- Seeding — Reproducibility via `env.reset(seed=42)` and `env.action_space.seed(42)`.
Core concepts with examples
Basic loop: reset and step
Inspecting spaces
Multiple episodes
Wrappers: record episode stats
Seeding for reproducibility
Exercises
Exercise 1. Create a CartPole-v1 environment. Call reset(seed=42) and then take 10 random actions with action_space.sample(), calling step each time. Print the observation shape and the cumulative reward after 10 steps. Close the env.
Exercise 2. Run 100 episodes of CartPole with a random policy (sample action each step). Store the return (sum of rewards) for each episode in a list. Compute and print the mean and standard deviation of returns. Use a fixed seed for reset and action_space so the result is reproducible.
Exercise 3. Inspect the observation and action spaces of “CartPole-v1” and “LunarLander-v2” (or LunarLanderContinuous-v2). Print the type (Discrete/Box), shape, and for Box the low/high bounds. Write a short comment on how you would size the input and output layers of a neural network for each.
Exercise 4. Implement a simple fixed policy for CartPole: if the cart position (obs[0]) is positive, take action 1; else take action 0. Run 20 episodes with this policy and record the return for each. Report the mean return. (This policy is poor; the exercise is just to practice using a non-random policy.)
Exercise 5. Write a function run_episode(env, policy, max_steps=500) that runs one episode: reset, then loop step until terminated, truncated, or max_steps. The policy is a callable policy(obs) -> action. Return the list of (obs, action, reward) for each step and the total return. Test with a random policy and with the fixed policy from Exercise 4.
Exercise 6. Run 50 episodes of CartPole with a random policy. Store the length (number of steps) of each episode. Compute the mean and max length. In RL: Episode length is often reported alongside return; for CartPole, longer is better.
Exercise 7. Create Blackjack (e.g. gym.make("Blackjack-v1")). Run 10 episodes with a random policy (sample from env.action_space). Print the observation shape and the meaning of the first few components (player sum, dealer card, usable ace) from the docs. In RL: Blackjack is used in the curriculum for Monte Carlo prediction.
Exercise 8. (Challenge) Write a wrapper that counts the number of steps per episode and, when the episode ends, prints “Episode finished in N steps, return R”. Use a class that holds env, overrides step to count and check terminated or truncated, and prints on done. In RL: Custom wrappers are used for logging, frame stacking, and reward shaping.
Professor’s hints
- Always set `done = terminated or truncated`; Gymnasium uses both flags. Ignoring `truncated` (e.g. a time limit) can lead to wrong value estimates or infinite loops.
- In RL: Seed both `env.reset(seed=...)` and `env.action_space.seed(...)` so the environment and your random actions are reproducible. Do this once per run or once per episode, depending on what you want to reproduce.
- Use `env.observation_space.shape` and `env.action_space.n` (for Discrete) to size your neural network. For Box, use `env.observation_space.shape[0]` for the state dimension.
- Call `env.close()` when you are done (e.g. after all episodes); some envs use resources that should be released.
Common pitfalls
- Using the old Gym API: old Gym used a single `done` flag and `step` returned 4 values. Gymnasium returns 5: `(obs, reward, terminated, truncated, info)`. Check the library version and docs.
- Assuming `obs` is a NumPy array: it usually is, but some envs return dicts or other types. Check `type(obs)` and `obs.shape` before passing it to a network.
- Forgetting to handle truncation: if you only check `terminated`, time-limited episodes may never "end" in your loop logic. Always use `done = terminated or truncated`.
- Not seeding: without seeds, you cannot reproduce results or debug. Seed at the start of training and (if you want identical episodes) per episode.
Docs: gymnasium.farama.org. Used in Chapters 11–12 (Blackjack), 13–16 (Cliff Walking), 23+ (CartPole, etc.).