Learning objectives
- Create a custom Gymnasium (or Gym) environment: inherit from `gym.Env`; implement `reset`, `step`, and optionally `render`.
- Define `observation_space` and `action_space` (e.g. `Discrete(4)` for up/down/left/right).
- Implement a text-based render (e.g. print a grid with agent and goal).
Concept and real-world RL
Real RL often requires custom environments: simulators for robotics, games, or domain-specific tasks. The Gym API (reset, step, observation_space, action_space) is the standard. Implementing a small maze teaches you how to encode state (e.g. agent position), handle boundaries and obstacles, and return (obs, reward, terminated, truncated, info). In practice, you will wrap or write envs for your problem and reuse the same agents (e.g. Q-learning, DQN) trained on standard envs.
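The standard interaction pattern described above can be sketched with a tiny stub (`StubEnv` is hypothetical, standing in for any Gymnasium-style environment; it just counts steps and "reaches the goal" after five):

```python
class StubEnv:
    """Hypothetical stand-in for a Gymnasium env: counts steps, terminates at 5."""

    def reset(self, seed=None):
        self.t = 0
        return self.t, {}                        # (obs, info)

    def step(self, action):
        self.t += 1
        terminated = self.t >= 5                 # "goal" reached
        # Gymnasium-style 5-tuple: (obs, reward, terminated, truncated, info)
        return self.t, -1.0, terminated, False, {}


env = StubEnv()
obs, info = env.reset()
total = 0.0
done = False
while not done:
    obs, reward, terminated, truncated, info = env.step(0)  # fixed action
    total += reward
    done = terminated or truncated               # old Gym's single "done"
print(total)  # -5.0
```

The same loop works unchanged for any environment that follows the Gymnasium API, which is why agents trained on standard envs transfer to custom ones.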
Illustration (tabular Q size): For a 5×5 maze with 4 actions, the Q-table has states × actions entries. With 23 free cells (e.g. 2 walls), that is 23 × 4 = 92 entries. The chart below shows how Q-table size grows with maze size (states × 4 actions).
Exercise: Create a custom Gym environment for a 2D maze with obstacles. Define observation (agent position) and discrete actions (up, down, left, right). Implement a render function that prints a text-based map.
Professor’s hints
- Subclass `gymnasium.Env`. In `__init__`, set `self.observation_space` (e.g. `gymnasium.spaces.Box` for a position or `Discrete` for a cell index) and `self.action_space = gymnasium.spaces.Discrete(4)`.
- `reset(seed=None)`: set the agent to the start position and return `(obs, info)`. The observation can be a tuple `(row, col)` or a flattened index; make it hashable or a numpy array for compatibility.
- `step(action)`: compute the next position (clip to the grid, handle walls). If the next cell is an obstacle, stay put and optionally give a negative reward. If it is the goal, set `terminated=True` and give a positive reward. Return `(obs, reward, terminated, truncated, info)`.
- `render()`: build a 2D grid of characters (e.g. `#` wall, `.` empty, `A` agent, `G` goal) and print it row by row. Set `render_mode="human"` or similar if required by the API.
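Putting the hints together, here is a minimal sketch. It is written standalone (no `gymnasium` import) so the logic is easy to follow; a real implementation would subclass `gymnasium.Env` and declare the spaces with `gymnasium.spaces`. The wall layout, class name, and rewards are illustrative:

```python
class MazeEnv:
    """Minimal 2D maze following the Gymnasium API shape (reset/step/render)."""

    # Action encoding: 0=up, 1=down, 2=left, 3=right ("up" decreases the row index).
    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

    def __init__(self, size=5, walls=frozenset({(1, 1), (2, 3)})):
        self.size = size
        self.walls = walls                  # illustrative obstacle layout
        self.start = (0, 0)
        self.goal = (size - 1, size - 1)
        self.agent = self.start

    def reset(self, seed=None):
        self.agent = self.start
        return self.agent, {}               # (obs, info)

    def step(self, action):
        dr, dc = self.MOVES[action]
        r = min(max(self.agent[0] + dr, 0), self.size - 1)   # clip to grid
        c = min(max(self.agent[1] + dc, 0), self.size - 1)
        if (r, c) not in self.walls:        # obstacles block movement
            self.agent = (r, c)
        terminated = self.agent == self.goal
        reward = 1.0 if terminated else -1.0                 # step penalty
        return self.agent, reward, terminated, False, {}

    def render(self):
        for r in range(self.size):
            row = []
            for c in range(self.size):
                if (r, c) == self.agent:
                    row.append("A")
                elif (r, c) == self.goal:
                    row.append("G")
                elif (r, c) in self.walls:
                    row.append("#")
                else:
                    row.append(".")
            print("".join(row))


env = MazeEnv()
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(3)  # move right
print(obs)  # (0, 1)
env.render()
```

Note that an attempted move into a wall leaves the agent in place (e.g. action 1 from `(0, 1)` would hit the wall at `(1, 1)`), which matches the "stay and maybe give negative reward" hint.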
Common pitfalls
- Gymnasium vs Gym: Gymnasium uses `terminated` and `truncated`; old Gym used a single `done`. Use both flags and set `done = terminated or truncated` in your training loop.
- Observation type: Many algorithms expect a numpy array or a consistent type. Avoid returning a different shape in different states (e.g. at termination). Use a fixed observation space even for terminal states (e.g. same shape, zero or the last state).
- Action semantics: Document whether 0=up, 1=down, etc., and be consistent. In a 2D grid, “up” often means decrease row index.
Worked solution (warm-up: maze Q-table size)
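The warm-up can be checked with a quick computation (assuming the 5×5 maze with 2 walls from the illustration above, and 4 actions):

```python
n_actions = 4                       # up/down/left/right

# 5x5 maze with 2 walls: wall cells are unreachable, so they are not states.
free_cells = 5 * 5 - 2
print(free_cells, free_cells * n_actions)  # 23 92

# Growth of the Q-table with maze size (no walls): states x 4 actions.
for n in (5, 10, 20):
    print(n, n * n * n_actions)
```

The Q-table grows linearly in the number of states but quadratically in the side length of the maze, which is why tabular methods stop scaling for large grids.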
Extra practice
- Warm-up: In your maze, how many possible observations (states) are there? How many actions? What is the size of a tabular Q-table?
- Coding: Implement a minimal Gym maze (e.g. 5×5, start (0,0), goal (4,4), walls optional). Implement reset() and step(action). Run a random policy for 10 episodes and log episode length and return.
- Challenge: Add a “time limit” (e.g. 100 steps): set `truncated=True` when the step count exceeds 100, and return this in `info` as well. Run a random agent for one episode and confirm truncation occurs.
- Variant: Change the reward function from a step penalty (−1 per step) to a dense distance-based reward (e.g. minus the Euclidean distance to the goal each step). Run Q-learning and compare learning speed with the step-penalty version.
- Debug: The code below returns `done` (a single boolean, old Gym API) instead of the Gymnasium-style `terminated, truncated`. Fix it.
- Conceptual: What is the difference between `terminated` and `truncated` in the Gymnasium API? Why does this distinction matter for computing the TD target?
- Recall: List the four required methods of a Gymnasium environment from memory.
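The time-limit challenge above can be sketched with a standalone stub (`TimedEnv` is a hypothetical name; it has no walls or goal so only truncation can end an episode, following the Gymnasium 5-tuple return):

```python
import random


class TimedEnv:
    """Hypothetical env with only a time limit: truncates once steps exceed max_steps."""

    def __init__(self, max_steps=100):
        self.max_steps = max_steps

    def reset(self, seed=None):
        self.t = 0
        return self.t, {}

    def step(self, action):
        self.t += 1
        truncated = self.t > self.max_steps      # time limit exceeded
        # Never terminated (no goal); report truncation in info as well.
        return self.t, -1.0, False, truncated, {"truncated": truncated}


env = TimedEnv()
obs, info = env.reset()
truncated = False
while not truncated:                             # random agent, one episode
    obs, reward, terminated, truncated, info = env.step(random.randrange(4))
print(obs, info["truncated"])  # 101 True
```

Because truncation (not termination) ended the episode, a bootstrapping agent should still use the value of the final state in its TD target, which is exactly why the conceptual question above matters.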