Chapter 49: Custom Gym Environments (Part 2)

Learning objectives

Create a custom Gym environment: a 2D point mass that must navigate to a goal while avoiding an obstacle.
Define a continuous action (e.g. force in x and y) and a reward function (e.g. negative distance to goal, plus a penalty for hitting the obstacle or boundary).
Test the environment with a SAC (or PPO) agent and verify that the agent can learn to reach the goal.

Concept and real-world RL

Custom environments let you model robot navigation, recommendation (state = user, action = item), or trading (state = market, action = trade). A 2D point mass is a minimal continuous control task: state = (x, y, vx, vy), action = (fx, fy), reward = -distance to goal + penalties. In robot control, similar point-mass or particle models are used for planning and RL; in game AI, custom envs are used for prototyping. Implementing the Gym interface (reset, step, observation_space, action_space) and testing with a known algorithm (SAC) validates the design.

Where you see this in practice: Research and industry often use custom Gym envs for domain-specific problems (warehouse robots, driving, dialogue).

Illustration (2D point mass state): A simple continuous control env might have state (x, y, vx, vy) and action (fx, fy). [Figure: conceptual scatter of a trajectory in (x, y) over 20 steps.]

Exercise: Create a custom continuous control environment: a 2D point mass that must navigate to a goal while avoiding an obstacle. Define a continuous action (force) and a reward function. Test your environment with a SAC agent.

Professor’s hints

Subclass gym.Env; implement reset(), which returns (obs, info), and step(action), which returns (obs, reward, terminated, truncated, info). Set observation_space (e.g. a Box of shape (4,) for x, y, vx, vy) and action_space (a bounded Box of shape (2,) for fx, fy).
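Dynamics: e.g. \(x_{t+1} = x_t + v_{x,t} \, dt\), \(v_{x,t+1} = v_{x,t} + f_{x,t} \, dt\), and likewise for y (with clipping, e.g. of force and velocity). Goal at (1,1), obstacle as a circle. Reward: -distance_to_goal, with an extra -10 while the point mass is inside the obstacle; or, as a sparse alternative, +1 only at the goal.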
Test: run SAC for 50k steps; plot position (x,y) over time. Does the agent eventually reach the goal?

Common pitfalls

Reward shaping: Too much shaping can lead the agent to exploit loopholes; a reward that is too sparse can make learning slow. Start simple (-distance) and then add the obstacle penalty.
Action scale: Clip or scale actions to a reasonable force; otherwise the point mass can shoot off.
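
To see why action scale matters, the sketch below applies the same oversized raw action for 100 Euler steps, once unbounded and once clipped and scaled. The helper names scale_action and final_speed, and the values of dt and max_force, are hypothetical and chosen only for illustration.

```python
import numpy as np


def scale_action(raw_action, max_force=1.0):
    """Clip a raw policy output to [-1, 1], then scale it to a bounded force."""
    raw = np.asarray(raw_action, dtype=np.float64)
    return max_force * np.clip(raw, -1.0, 1.0)


def final_speed(raw_action, steps=100, dt=0.05, bounded=True):
    """Apply the same raw action for `steps` Euler updates; return the speed."""
    vel = np.zeros(2)
    for _ in range(steps):
        if bounded:
            force = scale_action(raw_action)
        else:
            force = np.asarray(raw_action, dtype=np.float64)
        vel = vel + force * dt
    return float(np.linalg.norm(vel))


raw = [50.0, -50.0]  # an oversized raw action, e.g. from an untrained policy
print("unbounded speed:", final_speed(raw, bounded=False))
print("bounded speed:  ", final_speed(raw, bounded=True))
```

The unbounded run ends at a speed many times larger than the bounded one, which is the "shooting off" behaviour described above.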

Worked solution (warm-up: continuous control)

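Key idea: In continuous control the policy outputs a distribution over actions (e.g. a Gaussian whose mean comes from the network, with a learned or fixed standard deviation). We sample \(a \sim \pi(\cdot|s)\). Policy-gradient methods such as PPO use \(\nabla \log \pi(a|s)\) together with an advantage estimate (e.g. TD error or GAE); SAC instead differentiates through the sample using the reparameterization trick. For bounded action spaces we can squash the pre-squash sample \(u\) through tanh, \(a = \tanh(u)\), and correct the log-probability with the log-Jacobian of the squashing: \(\log \pi(a|s) = \log \mu(u|s) - \sum_i \log(1 - \tanh^2(u_i))\), where \(\mu\) is the Gaussian density. Both SAC and PPO handle continuous actions with stochastic policies of this kind.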
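
The tanh squashing and its log-probability correction can be sketched in NumPy as below. This is an illustration of the change-of-variables formula only, not the code of any particular library; the function names are hypothetical, and the correction uses an algebraically equivalent form of \(\log(1 - \tanh^2(u))\) that avoids taking the log of a value near zero.

```python
import numpy as np


def tanh_log_det(u):
    """Sum over dims of log(1 - tanh(u)^2), in a numerically stable form.

    Uses log(1 - tanh(u)^2) = 2 * (log 2 - u - softplus(-2u)).
    """
    return float(np.sum(2.0 * (np.log(2.0) - u - np.logaddexp(0.0, -2.0 * u))))


def sample_squashed_gaussian(mean, log_std, rng):
    """Sample a = tanh(u) with u ~ N(mean, std^2); return (a, log pi(a))."""
    std = np.exp(log_std)
    u = mean + std * rng.standard_normal(mean.shape)  # pre-squash sample
    a = np.tanh(u)                                    # bounded in [-1, 1]
    # Diagonal Gaussian log-density of u, summed over action dimensions.
    log_prob_u = float(np.sum(
        -0.5 * ((u - mean) / std) ** 2 - log_std - 0.5 * np.log(2.0 * np.pi)
    ))
    # Change of variables: subtract the log-Jacobian of tanh.
    return a, log_prob_u - tanh_log_det(u)


rng = np.random.default_rng(0)
action, log_prob = sample_squashed_gaussian(np.zeros(2), np.zeros(2), rng)
print(action, log_prob)
```

Because \(1 - \tanh^2(u) \le 1\), the correction term is never positive, so subtracting it never lowers the log-density: squashing compresses the action range and concentrates probability mass.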

Extra practice

Warm-up: What should the observation space and action space be for a 2D point mass? (State: position and velocity; action: force.)
Coding: Implement the env and run random actions for 100 steps. Check that (x,y) moves and that reward is computed. Then run SAC for 20k steps and plot the trajectory of (x,y) for one episode.
Challenge: Add a moving obstacle (e.g. oscillating circle). Does SAC still learn to reach the goal while avoiding it?
Variant: Replace the dense reward (−distance to goal) with a sparse reward (+1 only when the agent reaches the goal). How does learning speed change? Does SAC still learn within 50k steps?
Debug: The env below returns a single done boolean (old Gym API) instead of terminated, truncated (Gymnasium API), which breaks SAC’s TD target computation. Fix it.

Conceptual: Dense reward (e.g. −distance to goal) is faster to learn from than sparse reward, but can introduce reward hacking. Give one example of how a point-mass agent might exploit a dense distance reward.
Recall: List the five return values of env.step(action) in the Gymnasium API from memory.
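
For the coding item above, one possible test harness is sketched below. It assumes the environment follows the Gymnasium API and that Stable-Baselines3 is installed for the SAC part; the helper names random_rollout, train_sac, and policy_rollout are hypothetical, and the first two observation entries are assumed to be (x, y).

```python
def random_rollout(env, n_steps=100, seed=0):
    """Step the env with random actions; return visited (x, y) and rewards."""
    obs, info = env.reset(seed=seed)
    positions, rewards = [obs[:2].copy()], []
    for _ in range(n_steps):
        action = env.action_space.sample()
        obs, reward, terminated, truncated, info = env.step(action)
        positions.append(obs[:2].copy())
        rewards.append(reward)
        if terminated or truncated:
            obs, info = env.reset()
    return positions, rewards


def train_sac(env, total_timesteps=20_000):
    """Train SAC on env with Stable-Baselines3 (assumed to be installed)."""
    from stable_baselines3 import SAC  # imported lazily; only needed here
    model = SAC("MlpPolicy", env, verbose=0)
    model.learn(total_timesteps=total_timesteps)
    return model


def policy_rollout(model, env, max_steps=500):
    """Run one episode with the trained policy; return the (x, y) path."""
    obs, info = env.reset()
    path = [obs[:2].copy()]
    for _ in range(max_steps):
        action, _state = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = env.step(action)
        path.append(obs[:2].copy())
        if terminated or truncated:
            break
    return path
```

A typical sequence would be to call random_rollout on a fresh instance of your environment first, confirm that (x, y) changes and rewards are finite, and only then call train_sac followed by policy_rollout and plot the returned path.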