Chapter 4: The Reward Hypothesis

Learning objectives

- State the reward hypothesis: that goals can be captured by a scalar reward signal.
- Design a reward function for a concrete task and anticipate unintended behavior.
- Identify and fix "reward hacking," where the agent exploits the reward design instead of pursuing the intended goal.

Concept and real-world RL

The reward hypothesis says that we can capture what we want the agent to do by defining a scalar reward at each step; the agent's goal is then to maximize cumulative reward. In practice, reward design is hard: the agent optimizes exactly what you reward, so an oversimplified or buggy reward invites reward hacking, where the agent finds a loophole that yields high reward without achieving the real goal. For example, a robot rewarded for reducing its distance to the goal might simply push the goal closer, and a game agent rewarded for score might find a way to increment the score without playing. Self-driving, robotics, and game AI all require careful reward shaping and testing for exploits. ...
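The hovering-near-the-goal exploit can be shown in a few lines. Below is a minimal, hypothetical sketch (the 1-D world, `naive_reward`, and `fixed_reward` are illustrative, not from any particular library): a per-step proximity reward lets an agent that hovers beside the goal forever outscore one that actually reaches it, while a sparse success reward with a small step cost restores the intended ordering.

```python
def naive_reward(pos: int, goal: int) -> float:
    # Proximity reward paid every step: rewards being near the goal,
    # which incentivizes hovering rather than finishing the task.
    return 1.0 / (1.0 + abs(goal - pos))

def fixed_reward(pos: int, goal: int, step_cost: float = 0.01) -> float:
    # Sparse success reward plus a small time penalty:
    # only reaching the goal pays, and dawdling costs a little each step.
    return (1.0 if pos == goal else 0.0) - step_cost

def episode_return(trajectory, goal, reward_fn) -> float:
    # Cumulative (undiscounted) reward over a sequence of positions.
    return sum(reward_fn(p, goal) for p in trajectory)

goal = 5
hover = [4] * 20          # hovers one cell from the goal, never reaches it
solve = [1, 2, 3, 4, 5]   # reaches the goal in five steps

# Under the naive reward, hovering outscores solving: reward hacking.
assert episode_return(hover, goal, naive_reward) > episode_return(solve, goal, naive_reward)
# Under the fixed reward, actually solving the task scores higher.
assert episode_return(solve, goal, fixed_reward) > episode_return(hover, goal, fixed_reward)
```

The fix here is one common pattern, not the only one: making the reward sparse and terminal removes the loophole, at the cost of a harder exploration problem.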

March 10, 2026 · 4 min · 709 words · codefrydev