This page covers the core RL framework you need for the preliminary assessment: the four main components, the Markov property, exploration vs exploitation, and the discount factor.
Why this matters for RL
Every RL problem is defined by who acts (agent), what they interact with (environment), what they observe (state), what they can do (actions), and what feedback they get (reward). The Markov property and the discount factor shape how we define value functions and algorithms. Exploration vs exploitation is the central tension in learning from experience.
Learning objectives
Define agent, environment, action, reward, and state; state the Markov property and why it matters; give an example of the exploration-exploitation dilemma; explain \(\gamma=0\) vs \(\gamma=1\).
Core concepts
- Agent: The learner/decision maker. Environment: Everything outside the agent it interacts with. Action: A choice the agent can make. Reward: A scalar feedback signal (immediate desirability). State: A representation of the current situation the agent uses to choose actions.
- Markov property: The future is independent of the past given the present state. So the state summarizes all relevant history.
- Exploration vs exploitation: Exploit = use what you believe is best; explore = try something else to get more information. We must balance them.
- Discount factor \(\gamma\): Weights future rewards. \(\gamma=0\): only immediate reward matters. \(\gamma=1\): future rewards count equally (can lead to infinite returns in continuing tasks).
Illustration (discount factor): For the reward sequence \([0, 0, 1]\) with \(\gamma=0.9\), the return from step 0 is \(0 + 0.9\cdot 0 + 0.81\cdot 1 = 0.81\). Including 1, 2, and 3 steps, the partial returns from step 0 are \(0\), \(0\), and \(0.81\): only the final reward contributes, and it is discounted twice.
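The computation above is easy to reproduce in code. Here is a minimal sketch (the function name is ours, not from a library):

```python
def discounted_return(rewards, gamma):
    """G_0 = sum_k gamma^k * rewards[k] for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Reproduce the illustration: rewards [0, 0, 1] with gamma = 0.9 give 0.81
g = discounted_return([0, 0, 1], 0.9)
```

Note that with \(\gamma=0\) the same function returns just the first reward, matching the "only immediate reward matters" case.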
Worked problems (with explanations)
1. Four components and state (Q10)
Q: Define the four main components of a reinforcement learning system: agent, environment, action, reward. Also, what is a state?
These five terms are the vocabulary of RL. The agent observes a state, takes an action, receives a reward and a next state from the environment, and repeats. Every algorithm (value-based, policy-based, model-based) is built on this loop. The state is the “summary” we use for value functions and policies (e.g. \(V(s)\), \(\pi(a|s)\)).
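The interaction loop can be sketched in a few lines. The environment below is a hypothetical toy of our own invention (guess a hidden coin), included only to make the observe–act–reward cycle concrete:

```python
import random

class CoinFlipEnv:
    """Toy environment: action 0 or 1, reward 1 if it matches a hidden coin."""
    def reset(self):
        self.coin = random.randint(0, 1)
        return 0  # a single dummy state

    def step(self, action):
        reward = 1.0 if action == self.coin else 0.0
        next_state = 0
        done = True  # one-step episode
        return next_state, reward, done

env = CoinFlipEnv()
state = env.reset()                           # agent observes a state
action = random.choice([0, 1])                # agent takes an action
next_state, reward, done = env.step(action)   # environment returns feedback
```

Each pass through this loop is one time step of the agent–environment interaction described above.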
2. Markov property (Q11)
Q: What is the Markov property in the context of RL? Why is it important?
The Markov property states that the future is independent of the past given the present state. In other words, the current state contains all the information needed to predict future states and rewards; the history of states and actions before the current state adds no extra information. It allows us to model the problem as a Markov Decision Process (MDP). Then the value function and optimal policy depend only on the current state, not on the full history. That makes the problem tractable: we don’t need to condition on arbitrarily long histories, and algorithms like value iteration and Q-learning are well defined. When the state is not fully observed (partial observability), we may need to use history or a belief state, but the ideal is still a state that is Markov.
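A minimal sketch of what "independent of the past given the present" means in code, using a small two-state Markov chain with a made-up transition table:

```python
import random

# P[s] gives the distribution over the next state given the current state s.
# This table is the ONLY input to the transition -- no history appears anywhere.
P = {0: [0.9, 0.1], 1: [0.5, 0.5]}

def next_state(s):
    """Sample the next state from P[s]; the trajectory that led to s is irrelevant."""
    return random.choices([0, 1], weights=P[s])[0]
```

Two trajectories that arrive at the same state `s` by different histories face exactly the same distribution over futures, which is precisely the Markov property.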
3. Exploration vs exploitation (Q12)
Q: Give a real-world example of the exploration-exploitation dilemma and explain why it’s challenging.
Choosing a restaurant. Exploitation means going to a place you already know and like. Exploration means trying a new place that might be better (or worse). If you only exploit, you may never find a better option. If you only explore, you waste meals on bad choices. The challenge is balancing short-term satisfaction (exploit) with long-term discovery (explore). The agent must exploit what it believes is best to get high reward, but also explore to improve its estimates and discover better actions. Too little exploration leads to suboptimal policies; too much leads to slow learning or excessive risk. Algorithms like ε-greedy, UCB, and Thompson sampling are designed to manage this trade-off.
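Of the algorithms mentioned, ε-greedy is the simplest to sketch: with probability ε take a random action (explore), otherwise take the action with the highest estimated value (exploit):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore (uniform random action);
    otherwise exploit the current best estimate (argmax of q_values)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))          # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# epsilon = 0: pure exploitation; epsilon = 1: pure exploration.
# A common choice in practice is a small epsilon such as 0.1.
action = epsilon_greedy([0.1, 0.5, 0.2], 0.1)
```

With `epsilon=0` this always picks the restaurant you currently believe is best; any `epsilon > 0` occasionally gambles on another one to refine the estimates.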
4. Discount factor (Q13)
Q: What is the purpose of a discount factor \(\gamma\) in RL? What happens when \(\gamma=0\) and when \(\gamma=1\) (in continuing tasks)?
The discount factor \(\gamma \in [0, 1]\) determines the present value of future rewards. The return is \(G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots\). So \(\gamma < 1\) makes the sum finite in continuing tasks (no terminal state) and encodes time preference / uncertainty about the future. In practice we usually choose \(\gamma\) close to 1 (e.g. 0.99) so the agent cares about the long run but the sum still converges. The Bellman equations and all value-based algorithms depend on \(\gamma\).
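The convergence claim is easy to check numerically. For a constant reward \(r\) per step, the discounted return is the geometric series \(r/(1-\gamma)\) when \(\gamma < 1\), but grows without bound when \(\gamma = 1\). A minimal sketch:

```python
def return_constant_reward(r, gamma, steps):
    """Partial discounted return over `steps` steps with constant per-step reward r."""
    return sum(r * gamma ** k for k in range(steps))

# gamma = 0.99: partial sums approach r / (1 - gamma) = 100
g99 = return_constant_reward(1.0, 0.99, 10000)
# gamma = 1.0: the partial sum just equals the number of steps -- no limit
g1 = return_constant_reward(1.0, 1.0, 10000)
```

`g99` is essentially 100 no matter how many more steps we add, while `g1` keeps growing linearly: exactly the "infinite return in continuing tasks" problem that \(\gamma < 1\) avoids.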
Toy example: grid world
Consider a 2×2 grid. States: cells (1,1), (1,2), (2,1), (2,2). Actions: up, down, left, right (with boundaries blocking moves). Reward: +1 for reaching a goal cell, 0 otherwise. Agent: chooses action each step. Environment: returns next state and reward. If the next state and reward depend only on current state and action (and maybe a fixed transition noise), the state is Markov. The discount factor \(\gamma\) then weights how much we care about delayed reward (reaching the goal in few vs many steps).
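The grid world above can be sketched directly, assuming deterministic moves (no transition noise) and goal cell (2,2):

```python
# States are (row, col) in {1, 2} x {1, 2}; moves off the grid leave the state unchanged.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GOAL = (2, 2)

def step(state, action):
    """Markov transition: next state and reward depend only on (state, action)."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    next_state = (r, c) if 1 <= r <= 2 and 1 <= c <= 2 else state  # boundary blocks move
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

# A 2-step path from (1,1) to the goal; the discount weights the delayed reward.
s, total, gamma = (1, 1), 0.0, 0.9
for t, a in enumerate(["right", "down"]):
    s, r = step(s, a)
    total += (gamma ** t) * r
# total = 0 + 0.9 * 1 = 0.9
```

Reaching the goal in fewer steps would discount the +1 less, which is exactly how \(\gamma\) makes the agent prefer shorter paths.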
Professor’s hints
- When you see “MDP,” think: states, actions, rewards, transition dynamics, and (often) \(\gamma\). The agent doesn’t need to know the dynamics; it can learn from experience.
- Exploration is necessary when we don’t know the best action. Once we know (or think we know) the best action, we can exploit.
- In continuing tasks, \(\gamma < 1\) is usually required so that the infinite sum of discounted rewards is finite.
Common pitfalls
- Confusing reward with return: Reward is per-step; return is the (discounted) sum of future rewards. Value functions are expectations of return, not of a single reward.
- Assuming full observability: The Markov property is about the state, not the raw observation. If the state is “what the agent sees” and that omits important information, the state is not Markov and we may need POMDPs or history.
- Using \(\gamma=1\) in continuing tasks without care: The return can be infinite; algorithms that assume bounded returns may break. Use \(\gamma < 1\) or average reward formulation.