Chapter 1: The Reinforcement Learning Framework

Learning objectives
- Identify the main components of an RL system: agent, environment, state, action, reward.
- Compute the discounted return for a sequence of rewards.
- Relate the gridworld to real tasks (e.g. navigation, games) where an agent gets delayed reward.

Concept and real-world RL
In reinforcement learning, an agent interacts with an environment: at each step the agent is in a state, chooses an action, and receives a reward and a new state. The return is the sum of (discounted) rewards along a trajectory; the agent’s goal is to maximize this return. A gridworld is a simple environment where states are cells and actions move the agent; it models robot navigation (e.g. a robot moving to a goal in a warehouse) and game AI (e.g. a character moving on a map). In robot navigation, the state might be (row, col), the actions are up/down/left/right, and the reward is +1 at the goal and often 0 or a small penalty per step. Discounting (\(\gamma < 1\)) makes future rewards worth less than immediate ones and keeps the return finite over long or infinite horizons. ...
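The discounted return described above can be sketched in a few lines of Python (the function name `discounted_return` is ours, not from the chapter):

```python
def discounted_return(rewards, gamma):
    # G_0 = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Gridworld-style episode: 0 reward per step, +1 on reaching the goal,
# so only the final reward contributes, scaled by gamma^2.
g0 = discounted_return([0, 0, 1], 0.9)  # 0.9**2 * 1 = 0.81
```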

March 10, 2026 · 4 min · 748 words · codefrydev

Chapter 3: Markov Decision Processes (MDPs)

Learning objectives
- Define an MDP: states, actions, transition probabilities, and rewards.
- Write transition probability matrices \(P(s' \mid s, a)\) for a small MDP.
- Recognize the Markov property: the next state and reward depend only on the current state and action.

Concept and real-world RL
A Markov Decision Process (MDP) is the standard mathematical model for RL: a set of states, a set of actions, transition probabilities \(P(s', r \mid s, a)\), and a discount factor. The Markov property says that the future (next state and reward) depends only on the current state and action, not on earlier history. That allows us to plan using the current state alone. Real-world examples include board games (state = board position), robot navigation (state = position/velocity), and queue control (state = queue lengths). Writing out \(P\) and reward tables for a tiny MDP is the first step toward value iteration and policy iteration. ...
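One convenient way to write out \(P(s' \mid s, a)\) for a tiny MDP is a nested dictionary, one probability row per (state, action) pair. The two-state MDP and the names `s0`, `s1`, `stay`, `go` below are hypothetical, chosen only for illustration:

```python
# P[s][a][s'] = probability of landing in s' after taking a in s.
P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.5, "s1": 0.5}},
}

def is_valid(P):
    # Each row P(. | s, a) must be a probability distribution: sum to 1.
    return all(
        abs(sum(dist.values()) - 1.0) < 1e-9
        for actions in P.values()
        for dist in actions.values()
    )
```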

March 10, 2026 · 3 min · 574 words · codefrydev

Gridworld

Learning objectives
- Define a gridworld MDP: grid cells as states, actions (up/down/left/right), transitions, and terminal states.
- Understand how hitting the boundary keeps the agent in place (or wraps, depending on design).
- Use gridworld as the running example for policy evaluation and policy iteration.

What is Gridworld?
Gridworld is a simple MDP used throughout RL teaching and research. The environment is a grid of cells (e.g. 4×4 or 5×5). The state is the agent’s position \((i, j)\). Actions are typically up, down, left, and right. Transitions: taking an action moves the agent one cell in that direction; if the move would go off the grid, the agent either stays in place (usually receiving the same step reward) or the world wraps around, depending on the specification. ...
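A minimal sketch of the transition function under the "stay in place at the boundary" convention, assuming a 4×4 grid with \((0, 0)\) in the top-left corner (the `step` helper is ours, not from the post):

```python
def step(state, action, n=4):
    # Deterministic gridworld transition on an n x n grid.
    # Moves that would leave the grid are clipped: the agent stays put.
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    di, dj = moves[action]
    i, j = state
    ni = min(max(i + di, 0), n - 1)
    nj = min(max(j + dj, 0), n - 1)
    return (ni, nj)
```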

March 10, 2026 · 2 min · 356 words · codefrydev

Choosing Rewards

Learning objectives
- Understand how the reward choice affects optimal behavior (what the agent will try to maximize).
- Use step penalties and terminal rewards in gridworld to encourage short paths or goal reaching.
- Avoid common pitfalls: reward hacking and unintended incentives.

Why rewards matter
The agent’s goal in an MDP is to maximize cumulative (often discounted) reward, so the reward function defines the task. Changing the rewards changes what is “optimal.” Design rewards so that the behavior you want is exactly what maximizes total reward. ...
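A sketch of the step-penalty-plus-terminal-bonus pattern for gridworld (the default values here are illustrative, not prescribed by the post):

```python
def reward(next_state, goal, step_penalty=-0.01, goal_reward=1.0):
    # A small per-step penalty makes shorter paths accumulate more total
    # reward; the terminal bonus defines the task (reach the goal).
    return goal_reward if next_state == goal else step_penalty
```

With `step_penalty=0.0` every goal-reaching path has the same return under \(\gamma = 1\), so the penalty (or discounting) is what makes short paths optimal.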

March 10, 2026 · 2 min · 354 words · codefrydev

Chapter 6: The Bellman Equations

Learning objectives
- Derive the Bellman optimality equation for \(Q^*(s,a)\) from the definition of the optimal action value.
- Contrast the optimality equation (max over actions) with the expectation equation (average over actions under \(\pi\)).
- Explain why the optimality equations are nonlinear and how algorithms (e.g. value iteration) handle them.

Concept and real-world RL
The optimal action-value function \(Q^*(s,a)\) is the expected return from state \(s\), taking action \(a\), then acting optimally. The Bellman optimality equation for \(Q^*\) states that \(Q^*(s,a)\) equals the expected immediate reward plus \(\gamma\) times the maximum over next-state action values (not an average under a policy). This “max” makes the system nonlinear: the optimal policy is greedy with respect to \(Q^*\), and \(Q^*\) is the fixed point of this equation. Value iteration and Q-learning are built on this; in practice, we approximate \(Q^*\) with tables or function approximators. ...
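The fixed-point view above translates directly into a tabular value-iteration sketch. The two-state MDP (`s0`, `s1`) and the expected-reward table `R[s][a]` are hypothetical examples of ours, not from the chapter:

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    # Iterate the Bellman optimality backup until the largest change is
    # below tol:  Q(s,a) <- sum_s' P(s'|s,a) * (R(s,a) + gamma * max_a' Q(s',a'))
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    while True:
        delta = 0.0
        for s in P:
            for a in P[s]:
                q = sum(p * (R[s][a] + gamma * max(Q[s2].values()))
                        for s2, p in P[s][a].items())
                delta = max(delta, abs(q - Q[s][a]))
                Q[s][a] = q
        if delta < tol:
            return Q

# s1 is absorbing with zero reward; "go" from s0 pays 1 and ends in s1.
P = {"s0": {"go": {"s1": 1.0}, "stay": {"s0": 1.0}},
     "s1": {"stay": {"s1": 1.0}}}
R = {"s0": {"go": 1.0, "stay": 0.0}, "s1": {"stay": 0.0}}
Q = value_iteration(P, R)
```

Here \(Q^*(s_0, \text{go}) = 1\) and \(Q^*(s_0, \text{stay}) = 0 + \gamma \cdot 1 = 0.9\): the "max" in the backup is exactly what makes "stay" inherit the value of the better action one step later.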

March 10, 2026 · 3 min · 589 words · codefrydev

Windy Gridworld

Learning objectives
- Understand the Windy Gridworld environment: movement is shifted by a column-dependent wind.
- Implement the transition model and run iterative policy evaluation and policy iteration on it.
- Compare with the standard gridworld (no wind).

Theory
Windy Gridworld (Sutton & Barto) is a rectangular grid (e.g. 7×10) with:
- States: cell positions \((row, col)\).
- Actions: up, down, left, right (four actions).
- Wind: each column has a fixed wind strength (a non-negative integer). The wind blows upward, so with row 0 at the top, the resulting row is shifted toward smaller indices by the wind strength. From cell \((r, c)\), action “up” takes you to \((r - 1 - \text{wind}[c], c)\), “down” to \((r + 1 - \text{wind}[c], c)\), and so on. Positions that would leave the grid are clipped to the boundary.
- Terminal state: one goal cell. Typical reward: -1 per step until the goal.

So the same action can lead to different next states depending on the column (wind). The MDP is still finite and deterministic given the state and action (the wind is fixed per column). This makes the problem slightly harder than a plain gridworld and is a good testbed for policy evaluation and policy iteration. ...
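The transition model above can be sketched as follows, assuming row 0 is the top row so upward wind decreases the row index; `WIND` is the column-strength profile from the Sutton & Barto example, and the helper name `windy_step` is ours:

```python
def windy_step(state, action, wind, n_rows=7, n_cols=10):
    # Windy Gridworld transition: the chosen move plus an upward shift of
    # wind[c] rows, where c is the current column; clip to the grid.
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    dr, dc = moves[action]
    r, c = state
    nr = min(max(r + dr - wind[c], 0), n_rows - 1)
    nc = min(max(c + dc, 0), n_cols - 1)
    return (nr, nc)

WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]  # wind strength per column
```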

March 10, 2026 · 2 min · 392 words · codefrydev

Chapter 93: RL for Algorithmic Trading

Learning objectives
- Simulate a simple stock market with one asset (e.g. the price follows a random walk or a simple mean-reverting process).
- Design an MDP: state = (price, position, cash, or features); actions = buy / sell / hold (possibly with size); reward = profit (or risk-adjusted return).
- Train an agent (e.g. DQN or PPO) on this MDP and evaluate its Sharpe ratio (mean return / std of returns over episodes or over time).
- Discuss risk management: position limits, drawdown, transaction costs; how the reward and state design affect behavior.
- Relate the exercise to the trading and finance anchor scenarios (state = market + portfolio, action = trade, reward = profit or Sharpe).

Concept and real-world RL ...
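Two of the building blocks above (the random-walk price simulator and the Sharpe ratio metric) can be sketched as below; this is an illustrative toy, not a market model, and the function names are ours:

```python
import random

def simulate_prices(n_steps=250, p0=100.0, sigma=1.0, seed=0):
    # Gaussian random-walk price path (toy model for the exercise).
    rng = random.Random(seed)
    prices = [p0]
    for _ in range(n_steps):
        prices.append(prices[-1] + rng.gauss(0.0, sigma))
    return prices

def sharpe_ratio(returns):
    # Mean return over standard deviation of returns (no annualization,
    # zero risk-free rate); 0.0 if returns have no variance.
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / n
    return mean / var ** 0.5 if var > 0 else 0.0
```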

March 10, 2026 · 4 min · 652 words · codefrydev

Phase 3 Foundations Quiz

Use this quiz after completing Volume 1 and Volume 2 (or the Phase 3 mini-project). If you can answer at least 12 of 15 correctly, you are ready for Phase 4 and Volume 3.

1. RL framework
Q: Name the four main components of an RL system (agent, environment, and two more). What is a state?
Answer: Agent, environment, action, reward. State: a representation of the current situation that the agent uses to choose actions.

2. Return
Q: For rewards [0, 0, 1] and \(\gamma = 0.9\), compute the discounted return \(G_0\) from step 0. ...

March 10, 2026 · 5 min · 876 words · codefrydev

RL Framework

This page covers the core RL framework you need for the preliminary assessment: the four main components, the Markov property, exploration vs exploitation, and the discount factor. Back to Preliminary.

Why this matters for RL
Every RL problem is defined by who acts (agent), what they interact with (environment), what they observe (state), what they can do (actions), and what feedback they get (reward). The Markov property and the discount factor shape how we define value functions and algorithms. Exploration vs exploitation is the central tension in learning from experience. ...
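The exploration-exploitation tension mentioned above is most often resolved with an \(\epsilon\)-greedy rule; here is a minimal sketch (the helper name `epsilon_greedy` is ours, not from the page):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    # With probability epsilon take a random action (explore);
    # otherwise take the action with the highest estimated value (exploit).
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0` this is purely greedy; with `epsilon=1` it is purely random, and intermediate values trade off the two.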

March 10, 2026 · 6 min · 1198 words · codefrydev