MDP
Gridworld discounted return from a sequence of actions.
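A minimal sketch of the computation this topic covers: the discounted return G = Σ_t γ^t r_t of a fixed reward sequence, accumulated backward so each step is G_t = r_t + γ·G_{t+1}. The reward sequence and γ = 0.9 are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Discounted return of a finite reward sequence."""
    g = 0.0
    # Accumulate from the last reward backward: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three -1 step penalties, then +10 at the goal, gamma = 0.9.
print(discounted_return([-1, -1, -1, 10], 0.9))  # 4.58
```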
Two-state MDP transition probability matrices.
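One way to encode such matrices, with illustrative (assumed) probabilities: index P by action, then by current state, so P[a][s][s'] gives the transition probability, and check that every row is a valid distribution.

```python
# Two-state MDP with states {0, 1} and actions {"stay", "switch"}.
# P[a][s][s'] = probability of moving from s to s' under action a.
# The probability values are illustrative, not from any fixed example.
P = {
    "stay":   [[0.9, 0.1],
               [0.2, 0.8]],
    "switch": [[0.1, 0.9],
               [0.7, 0.3]],
}

# Each row must be a probability distribution over next states.
for matrix in P.values():
    for row in matrix:
        assert abs(sum(row) - 1.0) < 1e-12

# One-step lookahead: next-state distribution from s=0 under "switch".
print(P["switch"][0])  # [0.1, 0.9]
```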
The classic gridworld environment: states, actions, transitions, and terminal states.
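A compact environment sketch for this setup, assuming a 4×4 grid, four deterministic moves, walls that block movement, a -1 step penalty, and a single terminal corner; the size, layout, and rewards are illustrative choices, not a fixed standard.

```python
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

class Gridworld:
    """Minimal deterministic gridworld; states are (row, col) tuples."""

    def __init__(self, size=4, terminal=(3, 3)):
        self.size = size
        self.terminal = terminal

    def step(self, state, action):
        """Return (next_state, reward, done). Walls block movement."""
        if state == self.terminal:
            return state, 0.0, True
        dr, dc = ACTIONS[action]
        r = min(max(state[0] + dr, 0), self.size - 1)
        c = min(max(state[1] + dc, 0), self.size - 1)
        next_state = (r, c)
        done = next_state == self.terminal
        return next_state, -1.0, done  # -1 step penalty until termination

env = Gridworld()
print(env.step((0, 0), "right"))  # ((0, 1), -1.0, False)
```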
How to design reward signals for MDPs and gridworlds: shaping, terminal rewards, and step penalties.
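The shaping part of this topic can be sketched with potential-based shaping, r'(s, a, s') = r(s, a, s') + γ·φ(s') − φ(s), which is known to preserve the optimal policy (Ng et al., 1999). The potential φ used here, negative Manhattan distance to an assumed goal cell, is purely illustrative.

```python
GOAL = (3, 3)  # assumed goal cell for illustration

def phi(state):
    """Potential: negative Manhattan distance to the goal."""
    return -(abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1]))

def shaped_reward(r, s, s_next, gamma=0.9):
    """Potential-based shaping preserves the optimal policy."""
    return r + gamma * phi(s_next) - phi(s)

# Moving toward the goal earns a shaping bonus on top of the -1 step penalty:
# -1 + (-5) - (-6) = 0.0 with gamma = 1.0.
print(shaped_reward(-1.0, (0, 0), (0, 1), gamma=1.0))  # 0.0
```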
Derive Bellman optimality equation for Q*(s,a).
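A standard derivation sketch: define Q*(s, a) as the expected discounted return from taking a in s and acting optimally afterward, then condition on the next state and use V*(s') = max_{a'} Q*(s', a'):

```latex
\begin{aligned}
Q^*(s,a) &= \mathbb{E}\big[\, R_{t+1} + \gamma\, G_{t+1} \mid S_t = s,\ A_t = a,\ \pi^* \,\big] \\
         &= \sum_{s'} P(s' \mid s, a)\,\Big[\, R(s, a, s') + \gamma\, V^*(s') \,\Big] \\
         &= \sum_{s'} P(s' \mid s, a)\,\Big[\, R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \,\Big].
\end{aligned}
```

The last line is the Bellman optimality equation for Q*: the recursion closes because acting optimally after s' means choosing the action that maximizes Q* there.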
Agent, environment, state, action, reward, the Markov property, the exploration-exploitation trade-off, and the discount factor, each with an explanation.
Gridworld with wind: actions are shifted by a wind effect. Theory and code for policy evaluation and policy iteration.
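The policy-evaluation half of this topic can be sketched as iterative policy evaluation on a tiny windy grid: the wind adds an extra rightward push in certain columns, and the uniform-random policy is evaluated by in-place sweeps until the updates settle. The 3×3 layout, wind column, and γ = 0.9 are illustrative assumptions.

```python
SIZE = 3
GOAL = (2, 2)
WIND_COLS = {1}  # columns where the wind pushes one extra cell right
GAMMA = 0.9
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Deterministic windy dynamics: wind shifts the move one column right."""
    if state == GOAL:
        return state, 0.0
    dr, dc = MOVES[action]
    if state[1] in WIND_COLS:
        dc += 1
    r = min(max(state[0] + dr, 0), SIZE - 1)
    c = min(max(state[1] + dc, 0), SIZE - 1)
    return (r, c), -1.0

def evaluate_uniform_policy(theta=1e-8):
    """In-place iterative policy evaluation of the uniform-random policy."""
    V = {(r, c): 0.0 for r in range(SIZE) for c in range(SIZE)}
    while True:
        delta = 0.0
        for s in V:
            if s == GOAL:
                continue  # terminal state keeps value 0
            v = 0.0
            for a in MOVES:  # uniform policy, deterministic dynamics
                s2, r = step(s, a)
                v += 0.25 * (r + GAMMA * V[s2])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

V = evaluate_uniform_policy()
print(round(V[(0, 0)], 3))  # negative: steps cost -1 until the goal
```

Policy iteration would alternate this evaluation step with greedy policy improvement over the same `step` dynamics.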
10–15 questions covering MDPs, the Bellman equations, Monte Carlo vs. temporal-difference learning, and SARSA vs. Q-learning. Solutions included.
A simple stock-trading MDP: buy/sell/hold actions, profit as the reward, and the Sharpe ratio as a performance measure.
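A toy sketch of this setup, under loud assumptions: a fixed price series, a single-share position (buy only when flat, sell only when holding), realized profit as the only nonzero reward, and the Sharpe ratio computed over per-step rewards with a zero risk-free rate.

```python
from statistics import mean, stdev

def run_episode(prices, actions):
    """Execute buy/sell/hold against a price series; return per-step rewards."""
    entry = None  # purchase price of the single share, if holding
    rewards = []
    for price, action in zip(prices, actions):
        r = 0.0
        if action == "buy" and entry is None:
            entry = price
        elif action == "sell" and entry is not None:
            r = price - entry  # realized profit is the reward
            entry = None
        rewards.append(r)
    return rewards

def sharpe(rewards):
    """Sharpe ratio of per-step rewards, risk-free rate assumed 0."""
    if stdev(rewards) == 0:
        return 0.0
    return mean(rewards) / stdev(rewards)

rewards = run_episode([10, 12, 11, 15], ["buy", "hold", "hold", "sell"])
print(rewards, round(sharpe(rewards), 3))  # [0.0, 0.0, 0.0, 5] 0.5
```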
15 short drill problems for Volume 1: discounted return, MDPs, value functions, Bellman equations, and dynamic programming.