Use this self-check after completing Phase 0: Programming from Zero. If you can answer at least 8 of 10 correctly, you are ready to move on.


1. Predict the output

What does this print?

x = 5
if x > 3:
    print("A")
elif x > 1:
    print("B")
else:
    print("C")
Answer
A — The first condition x > 3 is True (5 > 3), so Python executes the first branch and skips the rest.

2. Write a function

Write clamp(x, lo, hi) that returns lo if x < lo, hi if x > hi, otherwise x. (Used in PPO to clip ratios.)

Answer
def clamp(x, lo, hi):
    if x < lo: return lo
    if x > hi: return hi
    return x
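As a sketch of why the question mentions PPO: the clipped surrogate objective bounds the probability ratio to [1 − ε, 1 + ε]. The ε = 0.2 and the ratio value below are illustrative numbers, not part of the quiz.

```python
def clamp(x, lo, hi):
    if x < lo: return lo
    if x > hi: return hi
    return x

# PPO-style clipping of a probability ratio (illustrative values)
eps = 0.2
ratio = 1.35                            # new_prob / old_prob, hypothetical
clipped = clamp(ratio, 1 - eps, 1 + eps)
print(clipped)  # 1.2 — a ratio above 1 + eps is pulled back to the boundary
```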

3. Find the bug

This should print the sum of a rewards list but has a bug. Find it.

rewards = [0.5, 0.3, 0.2]
total = 0
for r in rewards
    total = total + r
print(total)
Answer
Missing colon after for r in rewards. Python requires : at the end of loop/if/def headers; without it this line raises a SyntaxError. Fix: for r in rewards:
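For reference, the corrected loop (summing floats can pick up tiny rounding error, so compare with a tolerance rather than == in real code):

```python
rewards = [0.5, 0.3, 0.2]
total = 0
for r in rewards:        # colon added
    total = total + r
print(total)             # sums to 1.0
```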

4. Predict the output

q = [0.1, 0.5, 0.3]
print(q[1])
print(q[-1])
print(len(q))
Answer
0.5 (index 1), 0.3 (last element, index -1), 3 (length).

5. Write a function

Write discounted_return(rewards, gamma) that computes G = r₀ + γr₁ + γ²r₂ + ⋯. Test with rewards=[0,0,1], gamma=0.9 (expected: 0.81).

Answer
def discounted_return(rewards, gamma):
    return sum(gamma**t * r for t, r in enumerate(rewards))
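An equivalent backward-iteration version is common in RL code because it avoids recomputing powers of γ; it uses the recurrence G_t = r_t + γ·G_{t+1}:

```python
def discounted_return_rev(rewards, gamma):
    G = 0.0
    for r in reversed(rewards):   # G_t = r_t + gamma * G_{t+1}
        G = r + gamma * G
    return G

print(discounted_return_rev([0, 0, 1], 0.9))  # ≈ 0.81 (γ² · 1)
```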

6. Find the bug

def epsilon_greedy(Q, epsilon=0.1):
    import random
    if random.random() > epsilon:
        return random.randrange(len(Q))
    return Q.index(max(Q))
Answer
The condition is reversed: random.random() > epsilon is True about 90% of the time, so the agent mostly explores instead of mostly exploiting. Fix: explore (random action) when random.random() < epsilon, and otherwise exploit with Q.index(max(Q)).
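The corrected version, for reference, with the import hoisted to the top as is conventional; the seeded sanity check at the bottom is just an illustration:

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    if random.random() < epsilon:           # explore with probability epsilon
        return random.randrange(len(Q))     # random action
    return Q.index(max(Q))                  # exploit: greedy action

random.seed(0)
actions = [epsilon_greedy([0.1, 0.9, 0.3]) for _ in range(1000)]
print(actions.count(1) / 1000)  # mostly action 1 (the greedy one)
```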

7. Predict the output

config = {"gamma": 0.99, "epsilon": 0.1}
config["lr"] = 0.001
print(len(config))
print("alpha" in config)
Answer
3 (three keys: gamma, epsilon, lr), False (“alpha” is not a key).

8. Write a function

Write max_q(Q_dict, state, n_actions) that returns the maximum Q-value for a given state. Q_dict maps (state, action) → float. Use Q_dict.get((state, a), 0.0) for each action.

Answer
def max_q(Q_dict, state, n_actions):
    return max(Q_dict.get((state, a), 0.0) for a in range(n_actions))
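A quick usage sketch with a hypothetical Q-table (the states and values here are made up for illustration):

```python
def max_q(Q_dict, state, n_actions):
    return max(Q_dict.get((state, a), 0.0) for a in range(n_actions))

# Hypothetical Q-table: only some (state, action) pairs have been visited
Q = {("s0", 0): 0.2, ("s0", 1): 0.7}
print(max_q(Q, "s0", 3))  # 0.7 — unseen action 2 defaults to 0.0
print(max_q(Q, "s1", 3))  # 0.0 — unseen state: every action defaults to 0.0
```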

9. Find the bug

def run_episode(rewards, gamma=0.9):
    G = 0
    for t in range(len(rewards)):
        G = G + gamma * rewards[t]   # bug
    return G
Answer
The discount is wrong: gamma * rewards[t] should be gamma**t * rewards[t]. As written, every reward is discounted by exactly γ regardless of step. Fix: G += gamma**t * rewards[t].
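The fixed function, checked against the expected value from question 5:

```python
def run_episode(rewards, gamma=0.9):
    G = 0.0
    for t in range(len(rewards)):
        G += gamma**t * rewards[t]   # discount grows with the timestep
    return G

print(run_episode([0, 0, 1]))  # ≈ 0.81 (γ² · 1)
```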

10. Predict the output

def f(x, y=2):
    return x * y

print(f(3))
print(f(3, 3))
print(f(y=4, x=2))
Answer
6 (3×2), 9 (3×3), 8 (x=2, y=4 → 2×4). Default argument y=2 is used when not specified.

Score: 8–10: Ready for Phase 1. 6–7: Review the specific topics you missed. Below 6: Complete Phase 0 and the Python Confidence Builder before continuing.