Learning objectives
- State the perceptron learning rule and implement it in NumPy.
- Train a perceptron on the AND gate and verify it achieves 100% accuracy.
- Explain why XOR is not linearly separable and why this requires multi-layer networks.
Concept and real-world motivation
Frank Rosenblatt’s perceptron (1958) is the simplest possible neural network: one neuron, one layer, one learning rule. The perceptron takes binary inputs, computes a weighted sum plus bias, and outputs 1 if the result is positive, 0 otherwise. It learns by correcting its mistakes: whenever it predicts wrongly, it nudges its weights in the direction that would have produced the correct answer.
The perceptron learning rule updates weights only on misclassified examples:
\[w \leftarrow w + \alpha (y - \hat{y}) x\] \[b \leftarrow b + \alpha (y - \hat{y})\]
where \(\alpha\) is the learning rate, \(y\) is the true label, and \(\hat{y}\) is the predicted label. When the prediction is correct, \(y - \hat{y} = 0\) and nothing changes. When wrong, the weight moves toward the correct classification.
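A single update step of this rule in NumPy looks like the following sketch (the variable names and values are illustrative; the full training loop is the subject of the exercise below):

```python
import numpy as np

# One perceptron update step; names (w, b, alpha) are illustrative.
w, b, alpha = np.zeros(2), 0.0, 1.0
x, y = np.array([1.0, 1.0]), 1              # a training example with label 1
y_hat = 1 if np.dot(w, x) + b > 0 else 0    # z = 0 here, so y_hat = 0: a mistake
w = w + alpha * (y - y_hat) * x             # weights move toward x
b = b + alpha * (y - y_hat)                 # bias moves toward the label
print(w, b)                                 # w becomes [1. 1.], b becomes 1.0
```

Because the error term \(y - \hat{y}\) is zero on correct predictions, running this step on a correctly classified example leaves `w` and `b` untouched.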
The perceptron can learn any linearly separable problem — one where a straight line (or hyperplane in higher dimensions) separates the two classes. AND and OR are linearly separable. XOR is not. No single line can separate the XOR truth table into its two classes, which is why single-layer perceptrons cannot solve XOR. The fix is multiple layers — exactly what we build in the next page. In RL, this is why Q-networks need hidden layers: the mapping from state to Q-value is almost never linearly separable.
Exercise: Implement the perceptron learning rule on the AND gate. Initialize all weights to zero, train for 10 epochs over the 4 training examples, and print the final weights and accuracy.
Professor’s hints
- Initialize `w = np.zeros(2)` and `b = 0.0`.
- The `predict` function: `z = np.dot(w, x) + b; return 1 if z > 0 else 0`.
- The update: `w += alpha * (y[i] - y_hat) * X[i]` and `b += alpha * (y[i] - y_hat)`.
- The perceptron convergence theorem guarantees convergence in a finite number of steps for linearly separable data. AND is linearly separable, so 10 epochs is more than enough.
- After training, typical values are `w ≈ [0.4, 0.4]`, `b ≈ -0.6` (exact values depend on the learning rate and update order).
Common pitfalls
- Updating on every example, not just mistakes: the perceptron only updates when it is wrong. The condition is `if y_hat != y[i]`, not unconditional.
- Using the wrong learning rate: for binary perceptrons, `alpha = 1.0` is standard. With zero-initialized weights, a smaller learning rate only rescales the learned weights; the decision boundary and the sequence of mistakes are unchanged for linearly separable data.
- Expecting exact weight values: many weight configurations correctly classify AND; the perceptron finds one valid solution, not a unique one.
Worked solution
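The solution listing did not survive extraction; the sketch below is a reconstruction that follows the hints above (zero-initialized weights, 10 epochs over the 4 examples). The choice `alpha = 0.1` is illustrative, and the exact final weights depend on it and on the example order:

```python
import numpy as np

# AND gate truth table.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w = np.zeros(2)   # zero-initialized weights, per the hints
b = 0.0
alpha = 0.1       # learning rate (illustrative choice)

def predict(x):
    """Step activation: 1 if the weighted sum plus bias is positive."""
    return 1 if np.dot(w, x) + b > 0 else 0

for epoch in range(10):
    for i in range(len(X)):
        y_hat = predict(X[i])
        if y_hat != y[i]:                      # update only on mistakes
            w += alpha * (y[i] - y_hat) * X[i]
            b += alpha * (y[i] - y_hat)

preds = [predict(x) for x in X]
accuracy = np.mean(np.array(preds) == y)
print("weights:", w, "bias:", b)
print("accuracy:", accuracy)                   # reaches 1.0 on AND
```

With this particular order and learning rate, one run ends near `w = [0.2, 0.1]`, `b = -0.2`; any weights that put only the point (1, 1) on the positive side of the boundary are equally valid.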
The perceptron finds a separating hyperplane. For example, with \(w = [1, 1]\) and \(b = -1\) the boundary is the line \(x_1 + x_2 = 1\), and the decision rule \(x_1 + x_2 - 1 > 0\) holds only at the point (1,1), so AND is learned correctly.
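This can be checked directly. The snippet below evaluates the rule \(x_1 + x_2 - 1 > 0\) at all four corners of the truth table; the weights \(w = [1, 1]\), \(b = -1\) are one valid separator, not necessarily the ones your training run produces:

```python
import numpy as np

# Evaluate the decision rule x1 + x2 - 1 > 0 on the AND truth table.
w, b = np.array([1.0, 1.0]), -1.0          # one valid separator (illustrative)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
preds = [1 if np.dot(w, x) + b > 0 else 0 for x in X]
print(preds)  # only (1, 1) lands strictly above the line: [0, 0, 0, 1]
```

Note that (0,1) and (1,0) sit exactly on the line; the strict inequality assigns them to class 0.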
Extra practice
- Warm-up: Run the perceptron on the OR gate data: \(y = [0, 1, 1, 1]\). Does it converge? Print the final accuracy.
- Coding: Try the XOR gate: `y_xor = [0, 1, 1, 0]`. Run the same perceptron for 100 epochs. Print accuracy every 20 epochs. Show that it never exceeds 75%.
- Challenge: Draw the 2D decision boundary for a perceptron trained on AND. The boundary is the line \(w_1 x_1 + w_2 x_2 + b = 0\). For the weights you found, write this as \(x_2 = f(x_1)\). Does this line correctly separate the AND truth table?
- Variant: Implement the perceptron on a 3-input AND gate: inputs are all 8 combinations of 3 bits, and output is 1 only when all three are 1. Does it still converge in 10 epochs?
- Debug: The perceptron below has a wrong sign in the update rule — it moves weights in the wrong direction. Find and fix it.
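The snippet this exercise refers to is not included here; the sketch below reproduces the AND-gate setup with the described sign bug left in, so it can serve as the starting point:

```python
import numpy as np

# Perceptron on the AND gate; contains the sign bug described above.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b, alpha = np.zeros(2), 0.0, 1.0

def predict(x):
    return 1 if np.dot(w, x) + b > 0 else 0

for epoch in range(10):
    for i in range(len(X)):
        y_hat = predict(X[i])
        if y_hat != y[i]:
            w -= alpha * (y[i] - y_hat) * X[i]
            b -= alpha * (y[i] - y_hat)

accuracy = np.mean([predict(x) for x in X] == y)
print("accuracy:", accuracy)  # stuck at 0.75: (1, 1) is never learned
```

Once you have found and fixed the sign, the accuracy should reach 1.0.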
- Conceptual: The perceptron convergence theorem states that if the data is linearly separable, the perceptron will converge in a finite number of steps. What does “linearly separable” mean geometrically for 2D inputs? Draw a sketch and mark which truth tables (AND, OR, NAND, XOR) are linearly separable.
- Recall: Write the perceptron update rule from memory. Then state the condition under which the update fires (when does the perceptron update its weights?). What happens when the perceptron predicts correctly?