Neural Networks¶

Overview¶

Biological neurons (billions in the brain) fire spike trains down axons across synapses. Artificial neural networks use a cartoon abstraction: tunable units that compute via weighted sums and activation functions, trained by learning rules.

Perceptron Unit¶

Simplified neuron model:

Inputs \(x_1, x_2, x_3, \ldots\) — firing rates / input strengths
Weights \(w_1, w_2, w_3, \ldots\) — gain / sensitivity per input
Activation: \(a = \sum_i w_i x_i\)
Output: compare activation to firing threshold \(\theta\)

\[\begin{split}y = \begin{cases} 1 & \text{if } a \geq \theta \\ 0 & \text{otherwise} \end{cases}\end{split}\]

The whole unit = linear sum + threshold → a Perceptron.

Worked Example

Inputs \((1, 0, -1.5)\), weights \((\tfrac{1}{2}, \tfrac{3}{5}, 1)\), threshold \(\theta = 0\):

\[a = \tfrac{1}{2}(1) + \tfrac{3}{5}(0) + 1(-1.5) = \tfrac{1}{2} - \tfrac{3}{2} = -1\]

Activation is negative → \(y = \mathbf{0}\) (not above threshold).

Perceptron = Half-Plane¶

With 2 inputs \((x_1, x_2)\), the perceptron divides the plane:

Example: \(w_1 = w_2 = \tfrac{1}{2}\), \(\theta = \tfrac{3}{4}\)
If \(x_1 = 0\): need \(x_2 \geq 1.5\) to fire → dividing line at \(x_2 = 1.5\)
If \(x_2 = 0\): need \(x_1 \geq 1.5\)
Solving the linear inequality → a line dividing 0-region from 1-region

Key properties:

A perceptron computes a half-plane (hyperplane in higher dimensions)
Perceptrons are linear classifiers
Always a linear dividing boundary where activation equals threshold
In 3+ dimensions: dividing surface is a hyperplane; output regions are half-spaces

Visualizing the half-plane (\(w_1 = w_2 = \tfrac{1}{2}\), \(\theta = \tfrac{3}{4}\)):

Boundary line: \(w_1 x_1 + w_2 x_2 = \theta\) → \(x_1 + x_2 = 1.5\)
Points with \(x_1 + x_2 \geq 1.5\) → output 1
Points below the line → output 0
Generalizes: any perceptron defines a linear decision boundary

Boolean Functions via Perceptrons¶

Restrict inputs/outputs to \(\{0,1\}\) → Boolean functions (zero = false, one = true).

AND

With \(w_1 = w_2 = \tfrac{1}{2}\), \(\theta = \tfrac{3}{4}\) and binary inputs only:

\(x_1\) | \(x_2\) | Above line? | Output |

|------------|————|-------------|——–| | 0 | 0 | no | 0 | | 0 | 1 | no | 0 | | 1 | 0 | no | 0 | | 1 | 1 | yes | 1 |

→ Computes conjunction (AND).

OR

Want \(y = 1\) if either input is 1. Many valid weight/threshold pairs:

\(w_1 = w_2 = 1\), \(\theta = 1\) — one input alone reaches threshold
Keep original weights, lower \(\theta\) to \(\tfrac{1}{2}\) (or any value in \((0, 1]\))

NOT (single variable)

\(x_1\) | Desired output |

|------------|—————-| | 0 | 1 | | 1 | 0 |

Solution: \(w_1 = -1\), \(\theta = 0\) — negative weight flips the sign; threshold at 0 exploits \(\geq\).

Key result: AND, OR, NOT all expressible as single perceptrons → any Boolean function composable from these.

XOR Requires a Network¶

Single perceptron cannot compute XOR — not linearly separable (the “evil” parity function).

Two-unit network:

Inputs: \(x_1, x_2\)
Unit 1: AND gate (known weights/threshold from above)
Unit 2: inputs \(x_1, x_2\), AND output → computes XOR

Truth table insight:

\(x_1\) | \(x_2\) | AND | OR | XOR |

|------------|————|-----|—-|----| | 0 | 0 | 0 | 0 | 0 | | 0 | 1 | 0 | 1 | 1 | | 1 | 0 | 0 | 1 | 1 | | 1 | 1 | 1 | 1 | 0 |

XOR = OR except at \((1,1)\). Unit 2 should compute OR − 2·AND:

OR weights: \(w_{x_1} = w_{x_2} = 1\), \(\theta = 1\)
Subtract AND with \(w_{\text{AND}} = -2\) (not −1 — when both on, sum = 2 − 2 = 0 < threshold)

Step-by-step for unit 2 (OR minus 2·AND):

Set OR weights: \(w_{x_1} = w_{x_2} = 1\), \(\theta = 1\) → fires if either input on
Add \(w_{\text{AND}} = -1\): fixes \((1,1)\) case? No — sum = 2 − 1 = 1 ≥ threshold still
Set \(w_{\text{AND}} = -2\): sum at \((1,1)\) = 2 − 2 = 0 < threshold → output 0 ✓
When AND off, unit acts as pure OR → correct for other three rows

Infinite solutions exist for XOR networks.

Perceptron Training¶

Hand-set weights are impractical. Need learning from training set: examples \((x, y)\) where \(x\) is a vector, \(y \in \{0,1\}\).

Two learning rules covered:

Perceptron Rule — uses thresholded outputs
Gradient Descent / Delta Rule — uses unthresholded activations

Both modify weights iteratively over the training set.

The Bias Trick¶

Instead of learning \(\theta\) separately, fold it into weights:

Add bias unit with input always 1
Weight on bias = \(-\theta\)
Compare activation to 0 instead of \(\theta\)

This unifies threshold and weight learning.

Perceptron Rule¶

For each training example \((x, y)\):

\[\begin{split}\hat{y} = \begin{cases} 1 & \text{if } w \cdot x \geq 0 \\ 0 & \text{otherwise} \end{cases}\end{split}\]

\[\Delta w_i = \eta \,(y - \hat{y})\, x_i\]

\(\eta\) = learning rate (small step to avoid overshooting).

\(y\) | \(\hat{y}\) | \(y - \hat{y}\) | Weight change |

|----------|—————-|---------------------|—————| | 0 | 0 | 0 | none (correct) | | 1 | 1 | 0 | none (correct) | | 0 | 1 | −1 | decrease (sum too large) | | 1 | 0 | +1 | increase (sum too small) |

When wrong: large \(x_i\) values cause larger weight adjustments in the corrective direction.

Training loop (while error exists):

for each (x, y) in training set:
    compute y_hat = threshold(w · x)
    for each weight i:
        w_i ← w_i + η (y - y_hat) x_i
stop when all examples classified correctly

Linear Separability¶

If a half-plane separates positive (green) from negative (red) examples → data is linearly separable
Perceptron Rule guarantee: converges in finite iterations for linearly separable data
After convergence: \(y - \hat{y} = 0\) always → weights stop changing
If not separable: algorithm never stops (unless you add a stop-on-zero-error condition)
Cannot generally detect non-separability — would require solving halting problem in worst case
Practical check: run algorithm; if it stops → separable; if not after long run → inconclusive

Gradient Descent (Delta Rule)¶

Need a learning algorithm robust to non-linear separability.

Instead of thresholding during training, minimize error on continuous activation \(a = w \cdot x\).

Error Metric¶

\[E(w) = \tfrac{1}{2} \sum_{d \in D} \bigl(y_d - a_d\bigr)^2\]

The \(\tfrac{1}{2}\) is a constant for minimization but cancels the exponent in the derivative.

Partial Derivative¶

Chain rule on \(E\) w.r.t. \(w_i\):

\[\frac{\partial E}{\partial w_i} = \sum_{d \in D} (a_d - y_d)\, x_{d,i}\]

Only the term matching \(w_i\) survives (other weights’ derivatives = 0).

Weight Update¶

Move opposite to gradient:

\[\Delta w_i = \eta \,(y - a)\, x_i\]

Derivation sketch:

\[E = \tfrac{1}{2} \sum_d (y_d - a_d)^2, \quad a_d = \sum_j w_j x_{d,j}\]

\[\frac{\partial E}{\partial w_i} = \sum_d (a_d - y_d)\, x_{d,i} \quad \Rightarrow \quad \Delta w_i = -\eta \frac{\partial E}{\partial w_i} = \eta (y - a)\, x_i\]

Note: derivative of \(w_i\) w.r.t. terms where \(j \neq i\) is zero — only matching input \(x_i\) contributes.

Comparison of Learning Rules¶

Why Not Gradient Descent on \(\hat{y}\)?¶

We want \(\hat{y}\) (thresholded output) to match target, not raw activation \(a\). But gradient descent on \(\hat{y}\) fails because:

Possible reason | Valid? |

|----------------|——–| | Computationally intractable | No | | Not differentiable (step at 0) | Yes | | Weights grow too fast / unstable | No | | Multiple ill-defined answers | No |

\(\hat{y}\) is a step function:

Derivative = 0 everywhere except undefined at the jump (activation = 0)
Zero derivative gives no direction to fix weights
The “pointy spot” at zero is what makes it non-differentiable

Solution: replace step with a smooth, differentiable activation (softened threshold).

Sigmoid Activation¶

Sigmoid (\(\sigma\), “s-shaped”):

\[\sigma(a) = \frac{1}{1 + e^{-a}}\]

Properties:

\(a \to -\infty\): \(e^{-a} \to \infty\) → \(\sigma(a) \to 0\)
\(a \to +\infty\): \(e^{-a} \to 0\) → \(\sigma(a) \to 1\)
Gradual S-curve between ≈−5 and ≈+5 (vs. abrupt threshold)
Differentiable → gradient descent applicable

At \(a = 0\):

\[1 / (1 + e^0 ) = 1 / (1 + 1) = 1 / 2\]

Derivative (elegant closed form):

\[\sigma'(a) = \sigma(a)\,\bigl(1 - \sigma(a)\bigr)\]

At \(a = 0\): \(\sigma'(0) = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4}\)
Derivative flattens at extreme activations (near 0 or 1)
Other smooth threshold functions exist with same general shape

Sigmoid behavior summary:

Activation \(a\) | \(\sigma(a)\) | \(\sigma'(a)\) |

|---------------------|——————-|-------------------| | \(-\infty\) | → 0 | → 0 | | −5 | ≈ 0 | ≈ 0 | | 0 | 0.5 | 0.25 | | +5 | ≈ 1 | ≈ 0 | | \(+\infty\) | → 1 | → 0 |

Neural Network Architecture¶

Construct networks of sigmoid units:

Input layer — components of \(x\)
Hidden layer(s) — intermediate units (not directly observed)
Output layer — \(y\)

Each unit: weighted sum of previous layer → sigmoid → output to next layer.

Back-propagation:

Full input→output mapping is differentiable in all weights
Efficient chain rule organization: forward pass (inputs → outputs), then error flows backward
Computes all weight gradients; updates weights to reduce error
Also called error back-propagation

Any differentiable activation enables this (not just sigmoid).

Sigmoid units are analogous to perceptrons but not identical — no hard threshold, no finite convergence guarantee
Replacing sigmoid with any other differentiable function still permits back-propagation

Back-Propagation Details¶

At its heart, back-propagation is a computationally beneficial organization of the chain rule:

Forward pass: compute activations layer by layer, inputs → outputs
Compute error at output (difference from target)
Backward pass: propagate error signals back through network
Each weight’s gradient tells how much to adjust to reduce error
Update all weights simultaneously via gradient descent

The “back” in back-propagation refers to error flowing from outputs toward inputs.

Local Optima¶

Unlike single-unit parabolic error surface, multi-layer networks create complex error landscapes:

Smooth but many valleys — can get stuck in local optima
No finite-time convergence guarantee
Weights at a local minimum: any single-weight change increases error, but joint changes might find better solution

Optimizing Weights & Overfitting¶

Gradient descent can get stuck in local minima. Advanced optimization:

Momentum — continue past shallow minima (physical analogy: roll through small hills)
Higher-order derivatives (Hessian-based methods)
Randomized optimization (covered later in course)
Complexity penalties — penalize overly complex structure

Overfitting in Neural Nets¶

Same issue as regression (polynomial order) and decision trees (tree size / depth):

Source of complexity | Effect |

|---------------------|——–| | More nodes | Richer mapping, more local minima, can fit noise | | More layers | Same | | Large weight magnitudes | Same architecture but more complex effective function |

Decision tree analogy for overfitting:

Let tree grow too deep → explains every quirk in data → overfits
Remedies: pruning, depth limits, complexity penalty
Neural nets: same idea via architecture limits, weight penalties, early stopping

Neural net failures (similar to overfitting):

Stuck in local minima
Weights growing too large
Training too long without stopping criterion

Cross-validation selects: number of hidden layers, nodes per layer, when to stop training.

Training vs. Validation Error¶

Training error drops monotonically with iterations
Held-out / CV error may initially drop, then rise as weights grow
Stop at minimum CV error — further training overfits even without changing architecture
Weight magnitudes often grow at the turnaround point

Restriction Bias¶

Restriction bias = representational power of the hypothesis class (what functions can be represented).

Stage | Restriction | Can represent |

|-------|————-|---------------| | Single perceptron | Linear only | Half-planes | | Perceptron network | Compositional | XOR, all Boolean functions | | Sigmoid network, many units | Weak restriction | Very general mappings |

Universal approximation:

Continuous functions (no discontinuities): 1 hidden layer with enough units — each unit models a patch, output layer stitches patches
Arbitrary functions (including discontinuous): 2 hidden layers with enough units — patches can have jumps between them

Neural nets in general: not very restrictive → high overfitting risk.

Any fixed architecture (bounded layers/units) does impose restriction — cannot capture every arbitrary function.

Preference Bias¶

Preference bias = what the learning algorithm prefers among representable hypotheses.

Weight Initialization¶

Must initialize weights (cannot start undefined)
Typical: small random values
Random: no reason to prefer one region; variability across runs; helps avoid identical local minima
Small: start with low complexity; large weights → overfitting risk

Preferences¶

Correct over incorrect
Simpler over complex (when error equal)
Start simple → allow complexity only as training progresses and error demands it
Stop before weights grow too large (cross-validation)

Occam’s Razor¶

“Entities should not be multiplied unnecessarily.”

Don’t increase complexity unless it improves fit
If two hypotheses have similar error → choose the simpler one
Formalized in supervised learning → better generalization

Neural Nets vs Other Supervised Learners¶

Property | Decision Trees | Regression | Neural Networks |

|----------|—————|------------|—————–| | Build process | Construct tree, then may overfit | Solve closed form, may overfit | Iterative training; error drops over iterations | | Overfitting signal | Tree too big | Polynomial too high | Weights too large; train too long | | Stop criterion | Pruning / CV | CV on polynomial degree | CV on architecture + stop training at CV minimum | | Unique aspect | — | — | Complexity in weight magnitude, not just structure |

Perceptron vs Sigmoid — Comparison¶

Property | Perceptron (hard threshold) | Sigmoid unit |

|----------|—————————-|--------------| | Output | 0 or 1 (discrete) | \((0, 1)\) continuous | | Differentiable | No | Yes | | Linear separability | Required for Perceptron Rule | Not required for gradient descent | | Convergence | Finite (if separable) | Asymptotic to local minimum | | Composition | Boolean circuits | Universal approximator (enough units) |

Summary¶

Perceptron: linear sum + threshold; half-plane; Perceptron Rule converges if linearly separable
Boolean logic: AND, OR, NOT via single units; XOR needs network
Gradient descent: continuous activation; robust to non-separability; local optima only
Sigmoid: smooth differentiable threshold; \(\sigma'(a) = \sigma(a)(1-\sigma(a))\)
Back-propagation: chain rule through network; error flows backward
Overfitting: control architecture, weight magnitude, training duration (CV)
Bias: weak restriction + preference for simple correct solutions