Kernel Methods and SVM¶

Support Vector Machines¶

Motivation: Maximum Margin¶

Setup: 2D points labeled + (positive) and − (negative), linearly separable.

Infinitely many separating lines exist (e.g., 3 green × 3 red anchor points → 9 possible lines).
Best line: middle line with equal space on both sides — largest margin / demilitarized zone.
Lines close to one class risk misclassifying nearby unseen points → overfitting (believing training data too much).
Least commitment principle: be consistent with data while committing as little as possible.
This is the foundation of Support Vector Machines (SVMs).

Hyperplane Notation¶

In multiple dimensions, separating surface is a hyperplane:

\[y = w^T x + b\]

Symbol

Meaning

\(y\) \(w\) \(b\) \(x\)

Classification label (+1 / −1) Parameter vector (normal to hyperplane) Bias — shifts hyperplane in/out of origin Input feature vector
Decision boundary (classifier output = 0, neither positive nor negative):

\[w^T x + b = 0\]
Labels: \(y_i \in \{-1, +1\}\) (same convention as boosting).
Margin boundaries — parallel hyperplanes touching each class without misclassification:

\[w^T x + b = +1 \quad \text{(positive class boundary)}\]

\[w^T x + b = -1 \quad \text{(negative class boundary)}\]

Margin Derivation¶

Construction:

Top gray line: as close as possible to + points without crossing → outputs +1 at first + point.
Bottom gray line: as close as possible to − points → outputs −1.
Middle orange line: decision boundary, equidistant from both gray lines.

Distance between parallel hyperplanes:

Pick \(x_1\) on +1 boundary, \(x_2\) on −1 boundary.
Subtract equations:

\[w^T (x_1 - x_2) = 2\]
Divide both sides by \(|w|\) (unit normal):

\[\frac{w^T}{|w|} (x_1 - x_2) = \frac{2}{|w|}\]
Left side = length of projection of \(x_1 - x_2\) onto \(w\).
\(w\) is perpendicular to hyperplane; \(x_1 - x_2\) is also perpendicular → projection equals full distance.
Margin \(m\):

\[m = \frac{2}{|w|}\]

Optimization goal: maximize \(m\) ⟺ minimize \(|w|\), subject to correct classification.

Setting \(w = 0\) would maximize margin but fail to classify — need constraints.

Primal Optimization Problem¶

Correct classification as constraint:

Want: classifier output ≥ +1 for positive examples, ≤ −1 for negative examples.
Multiply both sides by label \(y_i\) (clever trick — flips inequality for negatives):

\[y_i (w^T x_i + b) \geq 1 \quad \forall i \in \text{training set}\]

Primal formulation:

Original: maximize \(\frac{2}{|W|}\) subject to constraints above.
Equivalent (easier to solve): minimize \(\frac{1}{2}|W|^2\) subject to same constraints. - Reciprocal reverses max/min (for positive values). - Squaring is monotone → same optimum, magnified differences.

Quadratic Programming (QP):

Form: minimize quadratic function subject to linear constraints.
SVM primal always has a solution; always has a unique solution.
Solved via standard QP solvers.

Dual Form (Lagrange Multipliers)¶

Transform primal → dual (via Lagrange multipliers / KKT):

\[\max_\alpha \;\; \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j\]

Constraints:

\[\alpha_i \geq 0 \quad \forall i\]

\[\sum_i \alpha_i y_i = 0\]

Recover primal weights from dual:

\[w = \sum_i \alpha_i y_i x_i\]

Recover bias \(b\): plug any support vector (on margin boundary) into \(w^T x + b = \pm 1\).

Support Vectors¶

Key property: in the dual solution, most \(\alpha_i = 0\).

Non-zero \(\alpha_i\) → point is a support vector.
\(w = \sum_i \alpha_i y_i x_i\) — only non-zero-α points contribute.
Points far from decision boundary → \(\alpha = 0\) (no influence on boundary).
Points on/near boundary → non-zero \(\alpha\) (define the margin contours).

Quiz: Optimal Separator — which points have zero α?

Answer: points far from the green decision line (e.g., lower-left −, upper-right +).
Points close to boundary are support vectors.

Analogy to KNN:

Like instance-based learning where only local points matter.
SVM pre-computes which points matter (via QP) — can discard the rest.
Not just nearest neighbors; QP selects the informative ones.

Non-Linearly Separable Data¶

Case 1: Outlier / noise

Single − point among + cluster → no zero-error linear separator; margin becomes negative.
Soft margin: find separator minimizing misclassifications while maximizing margin (homework).
Trade-off knob between margin size and number of errors.

Case 2: Non-linear structure (“Linearly Married”)

- points form inner cluster; − points form outer ring.
No line separates classes in 2D.
Solution: transform data to higher dimensions where separation becomes linear.

Feature Mapping and the Kernel Trick¶

Feature Map φ — Worked Example¶

Problem: ring of − points around + cluster; not linearly separable in 2D.

Transform \(\phi\): 2D → 3D without adding external information:

\[\phi(q) = \left(q_1^2,\; q_2^2,\; \sqrt{2}\, q_1 q_2\right) \quad \text{where } q = (q_1, q_2)\]

Pluses near origin → lower 3rd coordinate.
Minuses on ring → higher 3rd coordinate.
Hyperplane in 3D separates the classes.

Quiz: What is the Output? — compute \(\phi(x)^T \phi(y)\):

\[\phi(x)^T \phi(y) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 = (x_1 y_1 + x_2 y_2)^2 = (x^T y)^2\]

Factorization: \((x^T y)^2\) — squared dot product in original space.
Geometric shift: - Original \(x^T y\): projection-based (directional) similarity. - Transformed: circular/radial similarity (inside vs outside circle).

The Kernel Trick:

Dual QP uses data only via dot products \(x_i^T x_j\).
Instead of computing \(\phi(x_i)^T \phi(x_j)\) explicitly, define:

\[K(x_i, x_j) = \phi(x_i)^T \phi(x_j)\]
For this example: \(K(x, y) = (x^T y)^2\).
Never compute \(\phi\) — just square the dot product in code.
Same optimization, implicit high-dimensional mapping.

Kernels¶

Definition¶

Kernel \(K(x_i, x_j)\) replaces \(x_i^T x_j\) throughout the dual SVM.
Represents similarity between data points.
Mechanism for injecting domain knowledge into SVM learning.
Every valid kernel corresponds to some (possibly infinite-dimensional) feature map \(\phi\).
\(K(x, y)\) = dot product in that implicit feature space.

Common Kernels¶

Linear kernel (standard dot product):

\[K(x, y) = x^T y\]

Polynomial kernel (general form):

\[K(x, y) = (x^T y + c)^p\]

Parameters	Special case
\(p=1, c=0\) \| Linear kernel \(p=2, c=0\) \| Squared dot product

Related to polynomial regression — represents polynomial decision boundaries.

Radial Basis Function (RBF / Gaussian) kernel:

\[K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)\]

\(x \approx y\) → \(K \approx e^0 = 1\) (high similarity).
\(x, y\) far apart → \(K \approx 0\).
Symmetric Gaussian bump; \(\sigma\) controls width (like bandwidth).

Sigmoid kernel:

\[K(x, y) = \tanh(\alpha\, x^T y + c)\]

S-shaped transition between similarity levels.

Kernel Properties¶

Kernels work on non-numeric data if similarity is definable: - Strings: edit distance (few transformations → similar). - Graphs, images: custom similarity functions.
Not every function is a valid kernel.
Mercer Condition: technical requirement (positive semi-definite) for kernel to behave as a proper similarity/distance function. Required for all SVM math to hold.
Intuition: kernel must relate points consistently, not be arbitrary.
In practice: almost any reasonable similarity function has an equivalent (possibly infinite-dimensional) feature map.

SVM Summary¶

Boosting and Margins¶

Why Boosting Resists Overfitting¶

Observation: boosting training error → 0, but test error does not degrade — often keeps improving.

Boosted classifier:

\[h(x) = \text{sign}\left(\sum_t \alpha_t h_t(x)\right)\]

\(\alpha_t\) here = weight on weak hypothesis \(h_t\) (different from SVM dual \(\alpha_i\)).
\(\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}\) > 0 always (weak learner beats chance).

Normalized output (for analysis; same classification):

\[\frac{\sum_t \alpha_t h_t(x)}{\sum_t \alpha_t} \in [-1, +1]\]

Error vs confidence:

Output region	Interpretation
Correct, near +1 or −1 Correct, near 0 Incorrect, far from 0	Confident and correct Correct but unconfident Confident and wrong

KNN analogy: low variance among neighbors = high confidence.

After zero training error, continued boosting:

Training error stays at zero.
Hard examples (near boundary) get re-weighted; new weak learners focus on them.
Boundary points drift farther from decision boundary → confidence increases.
Effect: margin grows between + and − clusters.
Large margins → less overfitting (same principle as SVM).

More weak learners → smoother, less overfit classifier (counter-intuitive: more complexity, less overfit).

When Boosting Overfits¶

Quiz: Boosting Tends to Overfit — which conditions cause overfitting?

Why neural network weak learner fails:

Powerful NN perfectly fits training data → overfits.
Returns same overfit hypothesis every round (all examples equal weight after zero error).
Weighted sum of identical functions = that function → boosting cannot help.

Other overfitting case: pink noise (= uniform noise) in data.

Weak vs strong learner terminology:

Weak learner (formal): for any distribution \(D\), error \(\leq \frac{1}{2} - \epsilon\).
“Strong learner” is informal — not a precise technical term.
Any strong learner is also technically a weak learner (does better than chance).
Not weak ⟹ not a learner at all (error ≥ ½, provides no information).