Kernel Methods and SVM

Support Vector Machines

Motivation: Maximum Margin

Setup: 2D points labeled + (positive) and − (negative), linearly separable.

  • Infinitely many separating lines exist (e.g., 3 green × 3 red anchor points → 9 possible lines).

  • Best line: middle line with equal space on both sides — largest margin / demilitarized zone.

  • Lines close to one class risk misclassifying nearby unseen points → overfitting (believing training data too much).

  • Least commitment principle: be consistent with data while committing as little as possible.

  • This is the foundation of Support Vector Machines (SVMs).

Hyperplane Notation

  • In multiple dimensions, separating surface is a hyperplane:

    \[y = w^T x + b\]

    Symbol

    Meaning

    \(y\) \(w\) \(b\) \(x\)

    Classification label (+1 / −1) Parameter vector (normal to hyperplane) Bias — shifts hyperplane in/out of origin Input feature vector

  • Decision boundary (classifier output = 0, neither positive nor negative):

    \[w^T x + b = 0\]
  • Labels: \(y_i \in \{-1, +1\}\) (same convention as boosting).

  • Margin boundaries — parallel hyperplanes touching each class without misclassification:

    \[w^T x + b = +1 \quad \text{(positive class boundary)}\]
    \[w^T x + b = -1 \quad \text{(negative class boundary)}\]

Margin Derivation

Construction:

  1. Top gray line: as close as possible to + points without crossing → outputs +1 at first + point.

  2. Bottom gray line: as close as possible to − points → outputs −1.

  3. Middle orange line: decision boundary, equidistant from both gray lines.

Distance between parallel hyperplanes:

  • Pick \(x_1\) on +1 boundary, \(x_2\) on −1 boundary.

  • Subtract equations:

    \[w^T (x_1 - x_2) = 2\]
  • Divide both sides by \(|w|\) (unit normal):

    \[\frac{w^T}{|w|} (x_1 - x_2) = \frac{2}{|w|}\]
  • Left side = length of projection of \(x_1 - x_2\) onto \(w\).

  • \(w\) is perpendicular to hyperplane; \(x_1 - x_2\) is also perpendicular → projection equals full distance.

  • Margin \(m\):

    \[m = \frac{2}{|w|}\]

Optimization goal: maximize \(m\) ⟺ minimize \(|w|\), subject to correct classification.

  • Setting \(w = 0\) would maximize margin but fail to classify — need constraints.

Primal Optimization Problem

Correct classification as constraint:

  • Want: classifier output ≥ +1 for positive examples, ≤ −1 for negative examples.

  • Multiply both sides by label \(y_i\) (clever trick — flips inequality for negatives):

    \[y_i (w^T x_i + b) \geq 1 \quad \forall i \in \text{training set}\]

Primal formulation:

  • Original: maximize \(\frac{2}{|W|}\) subject to constraints above.

  • Equivalent (easier to solve): minimize \(\frac{1}{2}|W|^2\) subject to same constraints. - Reciprocal reverses max/min (for positive values). - Squaring is monotone → same optimum, magnified differences.

Quadratic Programming (QP):

  • Form: minimize quadratic function subject to linear constraints.

  • SVM primal always has a solution; always has a unique solution.

  • Solved via standard QP solvers.

Dual Form (Lagrange Multipliers)

Transform primal → dual (via Lagrange multipliers / KKT):

\[\max_\alpha \;\; \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j\]

Constraints:

\[\alpha_i \geq 0 \quad \forall i\]
\[\sum_i \alpha_i y_i = 0\]

Recover primal weights from dual:

\[w = \sum_i \alpha_i y_i x_i\]

Recover bias \(b\): plug any support vector (on margin boundary) into \(w^T x + b = \pm 1\).

Support Vectors

Key property: in the dual solution, most \(\alpha_i = 0\).

  • Non-zero \(\alpha_i\) → point is a support vector.

  • \(w = \sum_i \alpha_i y_i x_i\) — only non-zero-α points contribute.

  • Points far from decision boundary\(\alpha = 0\) (no influence on boundary).

  • Points on/near boundary → non-zero \(\alpha\) (define the margin contours).

Quiz: Optimal Separator — which points have zero α?

  • Answer: points far from the green decision line (e.g., lower-left −, upper-right +).

  • Points close to boundary are support vectors.

Analogy to KNN:

  • Like instance-based learning where only local points matter.

  • SVM pre-computes which points matter (via QP) — can discard the rest.

  • Not just nearest neighbors; QP selects the informative ones.

Non-Linearly Separable Data

Case 1: Outlier / noise

  • Single − point among + cluster → no zero-error linear separator; margin becomes negative.

  • Soft margin: find separator minimizing misclassifications while maximizing margin (homework).

  • Trade-off knob between margin size and number of errors.

Case 2: Non-linear structure (“Linearly Married”)

    • points form inner cluster; − points form outer ring.

  • No line separates classes in 2D.

  • Solution: transform data to higher dimensions where separation becomes linear.

Feature Mapping and the Kernel Trick

Feature Map φ — Worked Example

Problem: ring of − points around + cluster; not linearly separable in 2D.

Transform \(\phi\): 2D → 3D without adding external information:

\[\phi(q) = \left(q_1^2,\; q_2^2,\; \sqrt{2}\, q_1 q_2\right) \quad \text{where } q = (q_1, q_2)\]
  • Pluses near origin → lower 3rd coordinate.

  • Minuses on ring → higher 3rd coordinate.

  • Hyperplane in 3D separates the classes.

Quiz: What is the Output? — compute \(\phi(x)^T \phi(y)\):

\[\phi(x)^T \phi(y) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 = (x_1 y_1 + x_2 y_2)^2 = (x^T y)^2\]
  • Factorization: \((x^T y)^2\)squared dot product in original space.

  • Geometric shift: - Original \(x^T y\): projection-based (directional) similarity. - Transformed: circular/radial similarity (inside vs outside circle).

The Kernel Trick:

  • Dual QP uses data only via dot products \(x_i^T x_j\).

  • Instead of computing \(\phi(x_i)^T \phi(x_j)\) explicitly, define:

    \[K(x_i, x_j) = \phi(x_i)^T \phi(x_j)\]
  • For this example: \(K(x, y) = (x^T y)^2\).

  • Never compute \(\phi\) — just square the dot product in code.

  • Same optimization, implicit high-dimensional mapping.

Kernels

Definition

  • Kernel \(K(x_i, x_j)\) replaces \(x_i^T x_j\) throughout the dual SVM.

  • Represents similarity between data points.

  • Mechanism for injecting domain knowledge into SVM learning.

  • Every valid kernel corresponds to some (possibly infinite-dimensional) feature map \(\phi\).

  • \(K(x, y)\) = dot product in that implicit feature space.

Common Kernels

Linear kernel (standard dot product):

\[K(x, y) = x^T y\]

Polynomial kernel (general form):

\[K(x, y) = (x^T y + c)^p\]

Parameters

Special case

\(p=1, c=0\) | Linear kernel \(p=2, c=0\) | Squared dot product

  • Related to polynomial regression — represents polynomial decision boundaries.

Radial Basis Function (RBF / Gaussian) kernel:

\[K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)\]
  • \(x \approx y\)\(K \approx e^0 = 1\) (high similarity).

  • \(x, y\) far apart → \(K \approx 0\).

  • Symmetric Gaussian bump; \(\sigma\) controls width (like bandwidth).

Sigmoid kernel:

\[K(x, y) = \tanh(\alpha\, x^T y + c)\]
  • S-shaped transition between similarity levels.

Kernel Properties

  • Kernels work on non-numeric data if similarity is definable: - Strings: edit distance (few transformations → similar). - Graphs, images: custom similarity functions.

  • Not every function is a valid kernel.

  • Mercer Condition: technical requirement (positive semi-definite) for kernel to behave as a proper similarity/distance function. Required for all SVM math to hold.

  • Intuition: kernel must relate points consistently, not be arbitrary.

  • In practice: almost any reasonable similarity function has an equivalent (possibly infinite-dimensional) feature map.

SVM Summary

Boosting and Margins

Why Boosting Resists Overfitting

Observation: boosting training error → 0, but test error does not degrade — often keeps improving.

Boosted classifier:

\[h(x) = \text{sign}\left(\sum_t \alpha_t h_t(x)\right)\]
  • \(\alpha_t\) here = weight on weak hypothesis \(h_t\) (different from SVM dual \(\alpha_i\)).

  • \(\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}\) > 0 always (weak learner beats chance).

Normalized output (for analysis; same classification):

\[\frac{\sum_t \alpha_t h_t(x)}{\sum_t \alpha_t} \in [-1, +1]\]

Error vs confidence:

Output region

Interpretation

Correct, near +1 or −1 Correct, near 0 Incorrect, far from 0

Confident and correct Correct but unconfident Confident and wrong

  • KNN analogy: low variance among neighbors = high confidence.

After zero training error, continued boosting:

  1. Training error stays at zero.

  2. Hard examples (near boundary) get re-weighted; new weak learners focus on them.

  3. Boundary points drift farther from decision boundary → confidence increases.

  4. Effect: margin grows between + and − clusters.

  5. Large margins → less overfitting (same principle as SVM).

  • More weak learners → smoother, less overfit classifier (counter-intuitive: more complexity, less overfit).

When Boosting Overfits

Quiz: Boosting Tends to Overfit — which conditions cause overfitting?

Why neural network weak learner fails:

  • Powerful NN perfectly fits training data → overfits.

  • Returns same overfit hypothesis every round (all examples equal weight after zero error).

  • Weighted sum of identical functions = that function → boosting cannot help.

Other overfitting case: pink noise (= uniform noise) in data.

Weak vs strong learner terminology:

  • Weak learner (formal): for any distribution \(D\), error \(\leq \frac{1}{2} - \epsilon\).

  • “Strong learner” is informal — not a precise technical term.

  • Any strong learner is also technically a weak learner (does better than chance).

  • Not weak ⟹ not a learner at all (error ≥ ½, provides no information).