Kernel Methods and SVM¶
Support Vector Machines¶
Motivation: Maximum Margin¶
Setup: 2D points labeled + (positive) and − (negative), linearly separable.
Infinitely many separating lines exist (e.g., 3 green × 3 red anchor points → 9 possible lines).
Best line: middle line with equal space on both sides — largest margin / demilitarized zone.
Lines close to one class risk misclassifying nearby unseen points → overfitting (believing training data too much).
Least commitment principle: be consistent with data while committing as little as possible.
This is the foundation of Support Vector Machines (SVMs).
Hyperplane Notation¶
In multiple dimensions, separating surface is a hyperplane:
\[y = w^T x + b\]Symbol
Meaning
\(y\) \(w\) \(b\) \(x\)
Classification label (+1 / −1) Parameter vector (normal to hyperplane) Bias — shifts hyperplane in/out of origin Input feature vector
Decision boundary (classifier output = 0, neither positive nor negative):
\[w^T x + b = 0\]Labels: \(y_i \in \{-1, +1\}\) (same convention as boosting).
Margin boundaries — parallel hyperplanes touching each class without misclassification:
\[w^T x + b = +1 \quad \text{(positive class boundary)}\]\[w^T x + b = -1 \quad \text{(negative class boundary)}\]
Margin Derivation¶
Construction:
Top gray line: as close as possible to + points without crossing → outputs +1 at first + point.
Bottom gray line: as close as possible to − points → outputs −1.
Middle orange line: decision boundary, equidistant from both gray lines.
Distance between parallel hyperplanes:
Pick \(x_1\) on +1 boundary, \(x_2\) on −1 boundary.
Subtract equations:
\[w^T (x_1 - x_2) = 2\]Divide both sides by \(|w|\) (unit normal):
\[\frac{w^T}{|w|} (x_1 - x_2) = \frac{2}{|w|}\]Left side = length of projection of \(x_1 - x_2\) onto \(w\).
\(w\) is perpendicular to hyperplane; \(x_1 - x_2\) is also perpendicular → projection equals full distance.
Margin \(m\):
\[m = \frac{2}{|w|}\]
Optimization goal: maximize \(m\) ⟺ minimize \(|w|\), subject to correct classification.
Setting \(w = 0\) would maximize margin but fail to classify — need constraints.
Primal Optimization Problem¶
Correct classification as constraint:
Want: classifier output ≥ +1 for positive examples, ≤ −1 for negative examples.
Multiply both sides by label \(y_i\) (clever trick — flips inequality for negatives):
\[y_i (w^T x_i + b) \geq 1 \quad \forall i \in \text{training set}\]
Primal formulation:
Original: maximize \(\frac{2}{|W|}\) subject to constraints above.
Equivalent (easier to solve): minimize \(\frac{1}{2}|W|^2\) subject to same constraints. - Reciprocal reverses max/min (for positive values). - Squaring is monotone → same optimum, magnified differences.
Quadratic Programming (QP):
Form: minimize quadratic function subject to linear constraints.
SVM primal always has a solution; always has a unique solution.
Solved via standard QP solvers.
Dual Form (Lagrange Multipliers)¶
Transform primal → dual (via Lagrange multipliers / KKT):
Constraints:
Recover primal weights from dual:
Recover bias \(b\): plug any support vector (on margin boundary) into \(w^T x + b = \pm 1\).
Support Vectors¶
Key property: in the dual solution, most \(\alpha_i = 0\).
Non-zero \(\alpha_i\) → point is a support vector.
\(w = \sum_i \alpha_i y_i x_i\) — only non-zero-α points contribute.
Points far from decision boundary → \(\alpha = 0\) (no influence on boundary).
Points on/near boundary → non-zero \(\alpha\) (define the margin contours).
Quiz: Optimal Separator — which points have zero α?
Answer: points far from the green decision line (e.g., lower-left −, upper-right +).
Points close to boundary are support vectors.
Analogy to KNN:
Like instance-based learning where only local points matter.
SVM pre-computes which points matter (via QP) — can discard the rest.
Not just nearest neighbors; QP selects the informative ones.
Non-Linearly Separable Data¶
Case 1: Outlier / noise
Single − point among + cluster → no zero-error linear separator; margin becomes negative.
Soft margin: find separator minimizing misclassifications while maximizing margin (homework).
Trade-off knob between margin size and number of errors.
Case 2: Non-linear structure (“Linearly Married”)
points form inner cluster; − points form outer ring.
No line separates classes in 2D.
Solution: transform data to higher dimensions where separation becomes linear.
Feature Mapping and the Kernel Trick¶
Feature Map φ — Worked Example¶
Problem: ring of − points around + cluster; not linearly separable in 2D.
Transform \(\phi\): 2D → 3D without adding external information:
Pluses near origin → lower 3rd coordinate.
Minuses on ring → higher 3rd coordinate.
Hyperplane in 3D separates the classes.
Quiz: What is the Output? — compute \(\phi(x)^T \phi(y)\):
Factorization: \((x^T y)^2\) — squared dot product in original space.
Geometric shift: - Original \(x^T y\): projection-based (directional) similarity. - Transformed: circular/radial similarity (inside vs outside circle).
The Kernel Trick:
Dual QP uses data only via dot products \(x_i^T x_j\).
Instead of computing \(\phi(x_i)^T \phi(x_j)\) explicitly, define:
\[K(x_i, x_j) = \phi(x_i)^T \phi(x_j)\]For this example: \(K(x, y) = (x^T y)^2\).
Never compute \(\phi\) — just square the dot product in code.
Same optimization, implicit high-dimensional mapping.
Kernels¶
Definition¶
Kernel \(K(x_i, x_j)\) replaces \(x_i^T x_j\) throughout the dual SVM.
Represents similarity between data points.
Mechanism for injecting domain knowledge into SVM learning.
Every valid kernel corresponds to some (possibly infinite-dimensional) feature map \(\phi\).
\(K(x, y)\) = dot product in that implicit feature space.
Common Kernels¶
Linear kernel (standard dot product):
Polynomial kernel (general form):
Parameters |
Special case |
|---|---|
\(p=1, c=0\) | Linear kernel \(p=2, c=0\) | Squared dot product |
|
Related to polynomial regression — represents polynomial decision boundaries.
Radial Basis Function (RBF / Gaussian) kernel:
\(x \approx y\) → \(K \approx e^0 = 1\) (high similarity).
\(x, y\) far apart → \(K \approx 0\).
Symmetric Gaussian bump; \(\sigma\) controls width (like bandwidth).
Sigmoid kernel:
S-shaped transition between similarity levels.
Kernel Properties¶
Kernels work on non-numeric data if similarity is definable: - Strings: edit distance (few transformations → similar). - Graphs, images: custom similarity functions.
Not every function is a valid kernel.
Mercer Condition: technical requirement (positive semi-definite) for kernel to behave as a proper similarity/distance function. Required for all SVM math to hold.
Intuition: kernel must relate points consistently, not be arbitrary.
In practice: almost any reasonable similarity function has an equivalent (possibly infinite-dimensional) feature map.
SVM Summary¶
Boosting and Margins¶
Why Boosting Resists Overfitting¶
Observation: boosting training error → 0, but test error does not degrade — often keeps improving.
Boosted classifier:
\(\alpha_t\) here = weight on weak hypothesis \(h_t\) (different from SVM dual \(\alpha_i\)).
\(\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}\) > 0 always (weak learner beats chance).
Normalized output (for analysis; same classification):
Error vs confidence:
Output region |
Interpretation |
|---|---|
Correct, near +1 or −1 Correct, near 0 Incorrect, far from 0 |
Confident and correct Correct but unconfident Confident and wrong |
KNN analogy: low variance among neighbors = high confidence.
After zero training error, continued boosting:
Training error stays at zero.
Hard examples (near boundary) get re-weighted; new weak learners focus on them.
Boundary points drift farther from decision boundary → confidence increases.
Effect: margin grows between + and − clusters.
Large margins → less overfitting (same principle as SVM).
More weak learners → smoother, less overfit classifier (counter-intuitive: more complexity, less overfit).
When Boosting Overfits¶
Quiz: Boosting Tends to Overfit — which conditions cause overfitting?
Why neural network weak learner fails:
Powerful NN perfectly fits training data → overfits.
Returns same overfit hypothesis every round (all examples equal weight after zero error).
Weighted sum of identical functions = that function → boosting cannot help.
Other overfitting case: pink noise (= uniform noise) in data.
Weak vs strong learner terminology:
Weak learner (formal): for any distribution \(D\), error \(\leq \frac{1}{2} - \epsilon\).
“Strong learner” is informal — not a precise technical term.
Any strong learner is also technically a weak learner (does better than chance).
Not weak ⟹ not a learner at all (error ≥ ½, provides no information).