Instance Based Learning¶
Revision reference for lazy learning, k-NN, complexity, biases, and the curse of dimensionality. (ML textbook Ch. 8)
Eager vs Lazy Learning¶
Eager learners (e.g., linear regression): learn a compact function f from training data (x,y pairs), then discard data; query = f(x).
Lazy / instance-based learners: store all training data; defer computation to query time.
Lazy advantages |
Lazy disadvantages |
|---|---|
Exact recall of seen points |
No generalization for unseen x |
No training phase |
Sensitive to noise (memorization) |
Simple |
Duplicate x with different y → ambiguous |
Lookup baseline: F(x) = database lookup(x) — exact match only; no generalization.
1-Nearest Neighbor¶
For unseen query q: label = label of nearest training point (by distance metric).
Works when locality holds (e.g., house prices by map location).
Fails when neighbors disagree (boundary / ambiguous regions).
Georgia Tech housing example: classify house price (red/blue/green) by map location.
Black query near green cluster → 1-NN → green ✓
Black query in red cluster → 1-NN → red ✓
Black query between red, blue, green → 1-NN ambiguous → use k > 1
k-Nearest Neighbor (k-NN)¶
Algorithm (pseudocode):
Given training set D, distance metric, integer k, query q.
Find set NN = k training points closest to q (break ties: include all at k-th distance).
Classification: return mode (plurality vote) of labels in NN.
Regression: return mean of y values in NN.
Pseudocode:
KNN(D, dist, k, q):
NN = {k points in D with smallest dist(·, q)} # tie-break: all at k-th distance
if classification: return mode({y : (x,y) in NN})
if regression: return mean({y : (x,y) in NN})
Tie-breaking (classification): pick alphabetically, random, or global label frequency.
Free parameters (domain knowledge):
k: number of neighbors (k=1 brittle; larger k smoother).
Distance / similarity metric: how “near” is defined.
Variants:
Weighted vote / average: weight by 1/distance (closer neighbors matter more).
k = n with weighted average still varies by query location (not constant function).
Distance Metrics¶
Euclidean (L2): √(Σ(x_i − q_i)²) — order preserved without sqrt for ranking.
Manhattan (L1): Σ|x_i − q_i|
Weighted distances: scale dimensions differently (addresses unequal feature importance).
Discrete / mixed features: mismatch counts, domain-specific similarity (images, etc.).
Distance = domain knowledge; wrong metric → poor performance regardless of k.
Complexity (1-D, sorted data, n points)¶
1-NN learn |
1-NN query |
k-NN query |
Lin. reg. |
|
|---|---|---|---|---|
Time |
O(1) |
O(log n) |
O(log n + k) |
O(n) learn |
O(1) query |
||||
Space (model) |
O(n) |
O(1) query |
O(1) query |
O(1) |
Unsorted 1-D: nearest neighbor scan O(n).
Trade-off: lazy = cheap learn, cost at query; eager = pay once, cheap queries.
If querying >> n times, eager can win overall.
Lazy vs eager: nearest-neighbor = lazy (procrastinate until query); linear regression = eager (learn upfront).
k-NN in higher dimensions: brute-force query O(n) per query; spatial data structures (k-d trees) help low d but degrade in high d.
Domain Knowledge Example¶
True function: y = x₁² + x₂. Query (4, 2) → true y = 18.
Method |
Prediction |
|---|---|
1-NN Euclidean |
8 |
3-NN Euclidean |
42 |
1-NN Manhattan |
29 |
3-NN Manhattan |
35.5 |
Custom (sq x₁) |
~12 (closer) |
Lesson: k, metric, and weighting encode assumptions; none matched the true function here — yet k-NN often works well when biases align with the domain.
k-NN Preference Biases¶
Locality: nearby points are similar (via chosen metric).
Smoothness: averaging/voting assumes labels change smoothly over space.
Equal feature relevance: standard L1/L2 treat all dimensions equally — bad when features have different scales or importance (e.g., x₁² vs x₁ in target function).
Optimal distance metric exists per fixed problem but may be hard to find; oracle metric (same label → distance 0) requires solving the problem.
Curse of Dimensionality¶
Bellman’s curse of dimensionality: as dimensionality d grows, data needed to generalize grows exponentially (~10^d points to maintain fixed neighborhood coverage).
1-D line: 10 points cover 1/10 each.
2-D square: need ~100 points (10×10 grid) for same local density.
d dimensions: ~10^d points.
Implications for k-NN:
“Nearest” neighbors become far in high dimensions → weak locality.
Irrelevant features dilute signal (all features weighted equally by default).
Adding features without more data hurts generalization.
Mitigations: feature selection, weighted metrics, dimensionality reduction (covered later in course).
Locally Weighted Regression¶
Replace simple mean with local model fit on neighbors:
Weight neighbors by similarity (e.g., 1/distance).
Fit linear regression (or any learner) on weighted neighborhood → locally weighted linear regression (LWLR).
k = n with weighting: prediction varies by query; can approximate curves from linear local pieces.
Can substitute decision trees, neural nets, etc. for the local model.
Effect: simple global hypothesis class + local fitting → effectively richer hypothesis space.