Instance Based Learning¶

Revision reference for lazy learning, k-NN, complexity, biases, and the curse of dimensionality. (ML textbook Ch. 8)

Eager vs Lazy Learning¶

Eager learners (e.g., linear regression): learn a compact function f from training data (x,y pairs), then discard data; query = f(x).

Lazy / instance-based learners: store all training data; defer computation to query time.

Lazy advantages	Lazy disadvantages
Exact recall of seen points	No generalization for unseen x
No training phase	Sensitive to noise (memorization)
Simple	Duplicate x with different y → ambiguous

Lookup baseline: F(x) = database lookup(x) — exact match only; no generalization.

1-Nearest Neighbor¶

For unseen query q: label = label of nearest training point (by distance metric).

Works when locality holds (e.g., house prices by map location).
Fails when neighbors disagree (boundary / ambiguous regions).

Georgia Tech housing example: classify house price (red/blue/green) by map location.

Black query near green cluster → 1-NN → green ✓
Black query in red cluster → 1-NN → red ✓
Black query between red, blue, green → 1-NN ambiguous → use k > 1

k-Nearest Neighbor (k-NN)¶

Algorithm (pseudocode):

Given training set D, distance metric, integer k, query q.
Find set NN = k training points closest to q (break ties: include all at k-th distance).
Classification: return mode (plurality vote) of labels in NN.
Regression: return mean of y values in NN.

Pseudocode:

KNN(D, dist, k, q):
  NN = {k points in D with smallest dist(·, q)}  # tie-break: all at k-th distance
  if classification: return mode({y : (x,y) in NN})
  if regression:    return mean({y : (x,y) in NN})

Tie-breaking (classification): pick alphabetically, random, or global label frequency.

Free parameters (domain knowledge):

k: number of neighbors (k=1 brittle; larger k smoother).
Distance / similarity metric: how “near” is defined.

Variants:

Weighted vote / average: weight by 1/distance (closer neighbors matter more).
k = n with weighted average still varies by query location (not constant function).

Distance Metrics¶

Euclidean (L2): √(Σ(x_i − q_i)²) — order preserved without sqrt for ranking.
Manhattan (L1): Σ|x_i − q_i|
Weighted distances: scale dimensions differently (addresses unequal feature importance).
Discrete / mixed features: mismatch counts, domain-specific similarity (images, etc.).

Distance = domain knowledge; wrong metric → poor performance regardless of k.

Complexity (1-D, sorted data, n points)¶

	1-NN learn	1-NN query	k-NN query	Lin. reg.
Time	O(1)	O(log n)	O(log n + k)	O(n) learn
				O(1) query
Space (model)	O(n)	O(1) query	O(1) query	O(1)

Unsorted 1-D: nearest neighbor scan O(n).
Trade-off: lazy = cheap learn, cost at query; eager = pay once, cheap queries.
If querying >> n times, eager can win overall.

Lazy vs eager: nearest-neighbor = lazy (procrastinate until query); linear regression = eager (learn upfront).

k-NN in higher dimensions: brute-force query O(n) per query; spatial data structures (k-d trees) help low d but degrade in high d.

Domain Knowledge Example¶

True function: y = x₁² + x₂. Query (4, 2) → true y = 18.

Method	Prediction
1-NN Euclidean	8
3-NN Euclidean	42
1-NN Manhattan	29
3-NN Manhattan	35.5
Custom (sq x₁)	~12 (closer)

Lesson: k, metric, and weighting encode assumptions; none matched the true function here — yet k-NN often works well when biases align with the domain.

k-NN Preference Biases¶

Locality: nearby points are similar (via chosen metric).
Smoothness: averaging/voting assumes labels change smoothly over space.
Equal feature relevance: standard L1/L2 treat all dimensions equally — bad when features have different scales or importance (e.g., x₁² vs x₁ in target function).

Optimal distance metric exists per fixed problem but may be hard to find; oracle metric (same label → distance 0) requires solving the problem.

Curse of Dimensionality¶

Bellman’s curse of dimensionality: as dimensionality d grows, data needed to generalize grows exponentially (~10^d points to maintain fixed neighborhood coverage).

1-D line: 10 points cover 1/10 each.
2-D square: need ~100 points (10×10 grid) for same local density.
d dimensions: ~10^d points.

Implications for k-NN:

“Nearest” neighbors become far in high dimensions → weak locality.
Irrelevant features dilute signal (all features weighted equally by default).
Adding features without more data hurts generalization.

Mitigations: feature selection, weighted metrics, dimensionality reduction (covered later in course).

Locally Weighted Regression¶

Replace simple mean with local model fit on neighbors:

Weight neighbors by similarity (e.g., 1/distance).
Fit linear regression (or any learner) on weighted neighborhood → locally weighted linear regression (LWLR).
k = n with weighting: prediction varies by query; can approximate curves from linear local pieces.
Can substitute decision trees, neural nets, etc. for the local model.

Effect: simple global hypothesis class + local fitting → effectively richer hypothesis space.