Machine Learning

Supervised Learning

k-Nearest Neighbors (k-NN)

  • Classify a new point by finding the k nearest examples in training data and applying majority label

  • Distance metric (typically Euclidean) determines “nearest”

  • k=1: decision boundary is Voronoi tessellation of training points

  • Larger k smooths boundaries but may lose local structure

  • Non-parametric: no assumptions about data distribution

  • Lazy learner — stores all training data, no explicit model built

Cross-Validation

  • Split data into k folds; train on k-1, test on remaining fold; rotate and average error

  • Estimates generalization performance without wasting data

  • For time-series/sequence data, random fold selection is invalid — use temporal splits

  • Leave-One-Out CV (LOOCV_): k = n; low bias, high variance of error estimate

  • Use cross-validation to select hyperparameters (e.g., k in k-NN, tree depth)

Gaussian Distribution and Bayes Classification

  • Gaussian (Normal) distribution: parameterized by mean μ and variance σ²

  • Central Limit Theorem: sum of many independent random variables → Gaussian

  • Decision boundary between two Gaussians: point where posterior probabilities are equal

  • Classification error = area of overlap between class-conditional distributions

Bayes Classifier:

  • Optimal classifier minimizes total expected error

  • Assigns class with highest posterior: argmax P(C|x) = argmax P(x|C)P(C)

  • Requires knowing true class-conditional densities

Naive Bayes:

  • Assumes features are conditionally independent given class: P(x₁,x₂,…,xₙ|C) = ∏ P(xᵢ|C)

  • Maximum likelihood estimation for parameters from training data

  • Works well even when independence assumption is violated

  • Reference: Naive Bayes Classifier with insect examples

No Free Lunch Theorem:

  • No single learning algorithm dominates all others across all possible problems

  • Algorithm choice depends on problem structure and assumptions

  • Reference: No Free Lunch Theorems (Wolpert & Macready)

Naive Bayes vs k-NN:

  • Naive Bayes: fast training, works well with high-dimensional data, assumes feature independence

  • k-NN: no training phase, captures complex boundaries, sensitive to irrelevant features and curse of dimensionality

Using Mixture of Gaussians

  • Model complex distributions as weighted sum of Gaussians

  • Kernel Density Estimation as non-parametric alternative

  • Cross-validation to avoid overfitting when selecting number of components

Decision Trees

Basics

  • Hierarchical partitioning of feature space via axis-aligned splits

  • Internal nodes: test on attribute; branches: attribute values; leaves: class labels

  • Discrete attributes: multi-way split on values

  • Continuous attributes: binary split on threshold (e.g., x > t)

  • Decision trees are understandable and easy to explain

  • Reference: Decision Trees, Daniel Kohlsdorf

Entropy and Information Gain

Entropy measures impurity of a set S:

H(S) = -Σ pᵢ log₂(pᵢ)

  • H = 0 when all examples belong to one class (pure)

  • H = 1 bit for binary classification with equal split

  • Used to determine optimal attribute for splitting

Information Gain = reduction in entropy from splitting on attribute A:

IG(S, A) = H(S) - Σ (|Sᵥ|/|S|) H(Sᵥ)

  • Select attribute with highest information gain at each node

  • Same attribute can appear at multiple levels in the tree

  • Greedy top-down construction (ID3/C4.5 algorithm)

Minimum Description Length

  • Occam’s Razor applied to trees: prefer simpler models

  • MDL: best hypothesis minimizes description length of data + hypothesis

  • Pruning: remove subtrees that don’t improve generalization

  • Balance tree complexity against training accuracy

Random Forests

  • Ensemble of decision trees trained on bootstrap samples (bagging)

  • Each tree considers random subset of features at each split

  • Final prediction: majority vote across all trees

  • Reduces variance/overfitting compared to single tree

Boosting

  • Ensemble method that combines weak learners sequentially

  • Each new learner focuses on examples misclassified by previous learners

  • Reweight training examples: increase weight of misclassified, decrease correctly classified

  • Final classifier: weighted vote of all weak learners

  • AdaBoost: exponential loss, weight α_t = ½ ln((1-ε_t)/ε_t) where ε_t is weighted error

  • Highly resistant to overfitting in practice (margin theory)

  • References: Tutorial on Boosting, Short Introduction to Boosting (Freund & Schapire)

Neural Networks

Perceptron

  • Single neuron: output a = f(w₀ + Σ wᵢxᵢ) where f is activation function

  • Step activation: a = 1 if w₀ + w·x > 0, else 0

  • Can represent linearly separable functions (AND, OR, NOT, NOR)

  • Cannot represent XOR (not linearly separable)

NOR gate example:

a = { true if w0 + i1*w1 + i2*w2 > 0, else false }

Truth table for NOR:
i1=0, i2=0 → 1
i1=0, i2=1 → 0
i1=1, i2=0 → 0
i1=1, i2=1 → 0

Weights: w0 > 0, w1 < -w0, w2 < -w0 (e.g., w0=0.5, w1=-1.0, w2=-1.0)

Perceptron Learning Rule:

  • w ← w + α(y - ŷ)x where α is learning rate

  • Converges if data is linearly separable (Perceptron Convergence Theorem)

  • No convergence guarantee for non-separable data

Multilayer Networks

  • Hidden layers allow representation of non-linear decision boundaries

  • With one hidden layer and enough units: universal function approximator

  • Sigmoid/logistic activation: σ(z) = 1/(1+e⁻ᶻ) — smooth, differentiable

  • Network topology: input layer → hidden layer(s) → output layer

Back-Propagation

  • Gradient descent on network weights to minimize error

  • Chain rule propagates error gradients from output back through hidden layers

  • Weight update: wᵢⱼ ← wᵢⱼ - α ∂E/∂wᵢⱼ

  • For output layer: δⱼ = (yⱼ - tⱼ)f’(netⱼ)

  • For hidden layer: δⱼ = f’(netⱼ) Σ wⱼₖδₖ

  • Issues: local minima, slow convergence, vanishing gradients in deep networks

  • Reference: Neural Networks Slides

Deep Learning

  • Multiple hidden layers learn hierarchical representations

  • Techniques to train deep nets: pre-training, dropout, batch normalization, ReLU activations

  • Resources:

Unsupervised Learning

k-Means Clustering

  • Partition n points into k clusters by minimizing within-cluster sum of squares

  • Algorithm: (1) initialize k centroids, (2) assign points to nearest centroid, (3) recompute centroids, (4) repeat until convergence

  • Converges to local optimum; sensitive to initialization

  • Hard assignment: each point belongs to exactly one cluster

Expectation Maximization (EM)

  • Generalization of k-means with soft (probabilistic) assignments

  • E-step: compute posterior probability of each point belonging to each cluster

  • M-step: update parameters (means, covariances, mixing weights) using soft assignments

  • Guaranteed to increase likelihood at each iteration; converges to local maximum

  • Reference: Pattern Recognition and Machine Learning (Bishop)

Gaussian Mixture Models (GMM)

  • Model data as mixture of k Gaussians: P(x) = Σ πₖ N(x|μₖ,Σₖ)

  • Parameters: mixing coefficients πₖ, means μₖ, covariance matrices Σₖ

  • Trained using EM algorithm

  • More flexible than k-means: captures elliptical clusters, soft boundaries

  • Application: learning significant locations from GPS data (Ashbrook & Starner)

  • Reference: Using GPS to Learn Significant Locations and Predict Movement Across Multiple Users

Resources

  • Entropy and Information Gain

  • Wild Dolphin Project

  • CHAT (Cetacean Hearing and Telemetry)