Bayesian Learning¶
Revision reference for MAP/ML learning, Bayes’ rule, noise models, MDL, and Bayesian classification.
Introduction¶
Bayesian learning goal: find the most probable hypothesis given data and domain knowledge.
h: hypothesis; D: training data; Pr: probability.
Bayes’ Rule¶
Term meanings:
Term |
Name |
Role |
|---|---|---|
Pr(h|D) |
Posterior |
What we want — belief in h after data |
Pr(D|h) |
Likelihood |
P(labels | h); often “run h on data” |
Pr(h) |
Prior |
Domain knowledge over hypotheses |
Pr(D) |
Evidence |
Normalizer; often ignored for argmax |
Likelihood intuition: D = {(x_i, d_i)}. Pr(D|h) = probability of seeing labels d_i given inputs x_i if h were true.
Example: h(x) = true iff x ≥ 10. Data: x=7, label=true → Pr(D|h) = 0.
Posterior increases when: prior Pr(h) ↑ or likelihood Pr(D|h) ↑. Pr(D) ↓ also raises posterior but is not h-specific.
Chain rule derivation: P(A,B) = P(A|B)P(B) = P(B|A)P(A) → Bayes’ rule.
Prior as domain knowledge: Pr(h) encodes beliefs before data — analogous to:
k-NN: Euclidean distance assumes nearby points share labels.
Decision trees: feature / information-gain preferences.
Neural nets: architecture choices.
When Pr(h|D) increases: higher Pr(h), higher Pr(D|h), or lower Pr(D).
Bayes’ Rule Quiz: Rare Disease¶
Spleentitis test: sensitivity 98%, specificity 97%, prevalence 0.8%. Patient tests positive. P(disease | +)?
→ P(S|+) ≈ 21% — more likely no disease despite accurate test.
Lesson: priors matter. Test useful only when prior is high enough (screen high-risk populations, not random walk-ins).
MAP and Maximum Likelihood¶
MAP algorithm:
For each h ∈ H: score ∝ Pr(D|h) · Pr(h)
Return h with maximum score.
Drop Pr(D) for argmax (constant w.r.t. h).
Maximum A Posteriori (MAP): uses both likelihood and prior Pr(h).
Maximum Likelihood (ML): uniform prior over H → maximize Pr(D|h) only.
MAP = domain knowledge in prior; ML = all hypotheses equally likely a priori.
Conceptual gold standard but intractable when |H| is large or infinite (e.g., linear separators).
Bayesian Learning: Noise-Free Version Space¶
Assumptions: noise-free labels d_i = c(x_i); true concept c ∈ H; uniform prior over H.
Pr(h) = 1/|H|
Pr(D|h) = 1 if h consistent with all labels, else 0
Pr(D) = |VS| / |H| where VS = version space (consistent hypotheses)
→ All consistent hypotheses equally probable → pick any h ∈ VS (matches earlier learning-theory advice).
Derivation sketch:
Substitute into Bayes’ rule → uniform over VS.
Noisy Data Quiz¶
True process: label = k · x with P(k) = 1/2^k (geometric). Hypothesis h(x) = x (identity).
Data: (1,5), (3,6), (11,11), (12,36), (20,100).
Example: (1,5) → k=5 → 1/32; product over all i = 1/65536.
IID labels → product of per-example likelihoods.
Gaussian Noise → Sum of Squared Errors¶
Setup: d_i = f(x_i) + ε_i, ε_i ~ N(0, σ²) i.i.d.; f unknown; h approximates f.
ML hypothesis maximizes ∏_i Pr(d_i | h, x_i). Gaussian likelihood:
Take log, drop constants, flip max to min:
→ Minimizing sum of squared errors (SSE) = ML under zero-mean Gaussian noise on outputs.
Derivation steps:
ML: argmax_h ∏ᵢ Pr(dᵢ | h, xᵢ)
Log → sum of log-Gaussians
Drop constants (1/√2πσ²) and σ² inside argmax
log exp(·) → sum of squared residuals with negative sign
argmax → argmin Σ(h(xᵢ) − dᵢ)²
Assumptions baked in:
Deterministic f plus i.i.d. Gaussian ε (same σ).
Noise on outputs only (not on inputs) — if x also noisy, SSE may be wrong model.
Different noise models → different loss functions.
Best Hypothesis Quiz¶
Uniform prior over {constant, linear, mod-9}. SSE picks mod-9 for given data (lowest error), beating best linear regression — SSE does not prefer simpler models without explicit penalty.
Quiz data (x → d): (1,1), (2,2), (3,3), (9,9), (10,1), (11,2) — mod-9 fits local structure; best linear reg SSE ≈ 15.8; constant ≈ 19; mod-9 SSE = 12.
Minimum Description Length (MDL)¶
Apply log base 2 to MAP (monotonic → same argmax), negate to minimize:
Information-theoretic interpretation (optimal code length = −log₂ P):
−log₂ Pr(h) = description length of hypothesis (bits to encode h).
−log₂ Pr(D|h) ≈ description length of data given h (errors / mismatches must be transmitted).
→ MAP = minimize error + model complexity → Occam’s razor / MDL.
Shorter decision trees = smaller −log Pr(h) → prefer pruned/smaller trees.
Large neural-net weights need more bits → complexity penalty links to overfitting.
Minimum description length: trade off fit vs simplicity; units of error vs size may need tuning.
Bayesian Classification vs MAP Hypothesis¶
Critical distinction: goal is often best label, not best hypothesis.
Example: three hypotheses with P(h|D) = 0.4, 0.3, 0.3; labels on x: +, −, −.
MAP hypothesis → h₁ → label +
Bayesian optimal label: weighted vote over all h
P(−) = 0.3 + 0.3 = 0.6 > P(+) = 0.4 → predict −.
Bayesian classification:
Analogous to boosting, weighted k-NN: combine hypotheses by posterior weight, don’t commit to single MAP h.