Ensemble Learning: Boosting and Bagging¶
Ensemble Learning Overview¶
Core Idea¶
Combine many simple rules (each weak alone) into one strong composite classifier.
Each rule provides partial evidence; none alone is sufficient.
Analogous to decision tree nodes (simple rules combined by structure) or neural network features (combined by weights).
Spam email example — simple indicative rules:
Rule |
Sign |
Notes |
|---|---|---|
Body contains “manly” From spouse Very short / URL only Just an image Misspelled words “Make money fast” |
+spam −spam +spam +spam +spam +spam |
Evidence, not definitive Usually not spam Common spam pattern Pharmaceutical ads Blacklist approach Classic spam phrase |
Each rule is partially right but individually unreliable.
Need a principled way to combine them → ensemble learning.
Ensemble Learning Algorithm (General Form)¶
for each round:
learn a rule on a subset of data
combine all rules into final hypothesis
Two design questions:
How to select subsets of data?
How to combine the learned rules?
Bagging (Bootstrap Aggregation)¶
Subset Selection¶
Uniformly random subsets of training data.
Sample with replacement (bootstrap) — same point can appear multiple times.
Apply a learner to each subset → get hypothesis \(h_t\).
Combination¶
Regression: average (mean) of all hypotheses.
Classification: majority vote (or averaged scores + threshold).
Equal weight to each hypothesis (no reason one random subset is better).
Quiz: Ensemble Learning Outputs
Setup: \(N\) data points; learner = 0th-order polynomial (constant); \(N\) disjoint subsets of 1 point each; combine by mean.
Each learner outputs its single point’s \(y\) value.
Mean of all = mean of dataset = best 0th-order polynomial = constant.
Same as unweighted k-NN with \(k = n\).
Housing Data Example¶
9 data points; hold out 1 (green) for testing.
5 random subsets of 5 points (with replacement); learn 3rd-order polynomial on each; average.
Result (many trials): - Blue line (single 4th-order polynomial on all data): better on training points. - Red line (bagged average of 3rd-order): almost always better on held-out green point.
Bagged ensemble of simpler hypotheses outperforms single complex hypothesis on test data.
Why bagging works:
Random subsets prevent overfitting to individual noisy/outlier points.
Averaging reduces variance — same argument as cross-validation.
Mixing data focuses on important structure, not spurious fits.
Name: Bagging = Bootstrap Aggregation.
Boosting¶
Motivation vs Bagging¶
Bagging |
Boosting |
|
|---|---|---|
Subset selection Combination Learns from mistakes |
Random uniform Unweighted mean/vote No (independent rounds) |
Focus on hard examples Weighted mean/vote Yes (re-weight each round) |
Hard examples: points current ensemble misclassifies.
Intuition: spend effort on what you haven’t mastered (like learning a skill).
Spam analogy: don’t keep finding rules for already-classified mail; focus on misclassified messages.
Weighting prevents thrashing — mastered examples down-weighted so they aren’t forgotten entirely.
Distribution-Weighted Error¶
Standard error (implicit uniform distribution):
Distribution-weighted error:
\(D\) = distribution over examples (importance / frequency).
\(h\) = hypothesis; \(c\) = true concept.
Probability of disagreement, weighted by how often each example appears.
Quiz example — 4 examples, 2 correct / 2 wrong:
Example |
Correct? |
\(P(x)\) |
|---|---|---|
\(x_1\) | ✓ \(x_2\) | ✗ \(x_3\) | ✓ \(x_4\) | ✗ |
½ ¹⁄₂₀ ⁴⁄₁₀ ¹⁄₂₀ |
|
Uniform count: 50% error.
Distribution-weighted: \(\frac{1}{20} + \frac{1}{20} = \frac{1}{10}\) → 10% error (90% correct).
Rare mistakes matter less; frequent mistakes matter more.
Boosting uses distributions to define “hardest” = highest-weight misclassified examples.
Weak Learner¶
Definition: a learner that, for any distribution \(D\) over the data, produces a hypothesis with error strictly better than chance:
\(\epsilon > 0\) = small constant (bounded away from ½).
Always extracts some information regardless of example weighting.
Chance (coin flip) = ½ error = zero information.
Quiz: Weak Learning — 3 hypotheses, 4 examples:
Good distribution (¼ each): H1 gets ¾ correct → weak learner exists.
Evil distribution (½ on \(x_1\), ½ on \(x_2\), 0 elsewhere): all hypotheses get exactly 50% → no weak learner for this hypothesis space.
Fix: add more/better hypotheses (e.g., make H3 correct on \(x_2\)).
Finding a weak learner is easy for 2 seconds, hard for 4 — actually a strong requirement.
Need diverse hypotheses that do well on different example subsets.
Boosting Algorithm (Pseudocode)¶
Input: training set \(\{(x_i, y_i)\}\), labels \(y_i \in \{-1, +1\}\)
Initialize D_1(i) = 1/n for all i // uniform distribution
for t = 1 to T:
Find weak classifier h_t with small error ε_t on distribution D_t
(error_D_t(h_t) ≤ 1/2 - ε)
Compute α_t = (1/2) ln((1 - ε_t) / ε_t) // weight for h_t
Update distribution:
D_{t+1}(i) = (D_t(i) / Z_t) · exp(-α_t · y_i · h_t(x_i))
// Z_t = normalization constant (sums to 1)
Output: H(x) = sign(Σ_t α_t · h_t(x))
Key parameters:
Symbol |
Meaning |
|---|---|
\(D_t\) | Distribution over examples at round \(t\) \(h_t\) | Weak hypothesis at round \(t\) (outputs ±1) \(\epsilon_t\) | Weighted error of \(h_t\) on \(D_t\) \(\alpha_t\) | Weight of \(h_t\) in final vote (always > 0) \(Z_t\) | Normalization constant for valid distribution |
|
Distribution Update¶
\(y_i h_t(x_i) = +1\) when hypothesis agrees with label.
\(y_i h_t(x_i) = -1\) when hypothesis disagrees.
When they agree (\(y_i h_t = +1\)):
Exponent is negative → \(e^{-\alpha_t y_i h_t} < 1\) → weight decreases (relative to misclassified).
When they disagree (\(y_i h_t = -1\)):
Exponent is positive → \(e^{-\alpha_t y_i h_t} > 1\) → weight increases.
Quiz: When D agrees — answer: depends on normalization \(Z_t\) and other examples’ updates. Practically: correct examples decrease weight when at least one other is wrong; wrong examples increase.
Effect: misclassified (hard) examples get more weight next round.
Weak learner guarantee ensures progress on re-weighted hard examples.
Final Hypothesis¶
Weighted vote: better weak learners (lower \(\epsilon_t\)) get higher \(\alpha_t\).
\(\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}\) — measure of hypothesis quality.
Sign function thresholds: positive → +1, negative → −1, zero → undecided.
Information discarded by sign (confidence) becomes important in SVM lecture connection.
Worked Example: Three Boxes¶
Setup: 2D points (+ red, − green) in a square region.
Hypothesis space: axis-aligned semi-planes — horizontal or vertical line; one side = +, other = −.
Round 1 (uniform weights):
Learner returns vertical line separating left 2 pluses from rest.
Correct: 2 left pluses + all minuses. Wrong: 3 right pluses (“Three Plusketeers”).
\(\epsilon_1 = 0.3\); \(\alpha_1 = 0.42\).
Distribution update: wrong pluses ↑ weight; correct points ↓ weight.
Round 2:
Vertical line to right of 3 Plusketeers (everything left = +).
Correct: 3 Plusketeers + 2 left pluses. Wrong: 3 middle minuses (low weight).
\(\epsilon_2 = 0.21\); :math:`alpha_2 = 0.65$.
Distribution: middle minuses ↑; Plusketeers ↓ (but still above original); never-wrong points → near zero.
Round 3 — Quiz: Which Hypothesis?
A: horizontal line above middle minuses (best — separates heavily weighted points).
B: horizontal line lower (worse on weighted points).
C: vertical line (used in round 2, doesn’t fit current weights).
Answer: A. \(\epsilon_3 = 0.14\); \(\alpha_3 = 0.92\).
Final combined hypothesis:
Weighted sum \(\alpha_1 h_1 + \alpha_2 h_2 + \alpha_3 h_3\) passed through sign.
Produces non-axis-aligned curved boundary separating all points perfectly.
Simple axis-aligned weak learners → complex decision boundary via weighted combination + sign nonlinearity.
Analogous to: decision tree shapes, neural net weighted sums, weighted linear regression / k-NN.
Why Boosting Works (Intuition)¶
Re-weight misclassified examples → they become increasingly important.
Weak learner must beat chance on any distribution → forced to address hard examples.
Exponential weighting: examples wrong multiple times become dominant; can’t cycle forever.
Examples consistently correct → weight → 0 (irrelevant to future rounds).
Must accumulate information across rounds; can’t just oscillate on same errors.
Result: final hypothesis gets few weighted errors → converges to good combined classifier.
Formal proof in assigned reading (AdaBoost convergence bound).
Information gain perspective: each round must pick up new information; exponential weights prevent thrashing.
Summary¶
Method |
Key idea |
|---|---|
Ensemble Bagging Boosting Weak learner Distribution Update rule Final output Overfitting |
Combine simple rules from subsets → complex classifier Random subsets + unweighted average → reduce variance Re-weight hard examples + weighted vote → reduce bias Error < ½ − ε for any distribution \(D\) \(D_t(i)\) = importance of example \(i\) Increase weight on misclassified, decrease on correct \(H(x) = \text{sign}(\sum \alpha_t h_t(x))\) Boosting resists overfit (margin effect); fails if weak learner itself overfits |