Decision Trees¶

Most popular technique in data mining and machine learning. Core tension throughout ML: complicated model that fits data well vs. Occam’s Razor — simplest adequate theory.

Occam’s Razor: prefer the simplest theory that explains the data.

Supervised Learning — Classification vs Regression¶

Classification: map input \(x\) → discrete label (true/false, male/female, car vs cougar).

Regression: map input → continuous real value (points on a line, hair length in inches, age 23.7).

Preview: regression covered in more detail in a later lecture; this unit focuses on classification via decision trees.

Classification vs Regression¶

Two types of supervised learning:

| Classification | Regression |

|---|———————|----------------| | Output | Discrete label (small finite set) | Continuous real value | | Examples | Male/female; lend/don’t lend; HS/college/grad | Predict age 23.7; credit score; hair length | | Input | Can be discrete, continuous, or complex | Same |

The task type is determined by the output, not the input.

Quiz examples:

Lend money (yes/no) → classification (binary)
Education level (HS / college / grad) → classification (trinary; still discrete)
Age (possibly fractional, e.g. 22.3) → regression (continuous output)

Note: ages could be treated as 250 discrete year-classes → classification, but predicting 23.7 as a real number is the natural regression formulation.

Supervised Learning Quiz — reasoning:

Problem | Output type | Task |

|---------|————|------| | Lend money? | Binary yes/no | Classification | | Education level (HS/college/grad) | 3 discrete classes | Classification | | Age (22.3 years) | Continuous real | Regression |

For age: even though people don’t usually say “22.3 years old,” fractional ages make regression natural. Discretizing to integer years (1…250) would make it classification with a large but finite label set.

Key Terminology¶

Instances — input vectors (attributes/features): pixels, income, restaurant features, etc.
Concept — function mapping instances → outputs; intuitively, membership in a set (e.g. “maleness”, “creditworthiness”)
Formally: concept = mapping from input space to output space; intuitively = idea describing a set of things
Example: “tallness” maps objects → {tall, not tall}; “car” maps objects → {car, non-car}
Target concept — the true function we want to learn (the actual answer)
Hypothesis class — set of functions we are willing to consider
Could be all possible functions, but with finite data exhaustive search is impossible
Already restricted: excludes regression functions like \(y = x^2\)
For decision trees: all tree-structured functions over our attributes
Training set — labeled input/output pairs used to learn (inductive learning from examples)
Candidate — a hypothesized concept (a specific decision tree)
Testing set — held-out labeled data to evaluate generalization

Critical: training set and testing set must not overlap — testing on training data is “cheating” (memorization ≠ generalization).

Classification Learning — Worked Example¶

Creditworthiness example:

Instances: pictures of people, or financial attributes (income, etc.)
Target concept: creditworthy (true/false) — the function we want to learn
Hypothesis class: initially “all possible functions” is too large; decision trees restrict this

Inductive learning: instead of a dictionary definition of “creditworthy,” provide many labeled examples (input + correct output). Learner must generalize beyond seen examples.

Candidate concept example: “curly hair → creditworthy” — fails when testing set has curly-haired person who is not creditworthy.

Testing: apply candidate to held-out examples; count misclassifications = error. True target concept in this example: whether person smiles (illustrating that candidates can be wrong even on training-like data).

Generalization is the whole point of machine learning — final exams test new instances, not memorized facts.

Hypothesis class note: could be all possible functions, but finite data makes exhaustive search impossible. Decision trees restrict to tree-structured functions; already a subset of all functions (excludes e.g. \(y = x^2\) regression).

Contingency Tables¶

1-D contingency table = histogram
K-D contingency table — pick K attributes \((a_1, \ldots, a_k)\), count each value combination
Database terminology: k-dimensional data cube; OLAP (Online Analytical Processing) views slices and aggregates of database tables

Why contingency tables? Multi-dimensional view of data — understand relationships across several attributes simultaneously.

Combinatorics:

With n attributes, number of m-D contingency tables: \(\binom{n}{m}\) (n choose m)
With 100 attributes, 3-D tables:

\[100 C_{3} = (100 * 99 * 98) / (3 * 2 * 1) = 161700\]

Data Mining & Information Theory¶

Data mining — process of finding patterns in data; decide which patterns are interesting vs. illusory and how to exploit them. Information gain is the engine driving decision tree learning.

Information theory — originally for compression; now used in data mining. Information gain is a sub-topic.

Key quantities (from contingency-table analysis):

\(H(\text{wealth})\) — entropy of wealth attribute alone
\(H(\text{wealth} \mid \text{relation})\) — conditional entropy of wealth given relation
\(IG(\text{wealth} \mid \text{relation})\) — information gain from knowing relation when predicting wealth

High information gain on an attribute → that attribute is predictive of the target.

Learning Decision Trees¶

A decision tree is a tree-structured plan of attribute tests to predict output:

Find attribute with highest information gain
Test it at the current node
Recurse on each branch’s subset

Base cases for recursion:

All records have same output label
No attribute can further subdivide records
No information gain from any remaining attribute (XOR example — no single split helps)

Decision stump — a tree with only the root node (single attribute test). Useful baseline; full trees built by recursive splitting.

XOR base-case problem: if we stop when IG = 0, XOR never splits (every attribute gives zero gain) — illustrates why XOR is hard and why parity needs exponential trees.

Dating Example — Problem Setup¶

Classification task: enter restaurant (yes/no)?

Attributes (features about the situation):

Attribute | Type | Example values |

|-----------|——|----------------| | Type | multi-valued | Italian, French, Thai, American, Greek, … | | Atmosphere | multi-valued | fancy, casual, hole-in-wall | | Occupied | binary | yes / no | | Hot date | binary | yes / no | | Cost | discrete or numeric | cheap / moderate / expensive (or average entrée price) | | Hungry | binary | yes / no | | Raining | binary | yes / no |

Some features relate to the restaurant; others are external (weather, hunger, date context). Irrelevant features (e.g. “cars parked across the country”) should be excluded — tree building via IG naturally deprioritizes them.

Decision Tree Representation¶

Representation vs algorithm: first understand what a decision tree is (data structure); then learn the algorithm (ID3) to build one.

Structure:

Decision nodes (circles) — test an attribute; edges = possible attribute values
Leaf nodes (squares) — output label (yes/no, true/false)
Traverse top-down from root only — never bottom-up

Example path: “Hungry?” → yes → “Raining?” → no → Enter (true)

At leaves, T/F means the output answer, not an attribute value.

Concrete example tree uses only Occupied, Type (pizza/Thai/other), Happiness (patrons look happy):

Occupied = true → check type; pizza → no-go; Thai → go; other → check happiness
Occupied = false → no-go (always)
Hot date, hungry, raining irrelevant for this particular tree

The instance table serves as a testing set; the tree is a candidate concept.

Representation Quiz — Tracing Instances¶

Six restaurant scenarios; trace each from root (Occupied node):

|----------|———-|------|———–|------|——–| | 1 | true | pizza | true | occupied→type(pizza) | No go | | 2 | false | — | — | occupied(false) branch | No go | | 3 | true | Thai | — | occupied→type(Thai) | Go | | 4 | false | — | — | occupied(false) | No go | | 5 | true | Thai | — | occupied→type(Thai) | Go | | 6 | true | other | true | occupied→type(other)→happiness(true) | Go |

Common mistake: starting from a leaf (e.g. happiness node) instead of root — decision trees always evaluate top-down.

Only 3 of 6+ features matter for this tree; hot date, hungry, raining are irrelevant here.

Best Attribute¶

Given training instances labeled red (×) or green (○), rank three possible splits:

Rank | Split behavior | Assessment |

|------|—————|------------| | 1 (best) | All × on one side, all ○ on other | Perfect — done classifying | | 2 | Splits into smaller groups but ~50/50 mix on both sides | Some splitting, no label separation; arguably useless (may overfit) | | 3 (worst) | All data to one side | Does nothing — “kicked the can down the road” |

Good split = reduces uncertainty about labels.

Decision Tree Expressiveness¶

AND / OR / XOR recap (2 attributes):

Function | Min nodes | Separable by one test? |

|----------|———–|----------------------| | AND | 2 | Yes (one split) | | OR | 2 | Yes | | XOR | 3 | No — needs multiple levels |

Boolean functions over \(n\) attributes:

OR (any function) — if any attribute true, output true:

A1? ──true──→ YES
  └──false──→ A2? ──true──→ YES
                     └──false──→ A3? ──true──→ YES
                                    └──false──→ ... ──→ NO

Tree structure: test attributes sequentially until one is true
Size: :math:`O(n)` nodes — linear
Each attribute tested at most once along any path

XOR / Parity (odd function) — output true if odd number of attributes are true:

Must explore essentially all combinations
Size: :math:`O(2^n)` nodes — exponential, hard / evil
No clever attribute ordering avoids exhaustive branching
For \(n=3\): 7 internal nodes (suspiciously \(2^3 - 1\))
For \(n=4\): entire new level added → doubles again

XOR truth table (3 attributes):

A1 | A2 | A3 | XOR |

|----|—-|----|—–| | 0 | 0 | 0 | 0 | | 0 | 0 | 1 | 1 | | 0 | 1 | 0 | 1 | | 0 | 1 | 1 | 0 | | 1 | 0 | 0 | 1 | | 1 | 0 | 1 | 0 | | 1 | 1 | 0 | 0 | | 1 | 1 | 1 | 1 |

Each subtree at an internal node is itself a smaller parity problem.

Representation matters: adding derived attribute B = sum of all Aᵢ (treating true=1, false=0) reduces parity to a simple test on B — but requires choosing a good representation upfront (“cheating” in the strict sense, but good practice). AI adage: hardest problem is finding a good representation.

How Many Decision Trees?¶

Boolean inputs, Boolean output → analyze via truth table:

\(n\) Boolean attributes → \(2^n\) rows (all combinations of true/false)
Each row’s output can be true or false independently
Attribute ordering for internal nodes: combinatorial (roughly factorial in \(n\))
Leaf labels: exponential choices

Total: enormously large — far beyond exhaustive search. ID3’s greedy IG heuristic navigates this space efficiently.

Truth table row count examples:

Attributes \(n\) | Rows \(2^n\) |

|---------------------|—————–| | 1 | 2 | | 2 | 4 | | 3 | 8 | | \(n\) | \(2^n\) |

ID3 Algorithm¶

Greedy top-down tree builder:

function ID3(examples, attributes):
    if all examples same label → return leaf with that label
    if no attributes left → return leaf with majority label
    A ← attribute with maximum information gain
    create node N for attribute A
    for each value v of A:
        create branch for v
        examples_v ← subset of examples where A = v
        if examples_v empty → add leaf with majority label
        else → add subtree ID3(examples_v, attributes - {A})
    return tree rooted at N

ID3 in plain language:

Loop until solved
Pick best attribute (max information gain)
Create node; branch on each attribute value
Sort training examples into branches
If perfectly classified → stop that branch
Otherwise recurse on each branch’s subset

Information Gain

\[IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)\]

\(S\) — current training examples at this node
\(H(S)\) — entropy of class labels in \(S\)
\(H(S_v)\) — entropy of subset where attribute \(A = v\)
Choose attribute A that maximizes IG

Entropy¶

Measure of randomness / uncertainty in labels:

\[H(S) = -\sum_{v} P(v)\,\log P(v)\]

Intuition:

Fair coin (P=½ each): 1 bit of entropy — maximum uncertainty
Two-headed coin: 0 bits — outcome known in advance
All labels same → entropy = 0 (no uncertainty)
Even mix of × and ○ → maximum entropy for that set

Connecting to the three split examples:

All data one side → entropy unchanged → IG = 0
Split in half but same mix both sides → total entropy unchanged → IG = 0
Perfect separation → child entropies = 0 → IG = H(S) (maximum)

Entropy Example — Coin Flips¶

Coin type | P(heads) | Entropy |

|-----------|———-|---------| | Fair (two-sided) | ½ | 1 bit (max uncertainty) | | Two-headed | 1 | 0 bits (no uncertainty) | | Training set: 50% ×, 50% ○ | — | Maximum for that label set | | Training set: 100% × | — | 0 (no uncertainty) |

ID3 Inductive Bias¶

Two kinds of bias:

Restriction bias — hypothesis class = decision trees over discrete attributes (not all functions, not regression formulas like \(y = 2x\)).

Preference bias — among decision trees, ID3 prefers:

Good splits near the top (high information gain first)
Correct trees over incorrect ones
Shorter trees over longer ones (consequence of good top splits — reaches leaves faster)

Given two trees representing the same function, ID3 prefers the one with better top-level splits.

An attribute that puts all data on one side (zero IG) would create a longer, unnecessary tree — ID3 avoids this automatically.

Continuous Attributes¶

Problem: branching on every exact value → infinite branches for truly continuous variables; cannot handle unseen values at test time.

Naive approaches (and why they fail):

Approach | Problem |

|----------|———| | Branch per exact value (1.0, 1.1, 1.11, …) | Infinite branches | | Branch only on training-set values | Unseen test values don’t match any branch | | Use test set to pick splits | Cheating — test set must remain unseen |

Solution: binary splits on ranges:

e.g. age >= 20 AND age < 30 (inclusive of 20, exclusive of 30)
Represented as a single binary decision node

Finding split points:

Examine training set values only
If all values in 20s, no point splitting on 30s range
Binary search on training values: value < median vs value >= median
Evaluate each candidate split with information gain
Many candidates for continuous variables, but finite from training data

Worked example — age splits:

Training ages: {22, 25, 28, 35, 41}
Candidate splits: < 25, < 28, < 35, etc.
Each creates two branches evaluated by IG

Repeating attributes on a path:

Attribute type | Repeat on same path? | Reason |

|---------------|———————|--------| | Discrete | No | Same question → zero information gain | | Continuous | Yes (different ranges) | “Age in 20s?” vs “Age ≥ 40?” are different questions |

Quiz: repeat attribute on same path?

Discrete (e.g. A = true? then later A = true? again): No — zero information gain; IG automatically deprioritizes
Continuous (e.g. age in 20s? then age >= 44?): Yes — different questions about same underlying variable

Other Considerations¶

When to stop?

All training examples correctly classified — but problematic with noisy data (conflicting labels on same instance → never stops)
Run out of attributes
Need better criteria for continuous attributes (infinite possible splits)

Overfitting

Memorizing every training quirk → excellent training accuracy, poor generalization. Violates Occam’s Razor.

When noise exists (mislabeled examples, duplicate instances with conflicting labels):

“Stop when all correct” → never terminates
Must use alternative stopping criteria

Mitigations:

Hold-out validation set / cross-validation * Build tree on training data (excluding validation) * Evaluate on validation set * Pick tree with lowest validation error * Avoids needing a separate stopping rule during growth
Early stopping with validation * Before each expansion, check validation error so far * Stop if error is low enough or starts increasing * Use breadth-first expansion so all branches grow evenly (depth-first can over-expand one branch before testing others)
Pruning (post-hoc, efficient) * Build full tree as if overfitting weren’t a concern * From leaves upward: collapse subtrees into single nodes * Keep collapse if validation error increase is small * Discard collapse if validation error jumps too much * Produces smaller tree; simple addition to standard ID3 * Many pruning variants exist

Stopping criteria summary:

Criterion | Handles noise? | Notes |

|-----------|—————|-------| | All training examples correct | No | Infinite loop with conflicting labels | | No attributes left | Partially | Forced stop | | No information gain | Partially | XOR never splits | | Validation error threshold | Yes | Preferred with continuous attrs | | Pruning post-hoc | Yes | Build full tree first |

Regression Trees¶

So far: discrete outputs with information gain. For regression (continuous outputs):

Split criterion: e.g. variance reduction instead of entropy (measures spread of continuous values)
Leaf prediction: average of training values in leaf, or linear fit
Classification leaves: majority vote when labels mixed
Pruning and validation apply similarly

Leaf output when labels mixed (classification): majority vote. With pruning, even impure leaves need an output — vote generalizes best for maximizing correct classifications.

Regression vs Classification Trees¶

Aspect | Classification tree | Regression tree |

|--------|——————–|-----------------| | Output | Discrete label | Continuous value | | Split criterion | Information gain (entropy) | Variance reduction | | Leaf value | Majority vote | Average (or linear fit) | | Pruning metric | Classification error | Squared error / variance | | Mixed leaves | Vote ≈ average for binary | Average of continuous values |

Basic Decision Tree — Quick Reference¶

Build (ID3):

Start with all examples at root
Choose attribute with max information gain
Split; recurse on each branch
Stop at pure leaf, no attributes, or validation/pruning criterion

Use:

Start at root
Follow edge matching instance’s attribute value
Repeat until leaf → output label

Key formulas:

\[H(S) = -\sum_v P(v) \log P(v)\]

\[IG(S, A) = H(S) - \sum_v \frac{|S_v|}{|S|} H(S_v)\]

\[\binom{n}{m} \text{ contingency tables from } n \text{ attributes, dimension } m\]

Summary¶

Decision trees — interpretable top-down attribute tests; root-to-leaf path = series of decisions
ID3 — greedy recursive builder; best attribute = max information gain (entropy reduction)
Expressiveness — any Boolean function representable; OR is linear, parity/XOR is exponential
Bias — restricted to trees; prefers correct, short trees with informative top splits
Continuous inputs — range-based binary splits from training data; can repeat attribute with different ranges
Overfitting — pruning / validation; don’t trust perfect training classification with noisy data
Regression variant — variance-based splits; average/vote at leaves

References¶

Andrew Ng’s tutorial — self-contained material on decision trees
Weka Tutorial
Information Gain

Further topics (assignments): advanced pruning, multi-class trees, missing values, cost-sensitive splits.