Bayes Nets¶
Bayes nets combine probability with efficient graph structures to model how uncertain variables influence each other.
Notation: Capital letters denote variables; lowercase denotes the variable being true; a leading \(\neg\) denotes false.
Structure and Parameters¶
A Bayes network is a directed acyclic graph (DAG) where:
Each node is a random variable with a conditional probability table (CPT)
Edges encode direct causal/probabilistic dependencies
The full joint distribution factorizes as:
For a single binary variable with one binary parent, we need three parameters: \(P(\text{evidence})\), \(P(\text{child} | \text{parent})\), and \(P(\text{child} | \neg\text{parent})\).
Computing Bayes Rule¶
Compute the unnormalized posterior, then normalize:
This avoids computing \(P(B)\) directly — the normalizer is recovered from the unnormalized terms themselves.
Two-Test Cancer Example¶
Given \(P(C) = 0.01\), \(P(+|C) = 0.9\), \(P(+|\neg C) = 0.2\), and two conditionally independent positive tests:
P(C | ++) = ?
P'(C|++) = P(+|C) · P(+|C) · P(C) = 0.9 × 0.9 × 0.01
P'(¬C|++) = P(+|¬C) · P(+|¬C) · P(¬C) = 0.2 × 0.2 × 0.99
P(C|++) = P'(C|++) / [P'(C|++) + P'(¬C|++)]
n1 = 0.9 × 0.9 × 0.01
d1 = 0.2 × 0.2 × 0.99
P(C|++) = n1 / (n1 + d1) = 0.1698
Two positive tests raise the posterior from ~4.3% (one test) to ~17%.
Conditional Independence¶
Conditional independence is the central structural property of Bayes networks.
\(X \perp Y \mid Z\) means \(X\) and \(Y\) are conditionally independent given \(Z\):
Key pattern — common cause (\(B \leftarrow A \rightarrow C\)):
Without observing \(A\): \(B\) and \(C\) are independent
Given \(A\): \(B\) and \(C\) become dependent (both conditioned on the common cause)
To verify conditional independence computationally, apply total probability to marginalize hidden variables and check whether the joint factors into marginals.
Absolute vs. Conditional Independence¶
Absolute independence (\(X \perp Y\)) and conditional independence (\(X \perp Y \mid Z\)) are distinct properties. Neither implies the other.
Confounding Cause¶
When two effects share a common hidden cause (fork: \(B \leftarrow A \rightarrow C\)), the effects appear correlated. Conditioning on the cause \(A\) renders them independent.
Explaining Away¶
When two independent causes produce a common effect (collider: \(A \rightarrow C \leftarrow B\)):
Before observing \(C\): \(A \perp B\) (causes are marginally independent)
After observing \(C\): \(A \not\perp B \mid C\) (knowing the effect makes the causes dependent — observing one “explains away” the other)
This extends transitively: observing any descendant of the collider also activates the dependency between the causes.
D-Separation¶
D-separation determines conditional independence in a Bayes network from graph structure alone.
Triplet types (for path \(X - Z - Y\)):
Structure |
\(Z\) unknown |
\(Z\) known |
|---|---|---|
Causal chain: \(X \to Z \to Y\) |
Active (dependent) |
Inactive (independent) |
Common cause: \(X \leftarrow Z \to Y\) |
Active (dependent) |
Inactive (independent) |
Common effect: \(X \to Z \leftarrow Y\) |
Inactive (independent) |
Active (dependent) |
Rules:
Active triplets → variables are dependent
Inactive triplets → variables are independent
Two nodes are d-separated (conditionally independent) given evidence \(E\) iff every undirected path between them contains at least one inactive triplet
Probabilistic Inference¶
Given a Bayes network, inference answers queries of the form \(P(Q \mid E = e)\) where:
Query variables \(Q\) — what we want to compute
Evidence variables \(E\) — what we observe
Hidden variables \(H\) — neither query nor evidence; must be marginalized out
The output is a probability distribution over the query variables.
Inference is direction-sensitive: networks flowing from causes to effects are easier to reason about.
Enumeration¶
Enumerate all possible assignments to hidden variables:
where \(\alpha\) is the normalizing constant and the product is over all variables evaluated at their assigned values.
Define \(f(e, a)\) as the product of all relevant CPT entries for a particular assignment. The final answer sums \(f\) over all \(2^n\) joint assignments of \(n\) binary hidden variables.
Optimization — pulling out terms: Factor terms that don’t depend on the innermost summation variable outside the sum. This reduces the cost per row (though the number of rows stays the same).
Variable Elimination¶
Exact inference is NP-hard in general. Variable elimination improves on enumeration by:
Choosing a variable to eliminate (marginalize out)
Computing a factor by summing over that variable in relevant CPTs
Replacing the original terms with the smaller factor
Repeating until only query variables remain
Each elimination step shrinks the effective network. With a good elimination ordering, variable elimination is significantly more efficient than full enumeration — though finding the optimal ordering is itself NP-hard.
Approximate Inference — Sampling¶
When exact inference is intractable, sampling methods provide approximate answers.
Direct Sampling¶
Sample each variable in topological order from its CPT given its parents’ sampled values. Count outcomes to estimate any probability.
In the limit of infinite samples, estimates converge to true probabilities (consistency)
Works for full joint distributions or individual marginals
Does not require knowing the conditional probabilities analytically — just the ability to sample from the process
Rejection Sampling¶
To estimate \(P(Q \mid E = e)\): generate samples from the full joint, reject any sample where evidence variables don’t match \(e\), and count query outcomes in accepted samples.
Problem: When evidence is unlikely, most samples are rejected — exponentially inefficient.
Likelihood Weighting¶
Fix evidence variables to their observed values; sample only non-evidence variables. Weight each sample by:
Use weighted counts to estimate the posterior. This is a consistent estimator and avoids rejecting any samples.
Limitation: Only accounts for evidence upstream of query variables in the sampling order; downstream evidence is handled through weights, not the sampling distribution itself.
Gibbs Sampling¶
A Markov Chain Monte Carlo (MCMC) method that accounts for all evidence:
Initialize all non-evidence variables randomly
Repeat: pick one non-evidence variable, resample it conditioned on its Markov blanket (parents, children, and children’s other parents)
After a burn-in period, collect samples
Properties:
Considers both upstream and downstream evidence via the Markov blanket
Successive samples are correlated (not independent)
Consistent — converges to the true distribution in the limit