CS B553: ALGORITHMS FOR
OPTIMIZATION AND LEARNING
Structure Learning
AGENDA
Learning probability distributions from example
data
To what extent can Bayes net structure be
learned?
Constraint methods (inferring conditional
independence)
Scoring methods (learning => optimization)
BASIC QUESTION
Given examples drawn from a distribution P* whose independence relations are given by the Bayesian network structure G*, can we recover G*? More realistically: can we construct a network that encodes the same independence relations as G*?
[Figure: example networks G*, G1, G2 encoding the same independence relations]
LEARNING IN THE FACE OF NOISY DATA
Ex: flip two independent coins X and Y
Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT
Model 1: X and Y with no edge (independent)
Model 2: X → Y
ML parameters
Model 1: P(X=H) = 9/20, P(Y=H) = 8/20
Model 2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11
Errors in Model 2's estimates are likely to be larger!
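These maximum-likelihood estimates can be reproduced directly from the counts; the snippet below is a minimal sketch (variable names are illustrative, not from the slides).

```python
# Counts from the 20-flip dataset: (X, Y) -> count
counts = {("H", "H"): 3, ("H", "T"): 6, ("T", "H"): 5, ("T", "T"): 6}
M = sum(counts.values())                                         # 20 samples

# Model 1 (no edge): marginal ML estimates
p_x_h = sum(c for (x, _), c in counts.items() if x == "H") / M   # 9/20
p_y_h = sum(c for (_, y), c in counts.items() if y == "H") / M   # 8/20

# Model 2 (X -> Y): conditional ML estimates P(Y=H | X)
n_x_h = sum(c for (x, _), c in counts.items() if x == "H")       # 9
p_y_h_given_x_h = counts[("H", "H")] / n_x_h                     # 3/9
p_y_h_given_x_t = counts[("T", "H")] / (M - n_x_h)               # 5/11

print(p_x_h, p_y_h, p_y_h_given_x_h, p_y_h_given_x_t)
```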
PRINCIPLE
Learning structure must trade off fit of data vs.
complexity of network
Complex networks
More parameters to learn
More data fragmentation = greater sensitivity to
noise
APPROACH #1: CONSTRAINT-BASED
LEARNING
First, identify an undirected skeleton of edges in
G*
If an edge X-Y is in G*, then no subset of evidence
variables can make X and Y independent
If X-Y is not in G*, then we can find evidence
variables to make X and Y independent
Then, assign directionality to preserve
independences
BUILD-SKELETON ALGORITHM
Given X = {X1, …, Xn} and a query Independent?(X, Y, U):
H = complete graph over X
For all pairs Xi, Xj, test separation as follows:
Enumerate all possible separating sets U
If Independent?(Xi, Xj, U) then remove Xi—Xj from H
In practice:
• Must restrict to subsets of bounded size |U| ≤ d (i.e., assume G* has bounded degree). O(n²(n−2)^d) tests
• Independence can't be tested exactly
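A minimal sketch of this procedure in Python, assuming a caller-supplied independence oracle `independent(x, y, U)` (in practice a statistical test, as discussed later) and a bound d on |U|:

```python
from itertools import combinations

def build_skeleton(variables, independent, d):
    """Return (skeleton edges, separating sets).

    `independent(x, y, U)` is an assumed oracle answering whether x and y are
    independent given the set U; `d` bounds |U| (bounded-degree assumption).
    """
    edges = {frozenset(pair) for pair in combinations(variables, 2)}  # complete graph H
    sepset = {}
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        # Enumerate candidate separating sets U with |U| <= d
        for k in range(d + 1):
            for U in combinations(others, k):
                if independent(x, y, set(U)):
                    edges.discard(frozenset((x, y)))   # remove edge x - y from H
                    sepset[frozenset((x, y))] = set(U)
                    break
            else:
                continue
            break   # a separating set was found; move on to the next pair
    return edges, sepset
```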
ASSIGNING DIRECTIONALITY
Note that v-structures X→Y←Z introduce a dependency between X and Z given Y
In the structures X→Y→Z, X←Y←Z, and X←Y→Z, X and Z are independent given Y
In fact, Y must be given for X and Z to be independent
Idea: look at the separating sets for all triples X—Y—Z in the skeleton without the edge X—Z
Three cases:
• Triangle (edge X—Z present): directionality is irrelevant
• Y is in the set that separates X and Z: not a v-structure
• Y is not in the set U that separates X and Z: a v-structure X→Y←Z
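Continuing the sketch above, the v-structure rule can be applied to the skeleton and separating sets as follows (again an illustrative simplification, not the lecture's code):

```python
from itertools import combinations

def orient_v_structures(variables, edges, sepset):
    """Return triples (x, y, z) to orient as v-structures x -> y <- z.

    `edges` is a set of frozensets {x, y} (the skeleton); `sepset` maps each
    non-adjacent pair to the set that separated it (see build_skeleton above).
    """
    v_structures = set()
    for x, z in combinations(variables, 2):
        if frozenset((x, z)) in edges:
            continue                       # x and z adjacent: no triple to examine
        for y in variables:
            if y in (x, z):
                continue
            # need the path x - y - z in the skeleton
            if frozenset((x, y)) in edges and frozenset((y, z)) in edges:
                # Y absent from the separating set of (X, Z)  =>  v-structure
                if y not in sepset.get(frozenset((x, z)), set()):
                    v_structures.add((x, y, z))
    return v_structures
```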
STATISTICAL INDEPENDENCE TESTING
Question: are X and Y independent?
Null hypothesis H0: X and Y are independent
Alternative hypothesis HA: X and Y are not
independent
χ² test: use the statistic
$$\chi^2 = \sum_{x,y} \frac{\big(M[X=x, Y=y] - M\,\hat{P}(x)\,\hat{P}(y)\big)^2}{M\,\hat{P}(x)\,\hat{P}(y)}$$
with $\hat{P}(x) = M[X=x]/M$ the empirical probability of X (similarly for $\hat{P}(y)$), where $M[\cdot]$ counts occurrences in the dataset and M is the number of samples
Can compute (table lookup) the probability of getting a value at least this extreme if H0 is true (p-value)
If p < some threshold, e.g. 1 − 0.95 = 0.05, H0 is rejected
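A minimal sketch of this test in Python (the function name and interface are illustrative; scipy's chi2 distribution is used only for the p-value lookup):

```python
import numpy as np
from scipy.stats import chi2

def chi2_independence_test(table, alpha=0.05):
    """Test H0: X independent of Y from a contingency table (rows = x, cols = y)."""
    table = np.asarray(table, dtype=float)
    M = table.sum()
    p_x = table.sum(axis=1) / M           # empirical P(x)
    p_y = table.sum(axis=0) / M           # empirical P(y)
    expected = M * np.outer(p_x, p_y)     # M * P(x) * P(y)
    stat = ((table - expected) ** 2 / expected).sum()
    dof = (table.shape[0] - 1) * (table.shape[1] - 1)
    p_value = chi2.sf(stat, dof)          # P(value at least this extreme | H0)
    return stat, p_value, p_value >= alpha   # True => independence not rejected

# Coin example: rows X in {H, T}, columns Y in {H, T}
print(chi2_independence_test([[3, 6], [5, 6]]))
```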
APPROACH #2: SCORE-BASED METHODS
Learning => optimization
Define scoring function Score(G;D) that evaluates
quality of structure G, and optimize it
Combinatorial optimization problem
Issues:
Choice of scoring function: maximum likelihood score,
Bayesian score
Efficient optimization techniques
MAXIMUM-LIKELIHOOD SCORES
Score_L(G;D) = likelihood of the BN with the most likely parameter settings under structure G
Let L(θ_G, G; D) be the likelihood of the data using parameters θ_G with structure G
Let θ_G* = arg max_θ L(θ, G; D), as described in the last lecture
Then Score_L(G;D) = L(θ_G*, G; D)
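A sketch of how this score can be computed for fully observed discrete data (an illustrative helper, not lecture code): with the ML parameters plugged in, the log-likelihood reduces to a sum of counts times logs of empirical conditional probabilities.

```python
import math
from collections import Counter

def ml_score(data, parents):
    """log L(theta_G*, G; D) for fully observed discrete data.

    `data` is a list of dicts {variable: value}; `parents` maps each variable
    to a tuple of its parent variables under structure G.
    """
    score = 0.0
    for var, pa in parents.items():
        joint = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        pa_counts = Counter(tuple(row[p] for p in pa) for row in data)
        for (pa_val, val), n in joint.items():
            # ML parameters are the empirical conditionals n / N(pa_val)
            score += n * math.log(n / pa_counts[pa_val])
    return score

# Coin example: G1 has no edges, G2 has X -> Y
data = [{"X": "H", "Y": "H"}] * 3 + [{"X": "H", "Y": "T"}] * 6 \
     + [{"X": "T", "Y": "H"}] * 5 + [{"X": "T", "Y": "T"}] * 6
print(ml_score(data, {"X": (), "Y": ()}))       # structure G1
print(ml_score(data, {"X": (), "Y": ("X",)}))   # structure G2 (never smaller)
```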
ISSUE WITH ML SCORE
Independent coin example
G1: X  Y (no edge)    G2: X → Y
ML parameters
G1: P(X=H) = 9/20, P(Y=H) = 8/20
G2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11
Likelihood scores
log L(θ_G1*, G1; D) = 9 log(9/20) + 11 log(11/20) + 8 log(8/20) + 12 log(12/20)
log L(θ_G2*, G2; D) = 9 log(9/20) + 11 log(11/20) + 3 log(3/9) + 6 log(6/9) + 5 log(5/11) + 6 log(6/11)
Difference:
log L(θ_G1*, G1; D) − log L(θ_G2*, G2; D)
= 8 log(8/20) + 12 log(12/20) − [3 log(3/9) + 6 log(6/9) + 5 log(5/11) + 6 log(6/11)]
= 8 log(8/20) + 12 log(12/20) − 8 [3/8 log(3/9) + 5/8 log(5/11)] − 12 [6/12 log(6/9) + 6/12 log(6/11)]
= 8 [log(8/20) − 3/8 log(3/9) − 5/8 log(5/11)] + 12 [log(12/20) − 6/12 log(6/9) − 6/12 log(6/11)]
$= 20 \sum_y \hat{P}(y)\big[\log \hat{P}(y) - \sum_x \hat{P}(x|y) \log \hat{P}(y|x)\big]$
$= 20 \sum_{x,y} \hat{P}(x,y)\big[\log \hat{P}(y) - \log \hat{P}(y|x)\big]$
$= -20\, I_{\hat{P}}(X;Y) \le 0$, the empirical mutual information (next slide)
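The last identity can be checked numerically for the coin dataset: the gap between the two ML scores equals −M times the empirical mutual information (a small sanity check in Python).

```python
import math

# Empirical distributions from the 20-flip dataset
counts = {("H", "H"): 3, ("H", "T"): 6, ("T", "H"): 5, ("T", "T"): 6}
M = sum(counts.values())
p_x = {"H": 9 / 20, "T": 11 / 20}
p_y = {"H": 8 / 20, "T": 12 / 20}

# Log-likelihood difference between G1 (independent) and G2 (X -> Y)
log_l1 = sum(c * math.log(p_x[x] * p_y[y]) for (x, y), c in counts.items())
log_l2 = sum(c * math.log(p_x[x] * c / (counts[(x, "H")] + counts[(x, "T")]))
             for (x, y), c in counts.items())
diff = log_l1 - log_l2

# Empirical mutual information I(X;Y)
mi = sum((c / M) * math.log((c / M) / (p_x[x] * p_y[y]))
         for (x, y), c in counts.items())

print(diff, -M * mi)   # both print the same (negative) number
```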
MUTUAL INFORMATION PROPERTIES
$I_P(X;Y)$ (the mutual information between X and Y):
$$I_P(X;Y) = \sum_{x,y} P(x,y) \log \frac{P(y|x)}{P(y)} = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)} = D(P \,\|\, Q) \quad \text{with } Q(x,y) = P(x)P(y)$$
$\ge 0$ by nonnegativity of the KL divergence
Implication: ML scores do not decrease for more connected graphs
=> Overfitting to data!
POSSIBLE SOLUTIONS
Fix complexity of graphs (e.g., bounded in-degree)
See HW7
Penalize complex graphs
Bayesian scores
IDEA OF BAYESIAN SCORING
Note that parameters are uncertain
Bayesian approach: put a prior on parameter values and marginalize them out
$$P(D|G) = \int_{\theta_G} P(D \mid \theta_G, G)\, P(\theta_G \mid G)\, d\theta_G$$
For example, use Beta/Dirichlet priors => the marginal is manageable to compute
E.g., uniform hyperparameter over the network: set virtual counts to $2^{-|Pa_{X_i}|}$
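With Dirichlet priors on each row of a variable's CPT, the marginal likelihood contribution of one family (a variable and its parents) has a closed form in terms of Gamma functions; the sketch below assumes user-chosen virtual counts alpha (a standard BDe-style computation, not code from the lecture).

```python
from math import lgamma

def family_marginal_log_likelihood(counts, alpha):
    """log marginal likelihood contribution of one variable X given its parents.

    `counts[pa][x]` are observed counts of X = x for each parent configuration pa;
    `alpha[pa][x]` are the corresponding Dirichlet virtual counts.
    """
    total = 0.0
    for pa, row in counts.items():
        a_sum = sum(alpha[pa].values())
        n_sum = sum(row.values())
        total += lgamma(a_sum) - lgamma(a_sum + n_sum)
        for x, n in row.items():
            total += lgamma(alpha[pa][x] + n) - lgamma(alpha[pa][x])
    return total

# Coin example: variable Y with parent X, one virtual count per CPT entry
y_counts = {"H": {"H": 3, "T": 6}, "T": {"H": 5, "T": 6}}
y_alpha = {"H": {"H": 1.0, "T": 1.0}, "T": {"H": 1.0, "T": 1.0}}
print(family_marginal_log_likelihood(y_counts, y_alpha))
```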
LARGE SAMPLE APPROXIMATION
log P(D|G) = log L(θ_G*; D) − ½ (log M) Dim[G] + O(1)
with M the number of samples and Dim[G] the number of free parameters of G
Bayesian Information Criterion (BIC) score:
Score_BIC(G;D) = log L(θ_G*; D) − ½ (log M) Dim[G]
The first term rewards fit to the data set; the second prefers simple models
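A minimal BIC computation, reusing the ml_score helper and coin data sketched earlier and assuming the usual free-parameter count for discrete variables:

```python
import math

def dim_g(parents, num_values):
    """Dim[G]: number of free parameters of G for discrete variables."""
    dim = 0
    for var, pa in parents.items():
        parent_configs = 1
        for p in pa:
            parent_configs *= num_values[p]
        dim += parent_configs * (num_values[var] - 1)
    return dim

def bic_score(data, parents, num_values):
    """Score_BIC(G; D) = log L(theta_G*; D) - (1/2) log M * Dim[G]."""
    M = len(data)
    return ml_score(data, parents) - 0.5 * math.log(M) * dim_g(parents, num_values)

# Coin example (data, ml_score as above): BIC prefers the edge-free model G1
num_values = {"X": 2, "Y": 2}
print(bic_score(data, {"X": (), "Y": ()}, num_values))
print(bic_score(data, {"X": (), "Y": ("X",)}, num_values))
```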
STRUCTURE OPTIMIZATION, GIVEN A
SCORE…
The problem is well-defined, but combinatorially
complex!
Superexponential in # of variables
Idea: search locally through the space of graphs
using graph operators
Add edge
Delete edge
Reverse edge
SEARCH STRATEGIES
Greedy
Pick the operator that leads to the greatest score improvement
Local maxima? Plateaux?
Overcoming plateaux
Search with basin flooding
Tabu search
Perturbation methods (similar to simulated annealing, except on data weighting)
Implementation details:
Evaluate Δ's between structures quickly (local decomposability)
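A minimal sketch of greedy hill climbing with the add/delete/reverse operators, scored with the bic_score helper above; a real implementation would exploit local decomposability to evaluate the score Δ's locally instead of rescoring the whole network each time.

```python
from itertools import permutations

def is_acyclic(parents):
    """Check that a parent map defines a DAG (depth-first cycle check)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in parents}
    def visit(v):
        color[v] = GRAY
        for p in parents[v]:
            if color[p] == GRAY:                     # back edge => cycle
                return False
            if color[p] == WHITE and not visit(p):
                return False
        color[v] = BLACK
        return True
    return all(visit(v) for v in parents if color[v] == WHITE)

def neighbors(parents):
    """Yield parent maps reachable by one add / delete / reverse edge operation."""
    for x, y in permutations(parents, 2):
        if x in parents[y]:
            deleted = dict(parents)
            deleted[y] = tuple(p for p in parents[y] if p != x)   # delete x -> y
            yield deleted
            reversed_ = dict(deleted)
            reversed_[x] = parents[x] + (y,)                      # reverse to y -> x
            yield reversed_
        else:
            added = dict(parents)
            added[y] = parents[y] + (x,)                          # add x -> y
            yield added

def greedy_search(data, variables, num_values):
    """Greedy hill climbing from the empty graph, scored with BIC."""
    parents = {v: () for v in variables}
    best = bic_score(data, parents, num_values)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(parents):
            if not is_acyclic(cand):
                continue
            s = bic_score(data, cand, num_values)
            if s > best:
                best, parents, improved = s, cand, True
                break   # restart the scan from the improved structure
    return parents, best

# Coin example: the search keeps the empty graph (BIC penalizes the extra edge)
print(greedy_search(data, ["X", "Y"], num_values))
```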
RECAP
Bayes net structure learning: at best we recover the equivalence class of networks that encode the same conditional independences
Constraint-based methods
Statistical independence tests
Score-based methods
Learning => optimization