
CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Structure Learning
AGENDA

• Learning probability distributions from example data
• To what extent can Bayes net structure be learned?
• Constraint methods (inferring conditional independence)
• Scoring methods (learning => optimization)

BASIC QUESTION

Given examples drawn from a distribution P* with independence relations given by the Bayesian structure G*, can we recover G*? More precisely: can we construct a network that encodes the same independence relations as G*?

[Figure: a structure G* alongside two equivalent networks G1 and G2]
LEARNING IN THE FACE OF NOISY DATA

Example: flip two independent coins X and Y
• Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT

Model 1: X and Y unconnected        Model 2: X → Y

ML parameters:
• Model 1: P(X=H) = 9/20, P(Y=H) = 8/20
• Model 2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11

The errors in Model 2's conditional estimates are likely to be larger!
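To make the numbers above concrete, here is a small Python sketch (variable names are illustrative, not from the slides) that recovers the ML parameters of both models from the 20-flip dataset:

```python
from collections import Counter

# Dataset of 20 flips of (X, Y): 3 HH, 6 HT, 5 TH, 6 TT
data = [("H", "H")] * 3 + [("H", "T")] * 6 + [("T", "H")] * 5 + [("T", "T")] * 6
M = len(data)
counts = Counter(data)

# Model 1: X and Y independent -- one parameter per coin
p_x_h = sum(1 for x, _ in data if x == "H") / M            # 9/20
p_y_h = sum(1 for _, y in data if y == "H") / M            # 8/20

# Model 2: X -> Y -- P(Y|X) estimated separately for each value of X
n_x_h = sum(1 for x, _ in data if x == "H")                # 9
p_y_h_given_x_h = counts[("H", "H")] / n_x_h               # 3/9
p_y_h_given_x_t = counts[("T", "H")] / (M - n_x_h)         # 5/11

print(p_x_h, p_y_h, p_y_h_given_x_h, p_y_h_given_x_t)
```

Each conditional estimate in Model 2 is based on only 9 or 11 flips rather than all 20, which is why its errors tend to be larger.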
PRINCIPLE

Learning structure must trade off fit of the data vs. complexity of the network
• Complex networks:
  • More parameters to learn
  • More data fragmentation = greater sensitivity to noise
APPROACH #1: CONSTRAINT-BASED LEARNING

First, identify an undirected skeleton of edges in G*
• If an edge X–Y is in G*, then no subset of evidence variables can make X and Y independent
• If X–Y is not in G*, then we can find evidence variables that make X and Y independent

Then, assign directionality to preserve independences
BUILD-SKELETON ALGORITHM

Given X = {X1, …, Xn} and a query Independent?(X, Y, U):
• H = complete graph over X
• For all pairs Xi, Xj, test separation as follows:
  • Enumerate all possible separating sets U
  • If Independent?(Xi, Xj, U), then remove Xi–Xj from H

In practice:
• Must restrict to bounded-size subsets |U| ≤ d (i.e., assume G* has bounded degree): O(n^2 (n−2)^d) tests
• Independence can't be tested exactly
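A minimal Python sketch of Build-Skeleton, assuming an independence oracle `independent(x, y, U)` is supplied (in practice this would be a statistical test, discussed below); the function name and signature are illustrative assumptions:

```python
from itertools import combinations

def build_skeleton(variables, independent, d):
    """Return an undirected skeleton and the separating sets found.

    variables:   list of variable names
    independent: callable independent(x, y, U) -> bool (an oracle or statistical test)
    d:           maximum size of the conditioning sets considered
    """
    edges = {frozenset((x, y)) for x, y in combinations(variables, 2)}  # complete graph H
    sepset = {}
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        found = False
        for k in range(d + 1):                      # candidate separating sets of size 0..d
            for U in combinations(others, k):
                if independent(x, y, set(U)):
                    edges.discard(frozenset((x, y)))
                    sepset[frozenset((x, y))] = set(U)
                    found = True
                    break
            if found:
                break
    return edges, sepset
```

The separating sets are kept because the directionality step below needs them.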
ASSIGNING DIRECTIONALITY

• Note that v-structures X → Y ← Z introduce a dependency between X and Z given Y
• In structures X → Y → Z, X ← Y ← Z, and X ← Y → Z, X and Z are independent given Y
  • In fact, Y must be given for X and Z to be independent

Idea: look at the separating sets for all triples X–Y–Z in the skeleton without edge X–Z
• Triangle (edge X–Z present): directionality is irrelevant
• Y separates X and Z (or Y belongs to some separating set U): not a v-structure
• Y belongs to no separating set of X and Z: a v-structure X → Y ← Z
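A sketch of this orientation rule, reusing the skeleton and separating sets returned by the Build-Skeleton sketch above; the helper name and data layout are assumptions for illustration:

```python
def orient_v_structures(edges, sepset):
    """For each unshielded triple X - Y - Z (edge X-Z absent from the skeleton),
    orient X -> Y <- Z iff Y is not in the set that separated X and Z."""
    nodes = {v for e in edges for v in e}
    arcs = set()                                    # collected (parent, child) pairs
    for y in sorted(nodes):
        nbrs = sorted(v for v in nodes if frozenset((v, y)) in edges)
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                x, z = nbrs[i], nbrs[j]
                if frozenset((x, z)) in edges:
                    continue                        # triangle: directionality irrelevant here
                if y not in sepset.get(frozenset((x, z)), set()):
                    arcs.add((x, y))                # X -> Y
                    arcs.add((z, y))                # Z -> Y
    return arcs
```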
STATISTICAL INDEPENDENCE TESTING

Question: are X and Y independent?
• Null hypothesis H0: X and Y are independent
• Alternative hypothesis HA: X and Y are not independent

χ² test: use the statistic

\[ \chi^2 = \sum_{x,y} \frac{\big(M[X=x, Y=y] - M\,\hat{P}(x)\,\hat{P}(y)\big)^2}{M\,\hat{P}(x)\,\hat{P}(y)} \quad \text{with } \hat{P}(x) = \frac{M[X=x]}{M} \]

the empirical probability of X (M is the number of samples and M[·] a count in the data)
• Can compute (table lookup) the probability of getting a value at least this extreme if H0 is true (the p-value)
• If p < some threshold, e.g. 1 − 0.95 = 0.05, H0 is rejected
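A short Python version of this χ² test on the coin dataset from earlier; the statistic follows the formula above, and SciPy's `chi2.sf` supplies the tail probability (the p-value):

```python
from collections import Counter
from scipy.stats import chi2

data = [("H", "H")] * 3 + [("H", "T")] * 6 + [("T", "H")] * 5 + [("T", "T")] * 6
M = len(data)
counts = Counter(data)

xvals, yvals = ["H", "T"], ["H", "T"]
p_x = {v: sum(c for (x, _), c in counts.items() if x == v) / M for v in xvals}
p_y = {v: sum(c for (_, y), c in counts.items() if y == v) / M for v in yvals}

# chi^2 = sum_{x,y} (M[x,y] - M P(x)P(y))^2 / (M P(x)P(y))
stat = sum((counts[(x, y)] - M * p_x[x] * p_y[y]) ** 2 / (M * p_x[x] * p_y[y])
           for x in xvals for y in yvals)

dof = (len(xvals) - 1) * (len(yvals) - 1)       # = 1 for two binary variables
p_value = chi2.sf(stat, dof)                    # P(value at least this extreme | H0)
print(stat, p_value, p_value < 0.05)            # H0 (independence) is not rejected here
```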
APPROACH #2: SCORE-BASED METHODS

Learning => optimization
• Define a scoring function Score(G;D) that evaluates the quality of structure G, and optimize it
• Combinatorial optimization problem

Issues:
• Choice of scoring function: maximum-likelihood score, Bayesian score
• Efficient optimization techniques
MAXIMUM-LIKELIHOOD SCORES

ScoreL(G;D) = likelihood of the BN with the most likely parameter settings under structure G
• Let L(θG, G; D) be the likelihood of the data using parameters θG with structure G
• Let θG* = arg maxθ L(θ, G; D), as described in the last lecture
• Then ScoreL(G;D) = L(θG*, G; D)
ISSUE WITH ML SCORE

Independent coin example:
G1: X and Y unconnected        G2: X → Y

ML parameters:
• G1: P(X=H) = 9/20, P(Y=H) = 8/20
• G2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11

Likelihood scores:
log L(θG1*, G1; D) = 9 log(9/20) + 11 log(11/20) + 8 log(8/20) + 12 log(12/20)
log L(θG2*, G2; D) = 9 log(9/20) + 11 log(11/20) + 3 log(3/9) + 6 log(6/9) + 5 log(5/11) + 6 log(6/11)

Difference:
log L(θG1*, G1; D) − log L(θG2*, G2; D)
  = 8 log(8/20) + 12 log(12/20) − [3 log(3/9) + 6 log(6/9) + 5 log(5/11) + 6 log(6/11)]
  = 8 [log(8/20) − (3/8 log(3/9) + 5/8 log(5/11))] + 12 [log(12/20) − (6/12 log(6/9) + 6/12 log(6/11))]

\[ = 20 \sum_y \hat{P}(y)\Big[\log \hat{P}(y) - \sum_x \hat{P}(x \mid y) \log \hat{P}(y \mid x)\Big]
   = 20 \sum_{x,y} \hat{P}(x, y)\big[\log \hat{P}(y) - \log \hat{P}(y \mid x)\big] \]

where P̂ denotes the empirical probabilities estimated from the data.
MUTUAL INFORMATION PROPERTIES

I_P(X; Y), the mutual information between X and Y:

\[ I_P(X; Y) = \sum_{x,y} P(x, y) \log \frac{P(y \mid x)}{P(y)} = \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x)\,P(y)} = D(P \,\|\, Q) \quad \text{with } Q(x,y) = P(x)P(y) \]

• I_P(X; Y) ≥ 0 by nonnegativity of the KL divergence
• So the score difference on the previous slide is −M · I_P̂(X; Y) ≤ 0 with M = 20

Implication: ML scores do not decrease for more connected graphs
=> Overfitting to the data!
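The derivation can be checked numerically. A sketch for the coin dataset: it computes both maximum log-likelihoods and the empirical mutual information, and confirms that the score difference equals M·I(X;Y) in favor of the more connected graph (names are illustrative):

```python
import math
from collections import Counter

data = [("H", "H")] * 3 + [("H", "T")] * 6 + [("T", "H")] * 5 + [("T", "T")] * 6
M = len(data)
counts = Counter(data)
p_xy = {xy: c / M for xy, c in counts.items()}
p_x = {v: sum(p for (x, _), p in p_xy.items() if x == v) for v in "HT"}
p_y = {v: sum(p for (_, y), p in p_xy.items() if y == v) for v in "HT"}

# maximum log-likelihoods under each structure
loglik_g1 = sum(c * (math.log(p_x[x]) + math.log(p_y[y])) for (x, y), c in counts.items())
loglik_g2 = sum(c * (math.log(p_x[x]) + math.log(p_xy[(x, y)] / p_x[x]))
                for (x, y), c in counts.items())

# empirical mutual information I(X;Y)
mi = sum(p * math.log(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

# the two quantities agree: adding the edge X -> Y raises the ML score by M * I(X;Y) >= 0
print(loglik_g2 - loglik_g1, M * mi)
```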
POSSIBLE SOLUTIONS

• Fix the complexity of graphs (e.g., bounded in-degree)
  • See HW7
• Penalize complex graphs
  • Bayesian scores
IDEA OF BAYESIAN SCORING

• Note that the parameters are uncertain
• Bayesian approach: put a prior on parameter values and marginalize them out

\[ P(D \mid G) = \int_{\theta_G} P(D \mid \theta_G, G)\, P(\theta_G \mid G)\, d\theta_G \]

• For example, use Beta/Dirichlet priors => the marginal is manageable to compute
  • E.g., a uniform hyperparameter α over the network
  • Set the virtual counts to α · 2^(−|Pa(Xi)|)
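As a concrete (simplified) illustration of marginalizing out the parameters, this sketch computes the exact marginal likelihood P(D|G) for the two-coin example under uniform Beta(1,1) priors on every CPT entry; for such a prior, h heads and t tails give a marginal of h! t! / (h + t + 1)!. The Beta(1,1) choice is an assumption for illustration, not the Dirichlet/BDeu setup on the slide:

```python
from math import lgamma

def log_marginal_beta11(h, t):
    """log of integral of theta^h (1-theta)^t under a uniform Beta(1,1) prior
    = log[h! t! / (h + t + 1)!]."""
    return lgamma(h + 1) + lgamma(t + 1) - lgamma(h + t + 2)

# Counts from the 20-flip dataset: 3 HH, 6 HT, 5 TH, 6 TT
# G1 (no edge): one coin for X (9 H, 11 T) and one coin for Y (8 H, 12 T)
log_p_d_g1 = log_marginal_beta11(9, 11) + log_marginal_beta11(8, 12)

# G2 (X -> Y): a coin for X, plus one coin for Y under each value of X
log_p_d_g2 = (log_marginal_beta11(9, 11)
              + log_marginal_beta11(3, 6)    # Y | X=H: 3 heads, 6 tails
              + log_marginal_beta11(5, 6))   # Y | X=T: 5 heads, 6 tails

# the marginal likelihood prefers the simpler G1 on this data
print(log_p_d_g1, log_p_d_g2)
```

Unlike the ML score, the marginalized score penalizes G2's extra CPT automatically.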
LARGE SAMPLE APPROXIMATION

log P(D|G) = log L(θG*; D) − ½ (log M) Dim[G] + O(1)
• With M the number of samples and Dim[G] the number of free parameters of G
• Bayesian Information Criterion (BIC) score:
  ScoreBIC(G;D) = log L(θG*; D) − ½ (log M) Dim[G]
  • The first term rewards fitting the data set; the second prefers simple models
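A quick sketch of the BIC score on the same coin example, assuming Dim[G1] = 2 and Dim[G2] = 3 free parameters; the maximum log-likelihoods are the expressions from the ML-score slides:

```python
import math

M = 20
# Maximum log-likelihoods for the coin dataset (from the earlier slides)
loglik_g1 = (9 * math.log(9/20) + 11 * math.log(11/20)
             + 8 * math.log(8/20) + 12 * math.log(12/20))
loglik_g2 = (9 * math.log(9/20) + 11 * math.log(11/20)
             + 3 * math.log(3/9) + 6 * math.log(6/9)
             + 5 * math.log(5/11) + 6 * math.log(6/11))

def bic(loglik, dim, M):
    """Score_BIC(G;D) = log L(theta_G*; D) - 1/2 log(M) Dim[G]"""
    return loglik - 0.5 * math.log(M) * dim

print(bic(loglik_g1, 2, M), bic(loglik_g2, 3, M))   # the simpler G1 wins here
```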
STRUCTURE OPTIMIZATION, GIVEN A SCORE…

• The problem is well-defined, but combinatorially complex!
  • Superexponential in the number of variables
• Idea: search locally through the space of graphs using graph operators:
  • Add edge
  • Delete edge
  • Reverse edge
SEARCH STRATEGIES

• Greedy
  • Pick the operator that leads to the greatest Δ in score
  • Local minima? Plateaux?
• Overcoming plateaux
  • Search with basin flooding
  • Tabu search
  • Perturbation methods (similar to simulated annealing, except on data weighting)
• Implementation details:
  • Evaluate Δs between structures quickly (local decomposability)
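A hedged sketch of greedy score-based search over DAGs using the three operators, assuming a user-supplied scoring function `score(edges, data)` where higher is better (the function and data layout are illustrative assumptions):

```python
from itertools import permutations

def has_cycle(edges, nodes):
    """Detect a directed cycle in the arc set with a depth-first search."""
    children = {v: [c for (p, c) in edges if p == v] for v in nodes}
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in nodes}

    def visit(v):
        color[v] = GRAY
        for c in children[v]:
            if color[c] == GRAY or (color[c] == WHITE and visit(c)):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in nodes)

def neighbors(edges, nodes):
    """All DAGs reachable from `edges` by one add / delete / reverse operation."""
    candidates = []
    for x, y in permutations(nodes, 2):
        if (x, y) in edges:
            candidates.append(edges - {(x, y)})                # delete edge
            candidates.append(edges - {(x, y)} | {(y, x)})     # reverse edge
        elif (y, x) not in edges:
            candidates.append(edges | {(x, y)})                # add edge
    return [frozenset(c) for c in candidates if not has_cycle(c, nodes)]

def greedy_search(nodes, data, score):
    """Hill-climb from the empty graph; stop at a local optimum of the score."""
    edges = frozenset()
    best = score(edges, data)
    while True:
        options = neighbors(edges, nodes)
        if not options:
            return edges
        top = max(options, key=lambda e: score(e, data))
        if score(top, data) <= best:        # no operator improves the score
            return edges
        edges, best = top, score(top, data)
```

This naive version rescores every neighbor from scratch; exploiting local decomposability (only the families touched by an operator change score) is what makes the search practical.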
RECAP

• Bayes net structure learning: from data we can at best recover the equivalence class of networks that encode the same conditional independences
• Constraint-based methods
  • Statistical independence tests
• Score-based methods
  • Learning => optimization