Learning: Structure
Structure Learning
in Bayesian Networks
Eran Segal
Weizmann Institute
Structure Learning Motivation
Network structure is often unknown
Purposes of structure learning
Discover the dependency structure of the domain
Goes beyond statistical correlations between individual variables and
detects direct vs. indirect correlations
Set expectations: at best, we can recover the structure up to the I-equivalence class
Density estimation
Estimate a statistical model of the underlying distribution and use it
to reason with and predict new instances
Advantages of Accurate Structure
(Figure: networks over X1, X2, Y comparing the true structure to a version with a spurious edge and a version with a missing edge)
Spurious edge
Increases number of fitted parameters
Wrong causality and domain structure assumptions
Missing edge
Cannot be compensated by parameter estimation
Wrong causality and domain structure assumptions
Structure Learning Approaches
Constraint based methods
View the Bayesian network as representing dependencies
Find a network that best explains dependencies
Limitation: sensitive to errors in single dependencies
Score based approaches
View learning as a model selection problem
Define a scoring function specifying how well the model fits the data
Search for a high-scoring network structure
Limitation: super-exponential search space
Bayesian model averaging methods
Average predictions across all possible structures
Can be done exactly (some cases) or approximately
Constraint Based Approaches
Goal: Find the best minimal I-Map for the domain
G is an I-Map for P if I(G) ⊆ I(P)
Minimal I-Map if deleting an edge from G renders it not an I-Map
G is a P-Map for P if I(G)=I(P)
Strategy
Query the distribution for independence relationships that
hold between sets of variables
Construct a network which is the best minimal I-Map for P
Constructing Minimal I-Maps
Reverse factorization theorem
If P(X_1, ..., X_n) = \prod_{i=1}^n P(X_i | Pa(X_i)), then G is an I-Map of P
Algorithm for constructing a minimal I-Map
Fix an ordering of nodes X1,…,Xn
Select parents of Xi as a minimal subset of X1,…,Xi-1
such that Ind(Xi ; {X1,…,Xi-1} − Pa(Xi) | Pa(Xi))
(Outline of) Proof of minimal I-Map
I-Map since the factorization holds by construction
Minimal since, by construction, removing any one edge destroys the factorization
Limitations
Independence queries above involve a large number of variables
Construction involves a large number of queries (2^{i-1} subsets per node)
We do not know the ordering, and the network is sensitive to it
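A minimal Python sketch of this construction, assuming a hypothetical independence oracle ind(x, ys, zs) that answers whether Ind(x ; ys | zs) holds in P; as noted above, it issues an exponential number of queries.

```python
from itertools import combinations

def build_minimal_imap(order, ind):
    """Construct a minimal I-Map for a fixed ordering of the variables.

    `ind(x, ys, zs)` is a hypothetical oracle answering whether
    Ind(x ; ys | zs) holds in the distribution P.
    """
    parents = {}
    for i, x in enumerate(order):
        prev = order[:i]
        chosen = list(prev)  # fall back: all predecessors are parents
        # Try candidate parent sets in order of increasing size, so the
        # first satisfying set is minimal; this issues up to 2^(i-1) queries.
        for k in range(len(prev) + 1):
            found = None
            for pa in combinations(prev, k):
                rest = [v for v in prev if v not in pa]
                if ind(x, rest, list(pa)):
                    found = list(pa)
                    break
            if found is not None:
                chosen = found
                break
        parents[x] = chosen
    return parents
```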
Constructing P-Maps
Simplifying assumptions
Network has bounded in-degree d per node
Oracle can answer Ind. queries of up to 2d+2 variables
Distribution P has a P-Map
Algorithm
Step I: Find skeleton
Step II: Find immoralities (v-structures)
Step III: Direct constrained edges
Step I: Identifying the Skeleton
For each pair X,Y query all Z for Ind(X;Y | Z)
X–Y is in skeleton if no Z is found
If graph in-degree is bounded by d, running time is O(n^{2d+2})
Since if no direct edge exists, Ind(X;Y | Pa(X), Pa(Y))
Reminder
If there is no Z for which Ind(X;Y | Z) holds,
then X→Y or Y→X in G*
Proof: Assume no Z exists, and G* has neither X→Y nor Y→X
Then, can find a set Z such that the path from X to Y is blocked
Then, G* implies Ind(X;Y | Z) and since G* is a P-Map
Contradiction
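A brute-force Python sketch of Step I, assuming the same kind of hypothetical independence oracle ind(x, y, zs) and an in-degree bound d:

```python
from itertools import combinations

def find_skeleton(variables, ind, d):
    """Step I: recover the undirected skeleton of the P-Map.

    X-Y is kept in the skeleton only if no separating set Z (of size at
    most 2d) satisfies Ind(X ; Y | Z); `ind` is a hypothetical oracle.
    """
    skeleton = set()
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        separated = False
        for k in range(min(2 * d, len(others)) + 1):
            if any(ind(x, y, list(z)) for z in combinations(others, k)):
                separated = True
                break
        if not separated:
            skeleton.add(frozenset((x, y)))
    return skeleton
```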
Step II: Identifying Immoralities
For each pair X,Y query candidate triplets X,Y,Z
X→Z←Y if no W is found that contains Z with Ind(X;Y | W)
If graph in-degree is bounded by d, running time is O(n^{2d+3})
If such a W exists with Ind(X;Y | W), and X–Z–Y is not an immorality, then Z ∈ W
Reminder
If there is no W such that Z is in W and Ind(X;Y | W),
then X→Z←Y is an immorality
Proof: Assume no W exists but X–Z–Y is not an immorality
Then, either X→Z→Y or X←Z←Y or X←Z→Y exists
But then, we can block X–Z–Y by conditioning on Z
Then, since X and Y are not connected, can find W that includes Z
such that Ind(X;Y | W)
Contradiction
Answering Independence Queries
Basic query
Determine whether two variables are independent
Well studied question in statistics
Common to frame query as hypothesis testing
Null hypothesis is H0
H0: Data was sampled from P*(X,Y)=P*(X)P*(Y)
Need a procedure that will Accept or Reject the hypothesis
\chi^2 test to assess deviance of data from hypothesis
d_{\chi^2}(D) = \sum_{x,y} \frac{(M[x,y] - M \hat{P}(x) \hat{P}(y))^2}{M \hat{P}(x) \hat{P}(y)}
Alternatively, use mutual information between X and Y
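As a concrete illustration, here is a small Python sketch of the χ² test on a joint count table M[x, y]; the acceptance threshold alpha and the use of scipy are assumptions of this sketch (scipy.stats.chi2_contingency offers an equivalent off-the-shelf test).

```python
import numpy as np
from scipy.stats import chi2

def chi_square_independence(counts, alpha=0.05):
    """Test H0: X and Y are independent, from a joint count table M[x, y].

    Compares observed counts with the counts expected under P^(x)P^(y),
    using the chi-square deviance statistic from the slide.
    """
    counts = np.asarray(counts, dtype=float)
    M = counts.sum()
    p_x = counts.sum(axis=1) / M          # marginal P^(x)
    p_y = counts.sum(axis=0) / M          # marginal P^(y)
    expected = M * np.outer(p_x, p_y)     # M * P^(x) * P^(y)
    stat = ((counts - expected) ** 2 / expected).sum()
    dof = (counts.shape[0] - 1) * (counts.shape[1] - 1)
    p_value = 1.0 - chi2.cdf(stat, dof)
    return stat, p_value, (p_value >= alpha)   # True = accept H0 (independent)

# Example: counts for binary X, Y
# print(chi_square_independence([[30, 10], [12, 28]]))
```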
Structure Based Approaches
Strategy
Define a scoring function for each candidate structure
Search for a high scoring structure
Key: choice of scoring function
Likelihood based scores
Bayesian based scores
Likelihood Scores
Goal: find (G, \theta) that maximize the likelihood
Score_L(G : D) = log P(D | G, \hat{\theta}_G), where \hat{\theta}_G is the MLE of the parameters for G
Find G that maximizes ScoreL(G:D)
Example: G_0 has X and Y disconnected; G_1 has X → Y
Score_L(G_0 : D) = \sum_m ( \log \hat{\theta}_{x[m]} + \log \hat{\theta}_{y[m]} )
Score_L(G_1 : D) = \sum_m ( \log \hat{\theta}_{x[m]} + \log \hat{\theta}_{y[m] | x[m]} )
Score_L(G_1 : D) - Score_L(G_0 : D) = \sum_m ( \log \hat{\theta}_{y[m] | x[m]} - \log \hat{\theta}_{y[m]} )
 = \sum_{x,y} M[x,y] \log \hat{\theta}_{y|x} - \sum_y M[y] \log \hat{\theta}_y
 = M \left( \sum_{x,y} \hat{P}(x,y) \log \hat{P}(y|x) - \sum_y \hat{P}(y) \log \hat{P}(y) \right)
 = M \left( \sum_{x,y} \hat{P}(x,y) \log \hat{P}(y|x) - \sum_{x,y} \hat{P}(x,y) \log \hat{P}(y) \right)
 = M \, I_{\hat{P}}(X ; Y)
General Decomposition
The likelihood score decomposes as:
Score_L(G : D) = M \sum_{i=1}^n I_{\hat{P}}(X_i ; Pa^G_{X_i}) - M \sum_{i=1}^n H_{\hat{P}}(X_i)
Proof:
Score_L(G : D) = \sum_{i=1}^n \sum_{u_i \in Val(Pa^G_{X_i})} \sum_{x_i} M[x_i, u_i] \log \hat{\theta}_{x_i | u_i}
 = M \sum_{i=1}^n \sum_{u_i} \sum_{x_i} \frac{M[x_i, u_i]}{M} \log \hat{\theta}_{x_i | u_i}
 = M \sum_{i=1}^n \sum_{u_i} \sum_{x_i} \hat{P}(x_i, u_i) \log \hat{P}(x_i | u_i)
For each i (writing U_i = Pa^G_{X_i}):
\sum_{u_i, x_i} \hat{P}(x_i, u_i) \log \hat{P}(x_i | u_i)
 = \sum_{u_i, x_i} \hat{P}(x_i, u_i) \log \frac{\hat{P}(x_i, u_i) \hat{P}(x_i)}{\hat{P}(u_i) \hat{P}(x_i)}
 = \sum_{u_i, x_i} \hat{P}(x_i, u_i) \log \frac{\hat{P}(x_i, u_i)}{\hat{P}(u_i) \hat{P}(x_i)} + \sum_{x_i} \hat{P}(x_i) \log \hat{P}(x_i)
 = I_{\hat{P}}(X_i ; U_i) - H_{\hat{P}}(X_i)
General Decomposition
The Likelihood score decomposes as:
Score_L(G : D) = M \sum_{i=1}^n I_{\hat{P}}(X_i ; Pa^G_{X_i}) - M \sum_{i=1}^n H_{\hat{P}}(X_i)
Second term does not depend on network structure and
thus is irrelevant for selecting between two structures
Score increases as the mutual information, or strength of
dependence, between connected variables increases
After some manipulation we can show:
Score_L(G : D) = -M H_{\hat{P}}(X_1, ..., X_n) - M \sum_{i=1}^n I_{\hat{P}}(X_i ; \{X_1, ..., X_{i-1}\} \setminus Pa^G_{X_i} \mid Pa^G_{X_i})
The second term measures to what extent the Markov assumptions implied by G are valid in the empirical distribution
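A small Python sketch of this decomposition, computing Score_L(G : D) from empirical mutual informations and entropies; the pandas-based helpers and the parent-set dictionary format are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

def entropy(p):
    """Entropy of a probability vector (natural log)."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def mutual_information(df, x, parents):
    """Empirical mutual information I_P^(X ; Parents) from a DataFrame."""
    if not parents:
        return 0.0
    M = len(df)
    joint = df.groupby([x] + parents).size().values / M
    p_x = df.groupby(x).size().values / M
    p_pa = df.groupby(parents).size().values / M
    return entropy(p_x) + entropy(p_pa) - entropy(joint)

def likelihood_score(df, structure):
    """Score_L(G : D) = M * sum_i I_P^(X_i ; Pa_i) - M * sum_i H_P^(X_i).

    `structure` maps each variable name to its list of parent names.
    """
    M = len(df)
    score = 0.0
    for x, parents in structure.items():
        p_x = df.groupby(x).size().values / M
        score += M * mutual_information(df, x, parents) - M * entropy(p_x)
    return score
```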
Limitations of Likelihood Score
(G_0: X and Y disconnected; G_1: X → Y)
Score_L(G_1 : D) - Score_L(G_0 : D) = M \, I_{\hat{P}}(X ; Y)
Since I_{\hat{P}}(X ; Y) ≥ 0, Score_L(G_1 : D) ≥ Score_L(G_0 : D)
Adding arcs always helps
Maximal scores attained for fully connected network
Such networks overfit the data (i.e., fit the noise in the data)
Avoiding Overfitting
Classical problem in machine learning
Solutions
Restricting the hypothesis space
Limits the overfitting capability of the learner
Example: restrict # of parents or # of parameters
Minimum description length
Description length measures complexity
Prefer models that compactly describe the training data
Bayesian methods
Average over all possible parameter values
Use prior knowledge
Bayesian Score
P(G | D) = \frac{P(D | G) P(G)}{P(D)}
P(D | G) is the marginal likelihood, P(G) is the prior over structures, and P(D) is the marginal probability of the data
P(D) does not depend on the network
Bayesian Score: Score_B(G : D) = log P(D | G) + log P(G)
Marginal Likelihood of Data Given G
Bayesian Score: Score_B(G : D) = log P(D | G) + log P(G)
P(D | G) = \int_{\Theta_G} P(D | G, \theta_G) P(\theta_G | G) \, d\theta_G
P(D | G, \theta_G) is the likelihood and P(\theta_G | G) is the prior over parameters
Note similarity to maximum likelihood score, but with the key
difference that ML finds maximum of likelihood and here we
compute average of the terms over parameter space
Marginal Likelihood: Binomial Case
Assume a sequence of M coin tosses x[1], ..., x[M]
By the chain rule for probabilities
P(x[1], ..., x[M]) = P(x[1]) \cdot P(x[2] | x[1]) \cdots P(x[M] | x[1], ..., x[M-1])
Recall that for Dirichlet priors
P(x[m+1] = H | x[1], ..., x[m]) = \frac{M_H^m + \alpha_H}{m + \alpha}
where M_H^m is the number of heads in the first m examples and \alpha = \alpha_H + \alpha_T
Therefore
P(x[1], ..., x[M]) = \frac{[\alpha_H (\alpha_H + 1) \cdots (\alpha_H + M_H - 1)] \, [\alpha_T (\alpha_T + 1) \cdots (\alpha_T + M_T - 1)]}{\alpha (\alpha + 1) \cdots (\alpha + M - 1)}
Marginal Likelihood: Binomial Case
Simplify using \Gamma(x+1) = x \Gamma(x), so that x (x+1) \cdots (x + M - 1) = \frac{\Gamma(x + M)}{\Gamma(x)}
P(x[1], ..., x[M]) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + M)} \cdot \frac{\Gamma(\alpha_H + M_H)}{\Gamma(\alpha_H)} \cdot \frac{\Gamma(\alpha_T + M_T)}{\Gamma(\alpha_T)}
For multinomials with a Dirichlet prior
P(x[1], ..., x[M]) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + M)} \prod_{i=1}^k \frac{\Gamma(\alpha_i + M[x^i])}{\Gamma(\alpha_i)}
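A direct Python translation of the closed form, using log-Gamma functions for numerical stability; the uniform default hyperparameters are an assumption of this sketch.

```python
from scipy.special import gammaln

def log_marginal_likelihood_binomial(M_H, M_T, alpha_H=1.0, alpha_T=1.0):
    """log P(x[1..M]) for coin-toss data under a Beta/Dirichlet prior,
    using the Gamma-function closed form from the slide."""
    alpha = alpha_H + alpha_T
    M = M_H + M_T
    return (gammaln(alpha) - gammaln(alpha + M)
            + gammaln(alpha_H + M_H) - gammaln(alpha_H)
            + gammaln(alpha_T + M_T) - gammaln(alpha_T))

def log_marginal_likelihood_multinomial(counts, alphas):
    """General multinomial case: counts M[x^i] and hyperparameters alpha_i."""
    alpha = sum(alphas)
    M = sum(counts)
    res = gammaln(alpha) - gammaln(alpha + M)
    for m_i, a_i in zip(counts, alphas):
        res += gammaln(a_i + m_i) - gammaln(a_i)
    return res
```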
Marginal Likelihood Example
Actual experiment with P(H) = 0.25
(Plot: (log P(D)) / M as a function of M, for Dirichlet(.5,.5), Dirichlet(1,1), and Dirichlet(5,5) priors)
Marginal Likelihood: BayesNets
Network structure
determines form of
marginal likelihood
Data (7 instances):
m:  1  2  3  4  5  6  7
X:  H  T  T  H  T  H  H
Y:  H  T  H  H  T  T  H
Network 1 (X and Y disconnected): two Dirichlet marginal likelihoods
P(X[1], ..., X[7]) \cdot P(Y[1], ..., Y[7])
Marginal Likelihood: BayesNets
Network structure determines form of marginal likelihood (same data as above)
Network 2 (X → Y): three Dirichlet marginal likelihoods
P(X[1], ..., X[7]) \cdot P(Y[1], Y[4], Y[6], Y[7]) \cdot P(Y[2], Y[3], Y[5])
The Y instances are grouped by the value of their parent X: X = H in instances 1, 4, 6, 7 and X = T in instances 2, 3, 5
Idealized Experiment
P(X = H) = 0.5
P(Y = H|X = H) = 0.5 + p
P(Y = H|X = T) = 0.5 - p
(Plot: (log P(D)) / M as a function of M, for the independent model and for p = 0.05, 0.10, 0.15, 0.20)
Marginal Likelihood: BayesNets
The marginal likelihood has the form:
P(D | G) = \prod_i \prod_{pa_i^G} \frac{\Gamma(\alpha_{pa_i^G})}{\Gamma(\alpha_{pa_i^G} + M[pa_i^G])} \prod_{x_i} \frac{\Gamma(\alpha_{x_i, pa_i^G} + M[x_i, pa_i^G])}{\Gamma(\alpha_{x_i, pa_i^G})}
where \alpha_{pa_i^G} = \sum_{x_i} \alpha_{x_i, pa_i^G}
Each inner term is a Dirichlet marginal likelihood for the sequence of values of X_i when X_i's parents take a particular value
M[..] are the counts from the data
\alpha_{..} are the hyperparameters for each family
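A sketch of computing log P(D | G) family by family from a pandas DataFrame of discrete data; using a single uniform hyperparameter per entry (rather than a full BDe prior) is an assumption of this sketch.

```python
import numpy as np
import pandas as pd
from scipy.special import gammaln

def family_log_marginal(df, x, parents, alpha=1.0):
    """log of one family's contribution to P(D | G): a Dirichlet marginal
    likelihood per joint parent assignment, with hyperparameter `alpha`
    for every (x_i, pa_i) entry."""
    values = df[x].unique()
    groups = df.groupby(parents) if parents else [(None, df)]
    total = 0.0
    for _, group in groups:
        counts = np.array([(group[x] == v).sum() for v in values], dtype=float)
        a = np.full(len(values), alpha)
        total += (gammaln(a.sum()) - gammaln(a.sum() + counts.sum())
                  + (gammaln(a + counts) - gammaln(a)).sum())
    return total

def log_marginal_likelihood(df, structure, alpha=1.0):
    """log P(D | G) as the sum of per-family log marginal likelihoods."""
    return sum(family_log_marginal(df, x, pa, alpha)
               for x, pa in structure.items())
```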
Priors
Bayesian Score: Score_B(G : D) = log P(D | G) + log P(G)
Structure prior P(G)
Uniform prior: P(G) \propto constant
Prior penalizing number of edges: P(G) \propto c^{|G|} (0 < c < 1)
Normalizing constant across networks is similar and can
thus be ignored
Priors
Bayesian Score: Score_B(G : D) = log P(D | G) + log P(G)
Parameter prior P(|G)
BDe prior
M0: equivalent sample size
B0: network representing the prior probability of events
Set \alpha(x_i, pa_i^G) = M_0 \cdot P(x_i, pa_i^G | B_0)
Note: paiG are not the same as parents of Xi in B0
Compute P(xi,paiG| B0) using standard inference in B0
BDe has the desirable property that I-equivalent networks have the
same Bayesian score when using the BDe prior for some M’ and P’
Bayesian Score: Asymptotic Behavior
For M → ∞, a network G with Dirichlet priors satisfies
log P(D | G) = \ell(\hat{\theta}_G : D) - \frac{\log M}{2} Dim(G) + O(1)
Dim(G): number of independent parameters in G
The approximation is called the BIC score
Score_{BIC}(G : D) = \ell(\hat{\theta}_G : D) - \frac{\log M}{2} Dim(G)
 = M \sum_{i=1}^n I_{\hat{P}}(X_i ; Pa^G_{X_i}) - M \sum_{i=1}^n H_{\hat{P}}(X_i) - \frac{\log M}{2} Dim(G)
Score exhibits tradeoff between fit to data and complexity
Mutual information grows linearly with M while complexity
grows logarithmically with M
As M grows, more emphasis is given to the fit to the data
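A minimal BIC sketch on top of a likelihood routine like the one sketched earlier; `log_likelihood_fn` and `cardinalities` are placeholders for whatever representation is in use.

```python
import numpy as np

def bic_score(df, structure, log_likelihood_fn, cardinalities):
    """Score_BIC(G : D) = l(theta^_G : D) - (log M / 2) * Dim(G).

    `log_likelihood_fn(df, structure)` is assumed to return the maximized
    log-likelihood (e.g. the likelihood score sketched earlier);
    `cardinalities[X]` is the number of values of variable X.
    """
    M = len(df)
    dim = 0
    for x, parents in structure.items():
        # Independent parameters per family: (|Val(X)| - 1) * prod |Val(parent)|
        n_pa = int(np.prod([cardinalities[p] for p in parents])) if parents else 1
        dim += (cardinalities[x] - 1) * n_pa
    return log_likelihood_fn(df, structure) - 0.5 * np.log(M) * dim
```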
Bayesian Score: Asymptotic Behavior
For M → ∞, a network G with Dirichlet priors satisfies
log P(D | G) = \ell(\hat{\theta}_G : D) - \frac{\log M}{2} Dim(G) + O(1)
Bayesian score is consistent
As M → ∞, the true structure G* maximizes the score
Spurious edges will not contribute to likelihood and will be penalized
Required edges will be added because the likelihood term grows linearly in M
while the model complexity penalty grows only logarithmically in M
Summary: Network Scores
Likelihood, MDL, (log) BDe have the form
Score(G : D) = \sum_i Score(X_i | Pa^G_{X_i} : D)
BDe requires assessing prior network
Can naturally incorporate prior knowledge
BDe is consistent and asymptotically equivalent (up to
a constant) to BIC/MDL
All are score-equivalent
If G is I-equivalent to G’, then Score(G) = Score(G’)
Optimization Problem
Input:
Training data
Scoring function (including priors, if needed)
Set of possible structures
Including prior knowledge about structure
Output:
A network (or networks) that maximize the score
Key Property:
Decomposability: the score of a network is a sum of terms.
Score(G : D) = \sum_i Score(X_i | Pa^G_{X_i} : D)
Learning Trees
Trees
At most one parent per variable
Why trees?
Elegant math
we can solve the optimization problem efficiently
(with a greedy algorithm)
Sparse parameterization
avoid overfitting while adapting to the data
Learning Trees
Let p(i) denote parent of Xi, or 0 if Xi has no parent
We can write the score as
Score(G : D) = \sum_i Score(X_i : Pa_i)
 = \sum_{i : p(i) > 0} Score(X_i : X_{p(i)}) + \sum_{i : p(i) = 0} Score(X_i)
 = \sum_{i : p(i) > 0} [ Score(X_i : X_{p(i)}) - Score(X_i) ] + \sum_i Score(X_i)
The first sum is the improvement over the “empty” network; the second is the score of the “empty” network
Score = sum of edge scores + constant
Learning Trees
Algorithm
Construct graph with vertices: 1,...n
Set w(i→j) = Score(X_j | X_i) − Score(X_j)
Find tree (or forest) with maximal weight
This can be done using standard algorithms in low-order polynomial
time by building a tree in a greedy fashion
(Kruskal’s maximum spanning tree algorithm)
Theorem: Procedure finds the tree with maximal score
When the score is the likelihood, w(i→j) is proportional
to I(X_i ; X_j). This is known as the Chow & Liu method
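A sketch of the Chow & Liu procedure with pairwise empirical mutual information as edge weights; networkx's maximum spanning tree (Kruskal) does the optimization, and the pandas input format is an assumption of this sketch.

```python
import numpy as np
import pandas as pd
import networkx as nx

def empirical_mi(df, x, y):
    """Empirical mutual information I_P^(X ; Y) between two columns."""
    M = len(df)
    joint = df.groupby([x, y]).size() / M
    px = df.groupby(x).size() / M
    py = df.groupby(y).size() / M
    mi = 0.0
    for (vx, vy), p in joint.items():
        mi += p * np.log(p / (px[vx] * py[vy]))
    return mi

def chow_liu_tree(df):
    """Chow & Liu: weight each pair by I(X_i ; X_j) and take a maximum
    spanning tree; edge directions can then be assigned arbitrarily."""
    g = nx.Graph()
    cols = list(df.columns)
    g.add_nodes_from(cols)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            g.add_edge(cols[i], cols[j],
                       weight=empirical_mi(df, cols[i], cols[j]))
    return nx.maximum_spanning_tree(g)
```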
Learning Trees: Example
Tree learned from data generated by the Alarm network
(Figure: the learned tree over the Alarm network variables, with correct edges and spurious edges marked)
Not every edge in the tree is in the original network
Tree direction is arbitrary --- we can’t learn about arc direction
Beyond Trees
Problem is not easy for more complex networks
Example: if we allow two parents per variable, the greedy algorithm is no
longer guaranteed to find the optimal network
In fact, no efficient algorithm exists
Theorem:
Finding maximal scoring network structure with at most k
parents for each variable is NP-hard for k>1
Fixed Ordering
For any decomposable scoring function Score(G : D) and ordering ≺,
the maximal scoring network has:
Pa^G_{X_i} = \arg\max_{U \subseteq \{X_j : X_j \prec X_i\}} Score(X_i | U : D)
(since the choice at X_i does not constrain the other choices)
For fixed ordering we have independent problems
If we bound the in-degree per variable by d, then
complexity is exponential in d
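With a fixed ordering, each variable's parent set can be chosen independently by enumerating candidate subsets of its predecessors; a sketch, where `family_score` is a placeholder for any decomposable family score (e.g. a BIC or BDe family term).

```python
from itertools import combinations

def best_network_given_order(order, d, family_score):
    """For a fixed ordering and in-degree bound d, pick each variable's
    parents independently by maximizing the family score.

    `family_score(x, parents)` stands in for Score(X_i | U : D).
    """
    parents = {}
    for i, x in enumerate(order):
        preds = order[:i]
        # All subsets of the predecessors with at most d elements.
        candidates = [list(u) for k in range(min(d, len(preds)) + 1)
                      for u in combinations(preds, k)]
        parents[x] = max(candidates, key=lambda u: family_score(x, u))
    return parents
```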
Heuristic Search
We address the problem by using heuristic search
Define a search space:
nodes are possible structures
edges denote adjacency of structures
Traverse this space looking for high-scoring structures
Search techniques:
Greedy hill-climbing
Best first search
Simulated Annealing
...
Heuristic Search
Typical operations: add an edge, delete an edge, reverse an edge
(Figure: a network over A, B, C, D and the neighboring structures produced by these operations)
Exploiting Decomposability
(Figure: neighboring structures over A, B, C, D that differ in a single edge; only the changed families need to be rescored)
Caching: To update the score after a local
change, we only need to rescore the families
that were changed in the last move
Greedy Hill Climbing
Simplest heuristic local search
Start with a given network
empty network, best tree, or a random network
At each iteration
Evaluate all possible changes
Apply the change that leads to the best improvement in score
Reiterate
Stop when no modification improves the score
Each step requires evaluating O(n) new changes
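A minimal sketch of greedy hill climbing over parent-set dictionaries; for brevity it uses only edge addition and deletion (no reversal) and rescores full structures instead of caching family scores, both simplifications relative to the procedure described above.

```python
def greedy_hill_climb(variables, score_fn, start=None):
    """Greedy hill climbing over DAG structures, represented as a dict
    mapping each variable to its parent list.

    `score_fn(structure)` is a placeholder for a decomposable score
    such as BIC; only add/delete moves are used in this sketch.
    """
    structure = start or {x: [] for x in variables}
    current = score_fn(structure)
    while True:
        best_delta, best_move = 0.0, None
        for x in variables:
            for y in variables:
                if x == y:
                    continue
                neighbor = {v: list(ps) for v, ps in structure.items()}
                if y in neighbor[x]:
                    neighbor[x].remove(y)          # try deleting edge y -> x
                else:
                    neighbor[x].append(y)          # try adding edge y -> x
                    if creates_cycle(neighbor):    # keep the graph acyclic
                        continue
                delta = score_fn(neighbor) - current
                if delta > best_delta:
                    best_delta, best_move = delta, neighbor
        if best_move is None:                      # no improving move: stop
            return structure
        structure, current = best_move, current + best_delta


def creates_cycle(structure):
    """Check whether the parent-set dictionary contains a directed cycle."""
    visiting, done = set(), set()

    def dfs(v):
        if v in visiting:
            return True
        if v in done:
            return False
        visiting.add(v)
        cyclic = any(dfs(p) for p in structure[v])
        visiting.discard(v)
        done.add(v)
        return cyclic

    return any(dfs(v) for v in structure)
```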
Greedy Hill Climbing Pitfalls
Greedy Hill-Climbing can get stuck in:
Local maxima
All one-edge changes reduce the score
Plateaus
Some one-edge changes leave the score unchanged
Happens because equivalent networks receive the same score and
are neighbors in the search space
Both occur during structure search
Standard heuristics can escape both
Random restarts
TABU search
Equivalence Class Search
Idea
Search the space of equivalence classes
Equivalence classes can be represented by PDAGs (partially
directed acyclic graphs)
Advantages
PDAGs space has fewer local maxima and plateaus
There are fewer PDAGs than DAGs
Equivalence Class Search
Evaluating changes is more expensive
In addition to search, need to score a consistent network
(Figure: an original PDAG over X, Y, Z; the operation “Add Y—Z” yields a new PDAG, which is then instantiated as a consistent DAG in order to compute the score)
These algorithms are more complex to implement
Learning Example: Alarm Network
(Plot: KL divergence from the true distribution as a function of M, comparing learning with the true structure to learning with unknown structure, both with a BDe prior, M' = 10)
Model Selection
So far, we focused on single model
Find best scoring model
Use it to predict next example
Implicit assumption:
Best scoring model dominates the weighted sum
Pros:
Valid with many data instances
We get a single structure
Allows for efficient use in our tasks
Cons:
We are committing to the independencies of a particular structure
Other structures might be as probable given the data
Model Selection
Density estimation
One structure may suffice, if its joint distribution is similar to
other high scoring structures
Structure discovery
Define features f(G) (e.g., edge, sub-structure, d-sep query)
Compute P(f | D) = \sum_G f(G) P(G | D)
Still requires summing over exponentially many structures
Model Averaging Given an Order
Assumptions
Known total order ≺ of the variables
Maximum in-degree d for each variable
Marginal likelihood
P(D | ≺) = \sum_{G \in \mathcal{G}_{d,\prec}} P(D | G) P(G | ≺)
 = \sum_{G \in \mathcal{G}_{d,\prec}} \prod_i \exp( FamScore_B(X_i | Pa^G_{X_i} : D) )
 = \prod_i \sum_{U \subseteq \{X_j : X_j \prec X_i\}, |U| \le d} \exp( FamScore_B(X_i | U : D) )
The first step uses the decomposability assumption on the prior P(G | ≺); the last step holds since, given the ordering ≺, the parent choices are independent
Cost per family: O(n^d)
Total cost: O(n^{d+1})
Model Averaging Given an Order
Posterior probability of a general feature
P(f | ≺, D) = \frac{P(f, D | ≺)}{P(D | ≺)}
Posterior probability of a particular choice of parents
P(Pa^G_{X_i} = U | D, ≺) = \frac{\exp( FamScore_B(X_i | U : D) )}{\sum_{U' \subseteq \{X_j : X_j \prec X_i\}, |U'| \le d} \exp( FamScore_B(X_i | U' : D) )}
Posterior probability of a particular edge choice
P(X_j \in Pa^G_{X_i} | D, ≺) = \frac{\sum_{U : X_j \in U, \, U \subseteq \{X_k : X_k \prec X_i\}, |U| \le d} \exp( FamScore_B(X_i | U : D) )}{\sum_{U \subseteq \{X_k : X_k \prec X_i\}, |U| \le d} \exp( FamScore_B(X_i | U : D) )}
All terms for the other families cancel out
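A sketch of the last formula: the posterior of an edge X_j → X_i given an ordering, computed by enumerating the parent sets of X_i and normalizing their exponentiated family scores. `fam_score` is a placeholder for FamScore_B.

```python
from itertools import combinations
import numpy as np
from scipy.special import logsumexp

def edge_posterior_given_order(order, d, fam_score, x_j, x_i):
    """P(X_j in Pa(X_i) | D, ordering): ratio of summed exp-family-scores
    over parent sets containing X_j to the sum over all parent sets."""
    preds = order[:order.index(x_i)]
    subsets = [list(u) for k in range(min(d, len(preds)) + 1)
               for u in combinations(preds, k)]
    scores = np.array([fam_score(x_i, u) for u in subsets])
    with_edge = np.array([x_j in u for u in subsets])
    if not with_edge.any():        # X_j is not a predecessor of X_i
        return 0.0
    # All other families' terms cancel, so only X_i's family is needed.
    return float(np.exp(logsumexp(scores[with_edge]) - logsumexp(scores)))
```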
Model Averaging
We cannot assume that order is known
Solution: Sample from posterior distribution of P(G|D)
Estimate feature probability by P(f | D) \approx \frac{1}{K} \sum_{k=1}^K f(G_k)
Sampling can be done by MCMC
Sampling can be done over orderings
Notes on Learning Local Structures
Define score with local structures
Example: in tree CPDs, the score decomposes by leaves
Prior may need to be extended
Example: in tree CPDs, a penalty for the tree structure of each CPD
Extend search operators to local structure
Example: in tree CPDs, we need to search for tree structure
Can be done by local encapsulated search or by defining
new global operations
Search: Summary
Discrete optimization problem
In general, NP-Hard
Need to resort to heuristic search
In practice, search is relatively fast (~100 vars in ~10 min):
Decomposability
Sufficient statistics
In some cases, we can reduce the search problem to
an easy optimization problem
Example: learning trees