
Learning Bayesian networks
Slides by Nir Friedman
Learning Bayesian networks

[Figure: Data + prior information are fed to an Inducer, which outputs a
Bayesian network over E, B, R, A, C together with its CPTs, e.g. the table
for P(A | E, B):]

  E   B     P(A | E,B)
  e   b      .9    .1
  e   ¬b     .7    .3
  ¬e  b      .8    .2
  ¬e  ¬b     .99   .01

Known Structure -- Incomplete Data

• Network structure is specified
• Data contains missing values
• We consider assignments to missing values

[Figure: the data below, together with a CPT P(A | E,B) whose entries are all
unknown ("?"), are fed to the Inducer, which outputs the network E -> A <- B
with estimated parameters.]

Data over E, B, A ("?" = missing):
  <Y,N,N>
  <Y,?,Y>
  <N,N,Y>
  <N,Y,?>
   ...
  <?,Y,Y>

Estimated CPT:
  E   B     P(A | E,B)
  e   b      .9    .1
  e   ¬b     .7    .3
  ¬e  b      .8    .2
  ¬e  ¬b     .99   .01

Known Structure / Complete Data

• Given:
  - a network structure G
  - a choice of parametric family for P(Xi | Pai)
• Learn: the parameters of the network from complete data
• Goal: construct a network that is "closest" to the probability
  distribution that generated the data

Maximum Likelihood Estimation in Binomial Data

• Applying the MLE principle we get

    \hat{\theta} = \frac{N_H}{N_H + N_T}

  (which coincides with what one would expect)

• Example: (N_H, N_T) = (3, 2)
  The MLE estimate is 3/5 = 0.6

[Figure: plot of the likelihood L(\theta : D) for \theta in [0, 1].]

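As a small illustration (not part of the slides), a Python sketch of this estimator:

```python
def binomial_mle(n_heads, n_tails):
    """MLE of theta = P(heads) from counts: theta_hat = N_H / (N_H + N_T)."""
    return n_heads / (n_heads + n_tails)

print(binomial_mle(3, 2))  # 0.6, the slide's example
```
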
Learning Parameters for a Bayesian Network

• Training data has the form:

    D = \begin{pmatrix}
          E[1] & B[1] & A[1] & C[1] \\
          \vdots & \vdots & \vdots & \vdots \\
          E[M] & B[M] & A[M] & C[M]
        \end{pmatrix}

[Figure: the network E -> A <- B, A -> C.]

Learning Parameters for a Bayesian Network

• Since we assume i.i.d. samples, the likelihood function is

    L(\theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \theta)

[Figure: the network E -> A <- B, A -> C.]

Learning Parameters for a Bayesian Network

• By the definition of the network, we get

    L(\theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \theta)
                  = \prod_m P(E[m] : \theta) \, P(B[m] : \theta) \,
                            P(A[m] | B[m], E[m] : \theta) \, P(C[m] | A[m] : \theta)

[Figure: the network E -> A <- B, A -> C, and the training-data matrix D shown above.]

Learning Parameters for a Bayesian Network

• Rewriting terms, we get

    L(\theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \theta)
                  = \prod_m P(E[m] : \theta) \cdot \prod_m P(B[m] : \theta)
                    \cdot \prod_m P(A[m] | B[m], E[m] : \theta)
                    \cdot \prod_m P(C[m] | A[m] : \theta)

[Figure: the network E -> A <- B, A -> C, and the training-data matrix D shown above.]

General Bayesian Networks

Generalizing to any Bayesian network:

    L(\theta : D) = \prod_m P(x_1[m], \ldots, x_n[m] : \theta)                 (i.i.d. samples)
                  = \prod_m \prod_i P(x_i[m] | Pa_i[m] : \theta_i)             (network factorization)
                  = \prod_i \prod_m P(x_i[m] | Pa_i[m] : \theta_i)
                  = \prod_i L_i(\theta_i : D)

• The likelihood decomposes according to the structure of the network.

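To make the decomposition concrete, here is a minimal Python sketch (not part of the slides); the data layout, the dictionary-based CPT representation, and all names are illustrative assumptions:

```python
import math

def log_likelihood(data, structure, cpts):
    """Decomposed log-likelihood: sum over families i and samples m of
    log P(x_i[m] | pa_i[m] : theta_i)."""
    total = 0.0
    for var, parents in structure.items():      # product over families i ...
        for sample in data:                      # ... and over samples m
            pa_vals = tuple(sample[p] for p in parents)
            total += math.log(cpts[var][(sample[var], pa_vals)])
    return total

# Toy instance shaped like the slides' network E -> A <- B, A -> C (numbers are made up)
structure = {"E": (), "B": (), "A": ("E", "B"), "C": ("A",)}
cpts = {
    "E": {(1, ()): 0.1, (0, ()): 0.9},
    "B": {(1, ()): 0.2, (0, ()): 0.8},
    "A": {(1, (1, 1)): 0.9, (0, (1, 1)): 0.1,
          (1, (1, 0)): 0.7, (0, (1, 0)): 0.3,
          (1, (0, 1)): 0.8, (0, (0, 1)): 0.2,
          (1, (0, 0)): 0.01, (0, (0, 0)): 0.99},
    "C": {(1, (1,)): 0.6, (0, (1,)): 0.4,
          (1, (0,)): 0.05, (0, (0,)): 0.95},
}
data = [{"E": 0, "B": 1, "A": 1, "C": 1}, {"E": 0, "B": 0, "A": 0, "C": 0}]
print(log_likelihood(data, structure, cpts))
```

Because the total splits into one inner sum per family, each theta_i can be fit by looking only at its own term L_i(theta_i : D).
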
General Bayesian Networks (Cont.)

Complete Data  =>  Decomposition  =>  Independent Estimation Problems

If the parameters for each family are not related, then they can be
estimated independently of each other.
(Not true in genetic linkage analysis.)

Learning Parameters: Summary

• For multinomials we collect sufficient statistics, which are simply
  the counts N(x_i, pa_i)

• Parameter estimation:

    MLE:                          \hat{\theta}_{x_i | pa_i} = \frac{N(x_i, pa_i)}{N(pa_i)}

    Bayesian (Dirichlet prior):   \tilde{\theta}_{x_i | pa_i} = \frac{\alpha(x_i, pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}

• Bayesian methods also require a choice of priors
• Both the MLE and the Bayesian estimates are asymptotically equivalent and consistent.

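A small Python sketch (not part of the slides) of both estimators from raw counts; the helper name, the data layout, and the uniform pseudo-count alpha are illustrative assumptions, with alpha(pa_i) taken as the sum of the per-value pseudo-counts:

```python
from collections import Counter

def estimate_cpt(samples, child, parents, child_values, alpha=0.0):
    """theta_{x|pa} = (alpha + N(x, pa)) / (len(child_values)*alpha + N(pa)).

    alpha = 0 gives the MLE; alpha > 0 gives the Bayesian (Dirichlet) estimate
    with a uniform prior of alpha pseudo-counts per child value."""
    n_joint, n_pa = Counter(), Counter()
    for s in samples:
        pa = tuple(s[p] for p in parents)
        n_joint[(s[child], pa)] += 1
        n_pa[pa] += 1
    cpt = {}
    for pa in n_pa:
        for x in child_values:
            cpt[(x, pa)] = (alpha + n_joint[(x, pa)]) / (len(child_values) * alpha + n_pa[pa])
    return cpt

samples = [{"E": 1, "B": 0, "A": 1}, {"E": 1, "B": 0, "A": 0}, {"E": 0, "B": 0, "A": 0}]
print(estimate_cpt(samples, "A", ("E", "B"), child_values=(0, 1)))            # MLE
print(estimate_cpt(samples, "A", ("E", "B"), child_values=(0, 1), alpha=1))   # Dirichlet-smoothed
```
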
Known Structure -- Incomplete Data

• Network structure is specified
• Data contains missing values
• We consider assignments to missing values

[Figure: the data below, together with a CPT P(A | E,B) whose entries are all
unknown ("?"), are fed to the Inducer, which outputs the network E -> A <- B
with estimated parameters.]

Data over E, B, A ("?" = missing):
  <Y,N,N>
  <Y,?,Y>
  <N,N,Y>
  <N,Y,?>
   ...
  <?,Y,Y>

Estimated CPT:
  E   B     P(A | E,B)
  e   b      .9    .1
  e   ¬b     .7    .3
  ¬e  b      .8    .2
  ¬e  ¬b     .99   .01

Learning Parameters from Incomplete Data

[Figure: plate model over samples m, with X[m] -> Y[m] and parameter nodes
\theta_X, \theta_{Y|X=H}, \theta_{Y|X=T}.]

Incomplete data:
• Posterior distributions can become interdependent
• Consequence:
  - ML parameters cannot be computed separately for each multinomial
  - The posterior is not a product of independent posteriors

Learning Parameters from Incomplete Data (cont.)

• In the presence of incomplete data, the likelihood can have multiple
  global maxima

• Example: a hidden variable H with an observed child Y (H -> Y)
  - We can rename the values of the hidden variable H
  - If H has two values, the likelihood has two global maxima

• Similarly, local maxima are also replicated
• Many hidden variables => a serious problem

MLE from Incomplete Data

Finding the MLE parameters is a nonlinear optimization problem.

[Figure: sketch of the likelihood L(\theta | D) as a function of \theta.]

• Gradient Ascent: follow the gradient of the likelihood with respect to
  the parameters

MLE from Incomplete Data

Finding the MLE parameters is a nonlinear optimization problem.

[Figure: sketch of the likelihood L(\theta | D) as a function of \theta.]

• Expectation Maximization (EM):
  - Use the "current point" to construct an alternative function (which is "nice")
  - Guarantee: the maximum of the new function scores better than the current point

MLE from Incomplete Data

Both ideas:
• find local maxima only;
• require multiple restarts to find an approximation to the global maximum.

Gradient Ascent

• Main result

  Theorem GA:

    \frac{\partial \log P(D | \theta)}{\partial \theta_{x_i, pa_i}}
      = \frac{1}{\theta_{x_i, pa_i}} \sum_m P(x_i, pa_i | o[m], \theta)

• Requires computing P(x_i, pa_i | o[m], \theta) for all i, m
• Inference replaces taking derivatives.

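A rough Python sketch (not part of the slides) of one gradient-ascent step driven by Theorem GA. The function family_posterior is a stand-in for whatever inference routine computes P(x_i, pa_i | o[m], theta); it is assumed, not provided here:

```python
def gradient_step(theta, observations, family_posterior, step_size=0.01):
    """One ascent step on log P(D | theta) using Theorem GA:
    d log P(D|theta) / d theta_{x,pa} = (1/theta_{x,pa}) * sum_m P(x, pa | o[m], theta).

    theta:            dict mapping (i, x, pa) -> current parameter value
    observations:     list of (possibly partial) evidence dicts o[m]
    family_posterior: function (i, x, pa, o, theta) -> P(x, pa | o, theta),
                      i.e. the inference call required by the theorem
    """
    new_theta = {}
    for (i, x, pa), value in theta.items():
        grad = sum(family_posterior(i, x, pa, o, theta) for o in observations) / value
        new_theta[(i, x, pa)] = value + step_size * grad
    # NOTE: a real implementation must also re-project onto the simplex so that
    # the parameters of each conditional distribution still sum to one.
    return new_theta
```
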
Gradient Ascent (cont)

Proof:

    \frac{\partial \log P(D | \theta)}{\partial \theta_{x_i, pa_i}}
      = \sum_m \frac{\partial \log P(o[m] | \theta)}{\partial \theta_{x_i, pa_i}}
      = \sum_m \frac{1}{P(o[m] | \theta)} \frac{\partial P(o[m] | \theta)}{\partial \theta_{x_i, pa_i}}

How do we compute \frac{\partial P(o[m] | \theta)}{\partial \theta_{x_i, pa_i}} ?

Gradient Ascent (cont)

Since (writing the observations as o = (o^d, o^{nd}))

    P(o | \theta) = \sum_{x'_i, pa'_i} P(x'_i, pa'_i, o | \theta)
                  = \sum_{x'_i, pa'_i} P(o^d | x'_i, pa'_i, o^{nd}, \theta) \,
                                       P(x'_i | pa'_i, \theta) \, P(pa'_i, o^{nd} | \theta)

only the term with (x'_i, pa'_i) = (x_i, pa_i) depends on \theta_{x_i, pa_i},
and for that term \frac{\partial P(x'_i | pa'_i, \theta)}{\partial \theta_{x_i, pa_i}} = 1.
Hence

    \frac{\partial P(o | \theta)}{\partial \theta_{x_i, pa_i}}
      = P(o^d | x_i, pa_i, o^{nd}, \theta) \, P(pa_i, o^{nd} | \theta)
      = \frac{P(x_i, pa_i, o | \theta)}{\theta_{x_i, pa_i}}

Gradient Ascent (cont)

• Putting it all together we get

    \frac{\partial \log P(D | \theta)}{\partial \theta_{x_i, pa_i}}
      = \sum_m \frac{1}{P(o[m] | \theta)} \frac{\partial P(o[m] | \theta)}{\partial \theta_{x_i, pa_i}}
      = \sum_m \frac{1}{P(o[m] | \theta)} \cdot \frac{P(x_i, pa_i, o[m] | \theta)}{\theta_{x_i, pa_i}}
      = \sum_m \frac{P(x_i, pa_i | o[m], \theta)}{\theta_{x_i, pa_i}}
      = \frac{1}{\theta_{x_i, pa_i}} \sum_m P(x_i, pa_i | o[m], \theta)

Expectation Maximization (EM)

A general-purpose method for learning from incomplete data.

Intuition:
• If we had access to the counts, we could estimate the parameters
• However, missing values do not allow us to perform the counts
• "Complete" the counts using the current parameter assignment

Expectation Maximization (EM)

[Figure: the network X -> Y <- Z.]

Data over (X, Y, Z), "?" = missing:
  H  T  H
  H  T  T
  T  ?  T
  H  ?  ?
  H  T  T

Current model:
  P(Y=H | X=H, Z=T, \theta) = 0.3
  P(Y=H | X=T, Z=T, \theta) = 0.4

Expected counts N(X, Y):
  X  Y    #
  H  H   1.3
  T  H   0.4
  H  T   1.7
  T  T   1.6

(These numbers are placed for illustration; they have not been computed.)

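A minimal Python sketch (not part of the slides) of how such expected counts N(X, Y) are accumulated: a fully observed case contributes a count of 1, while a case with Y missing contributes P(Y = y | observed values, theta) to each possible completion. The cond_y argument is a stand-in for a real inference call, and the toy numbers below are assumptions, not the slide's values:

```python
from collections import defaultdict

def expected_counts_xy(cases, cond_y):
    """Accumulate expected counts N(X, Y) from (x, y, z) cases with possible
    missing values ('?').  cond_y(x, z) must return {y: P(Y=y | evidence, theta)},
    i.e. the posterior over the missing Y computed by inference."""
    counts = defaultdict(float)
    for x, y, z in cases:
        if y != "?":
            counts[(x, y)] += 1.0                  # observed case: count of 1
        else:
            for y_val, p in cond_y(x, z).items():  # missing Y: fractional counts
                counts[(x, y_val)] += p
    return counts

# Toy usage with a hand-specified posterior (illustrative only)
cases = [("H", "T", "H"), ("H", "T", "T"), ("T", "?", "T"), ("H", "?", "?"), ("H", "T", "T")]
posterior = lambda x, z: {"H": 0.3, "T": 0.7} if x == "H" else {"H": 0.4, "T": 0.6}
print(dict(expected_counts_xy(cases, posterior)))
```
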
EM (cont.)

[Figure: the EM loop for a network over X1, X2, X3, a hidden variable H, and
Y1, Y2, Y3 (X1, X2, X3 -> H; H -> Y1, Y2, Y3):]

  Initial network (G, \theta_0)  +  Training Data
          |
          v
  Expected Counts computation (E-Step):
      N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H)
          |
          v
  Reparameterize (M-Step)  ->  Updated network (G, \theta_1)
          |
          v
  Reiterate

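A schematic Python sketch of this loop (not part of the slides); e_step and m_step are hypothetical routines with the signatures described in the docstring:

```python
def em(structure, data, theta0, e_step, m_step, n_iter=20):
    """Iterate the EM loop sketched above.

    e_step(structure, data, theta) -> expected sufficient statistics
        (the expected counts N(x_i, pa_i) computed by inference, as in the E-Step box)
    m_step(expected_counts)        -> new parameters (relative frequencies of the
        expected counts, as in the reparameterization / M-Step box)
    """
    theta = theta0
    for _ in range(n_iter):
        expected_counts = e_step(structure, data, theta)    # E-Step
        theta = m_step(expected_counts)                      # M-Step
    return theta
```

A practical implementation would also track the log-likelihood across iterations and stop once it no longer improves, rather than always running a fixed number of iterations.
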
Expectation Maximization (EM)

• In practice, EM converges rather quickly at the start but converges
  slowly near the (possibly local) maximum.
• Hence, EM is often run for a few iterations and then Gradient Ascent
  steps are applied.

Final Homework

Question 1: Develop an algorithm that, given a pedigree as input, provides
the most probable haplotype of each individual in the pedigree.
Use the Bayesian network model of superlink to formulate the
problem exactly as a query. Specify the algorithm at length,
discussing as many details as you can. Analyze its efficiency.
Devote time to illuminating notation and presentation.

Question 2: Specialize the formula given in Theorem GA for θ in
genetic linkage analysis. In particular, assume exactly 3 loci:
Marker 1, Disease 2, Marker 3, with θ being the recombination
between loci 2 and 1 and 0.1 - θ being the recombination between
loci 3 and 2.
1. Specify the formula for a pedigree with two parents and two children.
2. Extend the formula to arbitrary pedigrees.
Note that θ is the same in many local probability tables.