Learning Bayesian networks
Slides by Nir Friedman
Learning Bayesian networks
[Figure: Data + Prior information → Inducer → network over E, B, R, A, C (E → R, E → A, B → A, A → C), with CPT for A:]

E   B     P(A | E,B)
e   b      .9    .1
e   ¬b     .7    .3
¬e  b      .8    .2
¬e  ¬b     .99   .01
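As a concrete sketch (my own encoding, not from the slides; the prior values for E, B, R, C are hypothetical placeholders), the network and CPT above might be represented as:

```python
# One plausible encoding of the alarm-style network above: each variable maps
# to (parents, CPT), where the CPT maps a parent assignment to P(var = True).
network = {
    "E": ((), {(): 0.01}),                        # hypothetical prior
    "B": ((), {(): 0.02}),                        # hypothetical prior
    "R": (("E",), {(True,): 0.9, (False,): 0.001}),  # hypothetical values
    "A": (("E", "B"), {(True, True): 0.9,         # P(a | e, b) from the slide
                       (True, False): 0.7,
                       (False, True): 0.8,
                       (False, False): 0.99}),
    "C": (("A",), {(True,): 0.8, (False,): 0.05}),   # hypothetical values
}
```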
Known Structure -- Incomplete Data
Data over E, B, A (with missing values):
<Y,N,N>
<Y,?,Y>
<N,N,Y>
<N,Y,?>
...
<?,Y,Y>

E   B     P(A | E,B)
e   b      ?     ?
e   ¬b     ?     ?
¬e  b      ?     ?
¬e  ¬b     ?     ?

→ Inducer →

[Network: E → A ← B]

E   B     P(A | E,B)
e   b      .9    .1
e   ¬b     .7    .3
¬e  b      .8    .2
¬e  ¬b     .99   .01

Network structure is specified.
Data contains missing values.
We consider assignments to missing values.
Known Structure / Complete Data
Given:
a network structure G,
and a choice of parametric family for P(Xi | Pai).
Learn:
parameters for the network from complete data.
Goal:
Construct a network that is "closest" to the probability distribution that generated the data.
Maximum Likelihood Estimation in Binomial Data

Applying the MLE principle we get

\[
\hat{\theta} = \frac{N_H}{N_H + N_T}
\]

(which coincides with what one would expect).

Example: \((N_H, N_T) = (3, 2)\); the MLE estimate is \(3/5 = 0.6\).

[Figure: likelihood \(L(\theta : D)\) plotted over \(\theta \in [0, 1]\)]
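A one-function sketch of this estimate (the function name is mine, not from the slides):

```python
# MLE for binomial data: L(theta : D) = theta**N_H * (1 - theta)**N_T
# is maximized at theta_hat = N_H / (N_H + N_T).

def binomial_mle(n_heads, n_tails):
    return n_heads / (n_heads + n_tails)

print(binomial_mle(3, 2))  # 0.6, matching the slide's example
```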
Learning Parameters for a Bayesian Network

Training data has the form:

D = ( E[1] B[1] A[1] C[1]
      ...
      E[M] B[M] A[M] C[M] )

[Network: E → A ← B, A → C]
Learning Parameters for a Bayesian Network

Since we assume i.i.d. samples, the likelihood function is

\[
L(\Theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \Theta)
\]

[Network: E → A ← B, A → C]
Learning Parameters for a Bayesian Network

By the definition of the network, we get

\[
L(\Theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \Theta)
= \prod_m P(E[m] : \Theta)\, P(B[m] : \Theta)\, P(A[m] \mid B[m], E[m] : \Theta)\, P(C[m] \mid A[m] : \Theta)
\]

[Network: E → A ← B, A → C; training data D as above]
Learning Parameters for a Bayesian Network

Rewriting terms, we get

\[
L(\Theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \Theta)
= \left[ \prod_m P(E[m] : \Theta) \right]
  \left[ \prod_m P(B[m] : \Theta) \right]
  \left[ \prod_m P(A[m] \mid B[m], E[m] : \Theta) \right]
  \left[ \prod_m P(C[m] \mid A[m] : \Theta) \right]
\]

[Network: E → A ← B, A → C; training data D as above]
General Bayesian Networks
Generalizing to any Bayesian network:

\[
L(\Theta : D) = \prod_m P(x_1[m], \ldots, x_n[m] : \Theta)
\qquad \text{(i.i.d. samples)}
\]
\[
= \prod_m \prod_i P(x_i[m] \mid \mathrm{Pa}_i[m] : \Theta_i)
\qquad \text{(network factorization)}
\]
\[
= \prod_i \prod_m P(x_i[m] \mid \mathrm{Pa}_i[m] : \Theta_i)
= \prod_i L_i(\Theta_i : D)
\]

The likelihood decomposes according to the structure of the network.
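As a minimal sketch (the data layout and names are my own, not from the slides), the decomposition means the complete-data log-likelihood is a sum of independent per-family terms:

```python
import math

# The log-likelihood of complete data decomposes into one term per family
# (X_i, Pa_i). 'data' is a list of dicts {var: value}; 'cpts' maps each
# variable to (parent_list, table), where table maps
# (value, parent_values_tuple) -> probability.

def log_likelihood(data, cpts):
    total = 0.0
    for var, (parents, table) in cpts.items():   # one term L_i per family
        for row in data:
            pa = tuple(row[p] for p in parents)
            total += math.log(table[(row[var], pa)])
    return total
```

Because the total is a sum over families, each \(L_i(\Theta_i : D)\) can be maximized on its own, which is exactly the decomposition discussed on the next slide.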
General Bayesian Networks (Cont.)
Complete data ⇒ decomposition ⇒ independent estimation problems.

If the parameters for each family are not related, then they can be estimated independently of each other.
(This does not hold in genetic linkage analysis.)
Learning Parameters: Summary
For multinomials we collect sufficient statistics, which are simply the counts N(xi, pai).

Parameter estimation:

MLE:
\[
\hat{\theta}_{x_i \mid \mathrm{pa}_i} = \frac{N(x_i, \mathrm{pa}_i)}{N(\mathrm{pa}_i)}
\]

Bayesian (Dirichlet prior):
\[
\tilde{\theta}_{x_i \mid \mathrm{pa}_i} = \frac{\alpha(x_i, \mathrm{pa}_i) + N(x_i, \mathrm{pa}_i)}{\alpha(\mathrm{pa}_i) + N(\mathrm{pa}_i)}
\]

Bayesian methods also require a choice of priors.
Both the MLE and Bayesian estimates are asymptotically equivalent and consistent.
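A small sketch of both estimators for a single family (the function and its data layout are my own illustration): plain counts give the MLE, and adding Dirichlet pseudo-counts gives the Bayesian estimate.

```python
from collections import Counter

# Estimate P(x | pa) for one family from complete data.
# 'samples' is a list of (x, pa) pairs; alpha > 0 adds symmetric Dirichlet
# pseudo-counts (alpha(x, pa) = alpha, so alpha(pa) = len(x_values) * alpha);
# alpha = 0 gives the plain MLE (assumes every pa value occurs in the data).

def estimate_cpt(samples, x_values, pa_values, alpha=0.0):
    n_joint = Counter(samples)                  # N(x, pa)
    n_pa = Counter(pa for _, pa in samples)     # N(pa)
    return {
        (x, pa): (alpha + n_joint[(x, pa)]) / (len(x_values) * alpha + n_pa[pa])
        for x in x_values for pa in pa_values
    }
```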
Known Structure -- Incomplete Data
Data over E, B, A (with missing values):
<Y,N,N>
<Y,?,Y>
<N,N,Y>
<N,Y,?>
...
<?,Y,Y>

E   B     P(A | E,B)
e   b      ?     ?
e   ¬b     ?     ?
¬e  b      ?     ?
¬e  ¬b     ?     ?

→ Inducer →

[Network: E → A ← B]

E   B     P(A | E,B)
e   b      .9    .1
e   ¬b     .7    .3
¬e  b      .8    .2
¬e  ¬b     .99   .01

Network structure is specified.
Data contains missing values.
We consider assignments to missing values.
Learning Parameters from Incomplete Data

[Figure: plate model with parameter nodes θ_X, θ_{Y|X=H}, θ_{Y|X=T} and variables X[m], Y[m] inside a plate over m]

Incomplete data:
Posterior distributions can become interdependent.
Consequence:
ML parameters cannot be computed separately for each multinomial.
The posterior is not a product of independent posteriors.
Learning Parameters from Incomplete Data (cont.)

In the presence of incomplete data, the likelihood can have multiple global maxima.

Example: [Network: H → Y, with H hidden]
We can rename the values of the hidden variable H.
If H has two values, the likelihood has two global maxima.
Similarly, local maxima are also replicated.
With many hidden variables, this becomes a serious problem.
MLE from Incomplete Data
Finding the MLE parameters is a nonlinear optimization problem.

[Figure: likelihood surface L(Θ | D)]

Gradient Ascent:
Follow the gradient of the likelihood w.r.t. the parameters.
MLE from Incomplete Data

Finding the MLE parameters is a nonlinear optimization problem.

[Figure: likelihood surface L(Θ | D)]

Expectation Maximization (EM):
Use the "current point" to construct an alternative function (which is "nice").
Guarantee: the maximum of the new function scores better than the current point.
MLE from Incomplete Data
Both ideas:
Find local maxima only.
Require multiple restarts to find an approximation to the global maximum.
Gradient Ascent
Main result (Theorem GA):

\[
\frac{\partial \log P(D \mid \Theta)}{\partial \theta_{x_i, \mathrm{pa}_i}}
= \frac{1}{\theta_{x_i, \mathrm{pa}_i}} \sum_m P(x_i, \mathrm{pa}_i \mid o[m], \Theta)
\]

Requires computing \(P(x_i, \mathrm{pa}_i \mid o[m], \Theta)\) for all \(i\), \(m\).
Inference replaces taking derivatives.
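As a sketch of how the theorem is used in code (not from the slides: `posterior` stands in for an inference routine such as variable elimination, and the parameter layout is invented):

```python
# Theorem GA as code: one gradient entry per CPT parameter.
# 'params' maps (var, value, pa) -> theta_{x_i, pa_i}, where pa is a tuple of
# (parent, value) pairs. 'posterior(params, event, evidence)' is a
# HYPOTHETICAL inference call returning P(event | evidence, Theta).

def log_likelihood_gradient(params, data, posterior):
    grad = {}
    for (var, value, pa), theta in params.items():
        event = {var: value, **dict(pa)}            # the event (x_i, pa_i)
        total = sum(posterior(params, event, o_m)   # P(x_i, pa_i | o[m], Theta)
                    for o_m in data)
        grad[(var, value, pa)] = total / theta      # (1/theta) * sum_m P(...)
    return grad
```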
Gradient Ascent (cont.)

Proof:

\[
\frac{\partial \log P(D \mid \Theta)}{\partial \theta_{x_i, \mathrm{pa}_i}}
= \sum_m \frac{\partial \log P(o[m] \mid \Theta)}{\partial \theta_{x_i, \mathrm{pa}_i}}
= \sum_m \frac{1}{P(o[m] \mid \Theta)}
  \frac{\partial P(o[m] \mid \Theta)}{\partial \theta_{x_i, \mathrm{pa}_i}}
\]

How do we compute \( \partial P(o[m] \mid \Theta) / \partial \theta_{x_i, \mathrm{pa}_i} \)?
Gradient Ascent (cont.)

Since

\[
P(o \mid \Theta) = \sum_{x'_i, \mathrm{pa}'_i} P(x'_i, \mathrm{pa}'_i, o \mid \Theta)
= \sum_{x'_i, \mathrm{pa}'_i} P(o^{d} \mid x'_i, \mathrm{pa}'_i, o^{nd}, \Theta)\,
  P(x'_i \mid \mathrm{pa}'_i, \Theta)\, P(\mathrm{pa}'_i, o^{nd} \mid \Theta)
\]

(where \(o^{d}\) and \(o^{nd}\) split the observations into descendants and non-descendants of \(X_i\)), only the factor \(P(x'_i \mid \mathrm{pa}'_i, \Theta) = \theta_{x'_i, \mathrm{pa}'_i}\) depends on \(\theta_{x_i, \mathrm{pa}_i}\); its derivative is 1 for the term with \((x'_i, \mathrm{pa}'_i) = (x_i, \mathrm{pa}_i)\) and 0 for all others. Therefore

\[
\frac{\partial P(o \mid \Theta)}{\partial \theta_{x_i, \mathrm{pa}_i}}
= P(o^{d} \mid x_i, \mathrm{pa}_i, o^{nd}, \Theta)\, P(\mathrm{pa}_i, o^{nd} \mid \Theta)
= \frac{P(x_i, \mathrm{pa}_i, o \mid \Theta)}{\theta_{x_i, \mathrm{pa}_i}}
\]
Gradient Ascent (cont.)

Putting it all together, we get

\[
\frac{\partial \log P(D \mid \Theta)}{\partial \theta_{x_i, \mathrm{pa}_i}}
= \sum_m \frac{1}{P(o[m] \mid \Theta)} \frac{\partial P(o[m] \mid \Theta)}{\partial \theta_{x_i, \mathrm{pa}_i}}
= \sum_m \frac{1}{P(o[m] \mid \Theta)} \frac{P(x_i, \mathrm{pa}_i, o[m] \mid \Theta)}{\theta_{x_i, \mathrm{pa}_i}}
= \frac{1}{\theta_{x_i, \mathrm{pa}_i}} \sum_m P(x_i, \mathrm{pa}_i \mid o[m], \Theta)
\]
Expectation Maximization (EM)
A general-purpose method for learning from incomplete data.

Intuition:
If we had access to the counts, we could estimate the parameters.
However, missing values do not allow us to perform such counts.
So we "complete" the counts using the current parameter assignment.
Expectation Maximization (EM)
[Network: X → Y ← Z]

Current model:
P(Y=H | X=H, Z=T, Θ) = 0.3
P(Y=H | X=T, Z=T, Θ) = 0.4

Data: [table of tuples over (X, Y, Z); some values of Y are missing]

Expected counts N(X, Y):

X  Y    #
H  H   1.3
T  H   0.4
H  T   1.7
T  T   1.6

(These numbers are placed for illustration; they have not been computed.)
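For concreteness, a sketch of the E-step arithmetic (the dataset below is invented, chosen so the fractional counts happen to land on the illustrative values above): a tuple with an observed Y contributes a full count, and a tuple with Y missing splits its count between Y=H and Y=T according to the current model.

```python
from collections import defaultdict

# E-step sketch: expected counts N(X, Y) when some Y values are missing ('?').
# 'p_y_h' maps (x, z) -> P(Y=H | x, z, Theta) under the current parameters;
# the dataset is made up for illustration.

p_y_h = {("H", "T"): 0.3, ("T", "T"): 0.4}
data = [("H", "T", "H"), ("H", "H", "T"), ("T", "?", "T"),
        ("H", "?", "T"), ("T", "T", "T")]

expected = defaultdict(float)
for x, y, z in data:
    if y == "?":  # spread one count over both completions of Y
        expected[(x, "H")] += p_y_h[(x, z)]
        expected[(x, "T")] += 1.0 - p_y_h[(x, z)]
    else:
        expected[(x, y)] += 1.0

print(dict(expected))  # {(H,T): 1.7, (H,H): 1.3, (T,H): 0.4, (T,T): 1.6}
```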
EM (cont.)
Reiterate:

Initial network (G, Θ0) + training data
  → Expected counts computation (E-step):
    N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H)
  → Reparameterize (M-step)
  → Updated network (G, Θ1)

[Network: X1, X2, X3 → H; H → Y1, Y2, Y3]
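A complete, minimal EM loop in the same spirit (my own illustration, not from the slides: a two-node network X → Y with Y sometimes missing; the data and initialization are invented):

```python
from collections import defaultdict

# EM for theta[x] = P(Y=H | X=x), where X is always observed and '?'
# marks a missing Y. Each iteration: E-step computes expected counts
# N(X, Y); M-step renormalizes them into new parameters.

data = [("H", "H"), ("H", "H"), ("H", "T"), ("H", "?"),
        ("T", "T"), ("T", "?"), ("T", "H")]
theta = {"H": 0.5, "T": 0.5}                    # initial parameters (G, theta_0)

for _ in range(50):                             # reiterate
    counts = defaultdict(float)
    for x, y in data:                           # E-step: expected counts N(X, Y)
        if y == "?":
            counts[(x, "H")] += theta[x]
            counts[(x, "T")] += 1.0 - theta[x]
        else:
            counts[(x, y)] += 1.0
    theta = {x: counts[(x, "H")] / (counts[(x, "H")] + counts[(x, "T")])
             for x in ("H", "T")}               # M-step: reparameterize

print(theta)  # converges to {"H": 2/3, "T": 1/2} on this toy data
```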
Expectation Maximization (EM)
In practice, EM converges rather quickly at the start but slowly near the (possibly local) maximum.
Hence, EM is often run for a few iterations, after which Gradient Ascent steps are applied.
Final Homework
Question 1: Develop an algorithm that, given a pedigree as input, provides the most probable haplotype of each individual in the pedigree. Use the Bayesian network model of Superlink to formulate the problem exactly as a query. Specify the algorithm at length, discussing as many details as you can. Analyze its efficiency. Devote time to illuminating notation and presentation.

Question 2: Specialize the formula given in Theorem GA for θ in genetic linkage analysis. In particular, assume exactly 3 loci: Marker 1, Disease 2, Marker 3, with θ being the recombination fraction between loci 1 and 2, and 0.1 − θ being the recombination fraction between loci 2 and 3.
1. Specify the formula for a pedigree with two parents and two children.
2. Extend the formula to arbitrary pedigrees.
Note that θ is the same in many local probability tables.