Rich Probabilistic Models for Genomic Data


Maximum Likelihood Estimation
& Expectation Maximization
Lecture 3 – Oct 5, 2011
CSE 527 Computational Biology, Fall 2011
Instructor: Su-In Lee
TA: Christopher Miles
Monday & Wednesday 12:00-1:20
Johnson Hall (JHN) 022
Outline

  Probabilistic models in biology
  Model selection problem
  Mathematical foundations
  Bayesian networks
  Learning from data
    Maximum likelihood estimation
    Maximum a posteriori (MAP)
    Expectation-maximization (EM)
Parameter Estimation

  Assumptions
    Fixed network structure
    Fully observed instances of the network variables: D = {d[1],…,d[M]}
      For example, d[m] = {i0, d1, g1, l0, s0}
  Estimate the "parameters" of the Bayesian network by maximum likelihood estimation (MLE)!
  (Student-network figure from Koller & Friedman)
The Thumbtack example

  Parameter learning for a single variable.
  Variable
    X: the outcome of a thumbtack toss
    Val(X) = {head, tail}
  Data
    A set of thumbtack tosses: x[1],…,x[M]
Maximum likelihood estimation

  Say that P(X=head) = Θ, P(X=tail) = 1-Θ
  Definition: the likelihood function
    L(Θ : D) = P(D; Θ) = P(HHTTHHH… <Mh heads, Mt tails>; Θ) = Θ^Mh (1-Θ)^Mt  (assuming independent tosses)
  Maximum likelihood estimation (MLE)
    Given data D = HHTTHHH… <Mh heads, Mt tails>, find the Θ that maximizes the likelihood function L(Θ : D).
Likelihood function

  [Figure: plot of the likelihood function L(Θ : D) versus Θ.]
MLE for the Thumbtack problem

  Given data D = HHTTHHH… <Mh heads, Mt tails>,
  the MLE solution is Θ* = Mh / (Mh + Mt).
  Proof: sketched below.
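A sketch of the standard argument, maximizing the log-likelihood (which has the same maximizer as the likelihood):

```latex
\log L(\Theta : D) = M_h \log \Theta + M_t \log (1 - \Theta)

\frac{d}{d\Theta} \log L(\Theta : D) = \frac{M_h}{\Theta} - \frac{M_t}{1 - \Theta} = 0
\;\Longrightarrow\;
M_h (1 - \Theta) = M_t \Theta
\;\Longrightarrow\;
\Theta^* = \frac{M_h}{M_h + M_t}
```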
Continuous space

  Assuming sample x1, x2, …, xn is from a parametric distribution f(x | Θ), estimate Θ.
  Say that the n samples are from a normal distribution with mean μ and variance σ².
Continuous space (cont.)

  Let Θ1 = μ, Θ2 = σ².

  L(Θ1, Θ2 : x1, x2, …, xn) = ∏_{i=1}^{n} (1/√(2πΘ2)) exp( -(xi - Θ1)² / (2Θ2) )

  log L(Θ1, Θ2 : x1, x2, …, xn) = -(n/2) log(2πΘ2) - ∑_{i=1}^{n} (xi - Θ1)² / (2Θ2)

  ∂/∂Θ1 log L(Θ1, Θ2 : x1, x2, …, xn) = ∑_i (xi - Θ1) / Θ2 = 0   ⇒   Θ̂1 = (1/n) ∑_i xi

  ∂/∂Θ2 log L(Θ1, Θ2 : x1, x2, …, xn) = -n/(2Θ2) + ∑_i (xi - Θ1)² / (2Θ2²) = 0   ⇒   Θ̂2 = (1/n) ∑_i (xi - Θ̂1)²
Any Drawback?

  Is it biased?
  Is it? Yes. As an extreme case, when n = 1, Θ̂2 = 0.
  The MLE Θ̂2 systematically underestimates Θ2.
  Why? A bit harder to see, but think about n = 2. Then Θ̂1 is exactly between the two sample points, the position that exactly minimizes the expression for Θ̂2. Any other choice of Θ1, Θ2 makes the likelihood of the observed data slightly lower. But it's actually pretty unlikely that two sample points would be chosen exactly equidistant from, and on opposite sides of, the mean, so the MLE Θ̂2 systematically underestimates Θ2.
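To make the bias concrete, here is a minimal illustrative Python simulation (all names in it are mine): it averages the MLE variance estimate over many samples of size n. In expectation the MLE equals ((n-1)/n)·σ², so for n = 2 it comes out near half the true variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def mle_variance(x):
    """MLE of the variance: mean squared deviation from the sample mean (divides by n, not n-1)."""
    return np.mean((x - np.mean(x)) ** 2)

n, trials = 2, 100_000
true_mu, true_sigma2 = 0.0, 1.0

# Average the MLE variance estimate over many independent samples of size n.
estimates = [mle_variance(rng.normal(true_mu, np.sqrt(true_sigma2), size=n))
             for _ in range(trials)]

print("average MLE variance:", np.mean(estimates))   # ~ (n-1)/n * sigma^2 = 0.5 here
print("true variance       :", true_sigma2)
```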
Maximum a posteriori

  Incorporating priors. How?
  MLE vs. MAP estimation
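As a concrete illustration for the thumbtack setting (a standard example; the Beta(α_h, α_t) prior and its hyperparameters are assumptions of this sketch, not taken from the slides), MAP maximizes the posterior rather than the likelihood:

```latex
\Theta_{\text{MLE}} = \arg\max_{\Theta} P(D \mid \Theta)
\qquad
\Theta_{\text{MAP}} = \arg\max_{\Theta} P(\Theta \mid D) = \arg\max_{\Theta} P(D \mid \Theta)\,P(\Theta)

% With a Beta(\alpha_h, \alpha_t) prior on \Theta and data with M_h heads, M_t tails:
\Theta_{\text{MAP}} = \frac{M_h + \alpha_h - 1}{M_h + M_t + \alpha_h + \alpha_t - 2}
```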
MLE for general problems

  Learning problem setting
    A set of random variables X from an unknown distribution P*
    Training data D = M instances of X: {d[1],…,d[M]}
    A parametric model P(X; Θ) (a 'legal' distribution)
  Define the likelihood function (assuming i.i.d. instances):
    L(Θ : D) = P(D; Θ) = ∏_{m=1}^{M} P(d[m]; Θ)
  Maximum likelihood estimation
    Choose parameters Θ* that satisfy: Θ* = argmax_Θ L(Θ : D)
MLE for Bayesian networks

  Structure G (example: x1 → x3, x2 → x3, x1 → x4, x3 → x4)
    P_G = P(x1,x2,x3,x4) = P(x1) P(x2) P(x3|x1,x2) P(x4|x1,x3)
    More generally, P_G = ∏_i P(xi | pai)
  Parameters θ
    Θx1, Θx2, Θx3|x1,x2, Θx4|x1,x3 (more generally, Θxi|pai)
  Given D: x[1],…,x[m],…,x[M], where x[m] = (x1[m],x2[m],x3[m],x4[m]), estimate θ.

  Likelihood decomposition:
    L(θ : D) = ∏_m P(x[m]; θ) = ∏_m ∏_i P(xi[m] | pai[m]; θ) = ∏_i Li(Θxi|pai : D)
  The local likelihood function for Xi is:
    Li(Θxi|pai : D) = ∏_m P(xi[m] | pai[m]; Θxi|pai)
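A minimal Python sketch (my own illustration, using variable names x1…x4 to match the example structure above) of how, with table CPDs, each local MLE reduces to normalized counts, θ̂_{x|pa} = count(x, pa) / count(pa):

```python
from collections import Counter, defaultdict

# Each fully observed instance is a dict of values, e.g. {"x1": 0, "x2": 1, "x3": 1, "x4": 0}.
# parents maps each variable to its parent list (the structure G above).
parents = {"x1": [], "x2": [], "x3": ["x1", "x2"], "x4": ["x1", "x3"]}

def mle_table_cpds(data, parents):
    """MLE for table CPDs: theta_{x|pa} = count(x, pa) / count(pa)."""
    joint_counts = {v: Counter() for v in parents}    # counts of (parent assignment, value)
    parent_counts = {v: Counter() for v in parents}   # counts of parent assignment alone
    for d in data:
        for v, pas in parents.items():
            pa_val = tuple(d[p] for p in pas)
            joint_counts[v][(pa_val, d[v])] += 1
            parent_counts[v][pa_val] += 1
    cpds = defaultdict(dict)
    for v in parents:
        for (pa_val, x_val), c in joint_counts[v].items():
            cpds[v][(pa_val, x_val)] = c / parent_counts[v][pa_val]
    return cpds

# Example: estimate parameters from a few fully observed instances.
data = [
    {"x1": 0, "x2": 1, "x3": 1, "x4": 0},
    {"x1": 0, "x2": 0, "x3": 0, "x4": 0},
    {"x1": 1, "x2": 1, "x3": 1, "x4": 1},
]
print(mle_table_cpds(data, parents))
```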
Bayesian network with table CPDs

  The Thumbtack example vs. the Student example (Intelligence, Difficulty, Grade):

  Variables:            X                                  vs  I, D, G
  Joint distribution:   P(X)                               vs  P(I,D,G) = P(I) P(D) P(G|I,D)
  Parameters:           θ                                  vs  θI, θD, θG|I,D
  Data:                 D: {H, …, x[m], …, T}              vs  D: {(i[1],d[1],g[1]), …, (i[m],d[m],g[m]), …}
  Likelihood function:  L(θ:D) = P(D;θ) = θ^Mh (1-θ)^Mt
  MLE solution:         θ̂ = Mh / (Mh + Mt)
Maximum Likelihood Estimation Review

  Find parameter estimates which make the observed data most likely
  General approach, as long as a tractable likelihood function exists
  Can use all available information
Example – Gene Expression

  "Coding" regions: instructions for making the proteins
  "Regulatory" regions (regulons): instructions for when and where to make them
    Regulatory regions contain "binding sites" (6-20 bp).
    "Binding sites" attract a special class of proteins, known as "transcription factors".
    Bound transcription factors can initiate transcription (making RNA).
    Proteins that inhibit transcription can also be bound to their binding sites.
  Questions:
    What turns genes on (producing a protein) and off?
    When is a gene turned on or off?
    Where (in which cells) is a gene turned on?
    How many copies of the gene product are produced?
Regulation of Genes

  [Sequence of figures (source: M. Tompa, U. of Washington): a transcription factor (protein) binds a regulatory element (binding sites, e.g. CG..A, AC..T) on the DNA near a gene; RNA polymerase (protein) is recruited; the gene is transcribed; a new protein is made.]
The Gene regulation example

  What determines the expression level of a gene?
  What are the observed and hidden variables?
    e.G, e.TFs: observed; Process.G: hidden variable we want to infer!

  [Network figure: e.TF1, e.TF2, e.TF3, e.TF4, …, e.TFN (expression level of each TF), Process.G (the biological process the gene is involved in, e.g. = p321), and e.G (expression level of the gene).]
The Gene regulation example

  What determines the expression level of a gene?
  What are the observed and hidden variables?
    e.G, e.TFs: observed; Process.G: hidden variable we want to infer!
    How about the BS.G's? How deterministic is the sequence of a binding site? How much do we know?

  [Network figure: as above, plus BS1.G, …, BSN.G (whether the gene has TF1's, …, TFN's binding site, e.g. = Yes).]
Not all data are perfect

  Most MLE problems are simple to solve with complete data.
  Available data are "incomplete" in some way.
Outline

  Learning from data
    Maximum likelihood estimation (MLE)
    Maximum a posteriori (MAP)
    Expectation-maximization (EM) algorithm
Continuous space revisited

  Assuming sample x1, x2, …, xn is from a mixture of parametric distributions, estimate the parameters.

  [Figure: points x1, x2, …, xm and xm+1, …, xn shown along the x axis, one group per mixture component.]
A real example

  CpG content of human gene promoters

  [Figure: distribution of GC frequency across human promoters, showing two distinct classes.]
  "A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters," Saxonov, Berg, and Brutlag, PNAS 2006;103:1412-1417
Mixture of Gaussians

  Parameters θ
    means μ1, μ2
    variances σ1², σ2²
    mixing parameters τ1, τ2 = 1 - τ1
  P.D.F.
    f(x) = τ1 f1(x) + τ2 f2(x), where fj(x) = N(x; μj, σj²)
  Likelihood function
    L(μ1, μ2, σ1², σ2², τ1, τ2 : x1, …, xn) = ∏_{i=1}^{n} [ τ1 f1(xi) + τ2 f2(xi) ]
Apply MLE?

  L(μ1, μ2, σ1², σ2², τ1, τ2 : x1, …, xn) = ∏_{i=1}^{n} [ τ1 f1(xi) + τ2 f2(xi) ]

  No closed-form solution is known for finding the θ that maximizes L.
  However, what if we knew the hidden data?
EM as Chicken vs Egg

  IF zij known (zij = 1 if xi was drawn from component j, else 0), could estimate parameters θ
    e.g., only points in cluster 2 influence μ2, σ2.
  IF parameters θ known, could estimate zij
    e.g., if |xi - μ1|/σ1 << |xi - μ2|/σ2, then zi1 >> zi2
  BUT we know neither; (optimistically) iterate:
    E-step: calculate expected zij, given parameters
    M-step: do "MLE" for parameters (μ, σ), given E(zij)
  Overall, a clever "hill-climbing" strategy
  Convergence provable? YES
"Classification EM"

  If zij < 0.5, pretend it's 0; if zij > 0.5, pretend it's 1
    i.e., classify points as component 0 or 1
  Now recalculate θ, assuming that partition
  Then recalculate zij, assuming that θ
  Then recalculate θ, assuming the new zij, etc., etc.
Full EM

  The xi's are known; Θ is unknown. The goal is to find the MLE Θ of
    L(Θ : x1,…,xn)                     (the likelihood with the z's hidden)
  This would be easy if the zij's were known, i.e., consider
    L(Θ : x1,…,xn, z11,z12,…,zn2)      (the complete-data likelihood)
  But the zij's are not known.
  Instead, maximize the expected likelihood of the observed data,
    E[ L(Θ : x1,…,xn, z11,z12,…,zn2) ],
  where the expectation is over the distribution of the hidden data (the zij's).
The E-step

  Find E(zij), i.e., P(zij = 1)
  Assume θ is known and fixed. Let
    A: the event that xi was drawn from f1
    B: the event that xi was drawn from f2
    D: the observed datum xi
  Then the expected value of zi1 is P(A|D):
    P(A|D) = P(D|A) P(A) / P(D) = τ1 f1(xi) / ( τ1 f1(xi) + τ2 f2(xi) )
Complete data likelihood

  Recall: xi is drawn from f1 if zi1 = 1, and from f2 if zi2 = 1;
  so, correspondingly, the complete-data likelihood is a product of "if"-style case splits over the data points.
  Formulas with "if"s are messy; can we blend more smoothly?
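One standard way to blend, using the two-component notation above: raise each case to the power of its 0/1 indicator, which removes the "if" and makes the log of the complete-data likelihood linear in the z_ij:

```latex
P(x_i, z_i) \;=\; \bigl[\tau_1 f_1(x_i)\bigr]^{z_{i1}} \,\bigl[\tau_2 f_2(x_i)\bigr]^{z_{i2}},
\qquad z_{i1}+z_{i2}=1,\; z_{ij}\in\{0,1\}

\log L(\Theta : x, z) \;=\; \sum_i \sum_{j=1}^{2} z_{ij}\,\bigl[\log \tau_j + \log f_j(x_i)\bigr]
```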
M-step

  Find θ maximizing E[ log(Likelihood) ]
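For the two-component Gaussian mixture, maximizing E[ log(Likelihood) ] gives the familiar weighted-average updates (standard results, stated here for reference, with E[z_ij] taken from the E-step):

```latex
\hat{\mu}_j = \frac{\sum_i E[z_{ij}]\, x_i}{\sum_i E[z_{ij}]},
\qquad
\hat{\sigma}_j^2 = \frac{\sum_i E[z_{ij}]\,(x_i - \hat{\mu}_j)^2}{\sum_i E[z_{ij}]},
\qquad
\hat{\tau}_j = \frac{1}{n}\sum_i E[z_{ij}]
```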
EM summary

  Fundamentally an MLE problem
  Useful if analysis is more tractable when the 0/1 hidden data z are known
  Iterate:
    E-step: estimate E(z) for each z, given θ
    M-step: estimate θ maximizing E(log likelihood), given E(z)
      where "E(log L)" is with respect to the random z ~ E(z) = p(z=1)
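Putting the two steps together, here is a minimal, self-contained Python sketch of EM for a two-component Gaussian mixture (illustrative code; the initialization and fixed iteration count are simplistic choices of mine):

```python
import numpy as np

def normal_pdf(x, mu, var):
    """Gaussian density N(x; mu, var)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_two_gaussians(x, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture.

    Returns (mu, var, tau): component means, variances, and mixing weights.
    """
    # Crude initialization: split the sorted data in half.
    xs = np.sort(x)
    mu = np.array([xs[: len(x) // 2].mean(), xs[len(x) // 2:].mean()])
    var = np.array([x.var(), x.var()])
    tau = np.array([0.5, 0.5])

    for _ in range(n_iter):
        # E-step: E[z_ij] = P(component j | x_i), by Bayes' rule.
        weighted = np.stack([tau[j] * normal_pdf(x, mu[j], var[j]) for j in range(2)], axis=1)
        ez = weighted / weighted.sum(axis=1, keepdims=True)

        # M-step: weighted MLE of the means, variances, and mixing weights.
        nj = ez.sum(axis=0)
        mu = (ez * x[:, None]).sum(axis=0) / nj
        var = (ez * (x[:, None] - mu) ** 2).sum(axis=0) / nj
        tau = nj / len(x)
    return mu, var, tau

# Example: data drawn from two Gaussians, parameters then recovered by EM.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_two_gaussians(data))
```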
EM Issues

  EM is guaranteed to increase the likelihood with every E-M iteration, hence it will converge.
  But it may converge to a local, not global, maximum.
  This issue is intrinsic (probably), since EM is often applied to NP-hard problems (including clustering, above, and motif discovery, soon).
  Nevertheless, it is widely used and often effective.
Acknowledgement

  Profs. Daphne Koller & Nir Friedman, "Probabilistic Graphical Models"
  Prof. Larry Ruzzo, CSE 527, Autumn 2009
  Prof. Andrew Ng, ML lecture notes