Computational Statistics with
Application to Bioinformatics
Prof. William H. Press
Spring Term, 2008
The University of Texas at Austin
Unit 16: Hidden Markov Models
Unit 16: Hidden Markov Models (Summary)
• Markov models
  – discrete states, discrete time steps
  – transition between states at each time by specified probabilities
  – how to find periodicity or reducibility
    • take high power of the matrix by successive squares method
  – “irreducibility and aperiodicity imply ergodicity”
• Hidden Markov Models
  – we don’t observe the states, but instead “symbols” from them
    • which have a probabilistic distribution for each state
  – forward-backward algorithm estimates the states from the data
    • forward (backward) pass incorporates past (future) data at each time
  – example: gene finding in the galaxy Zyzyx
    • (fewer irrelevant complications than in our galaxy)
    • use NR3’s HMM class
• Baum-Welch re-estimation
  – uses the data to improve the estimate of the transition and symbol probabilities
  – it’s an EM method!
  – pure “unsupervised learning”
  – we try it on the Zyzyx data
    • can find the right answers starting from amazingly crude guesses!
• Hidden Semi-Markov Models [aka Generalized HMMs (GHMMs)]
  – give each state a specified residence time distribution
  – can be implemented by expanding the number of states in an HMM
  – we try it on Zyzyx and get somewhat better gene finding
Markov Models
• Directed graph
  – may (usually does) have loops
• Discrete time steps
• Each time step, state advances
  – with probabilities labeled on outgoing edges
  – self-loops also OK
• “Markov” because no memory
  – knows only what state it is in now
• Markov models are especially important because there exists a fast algorithm for “parsing” their state from observed data
  – so-called “Hidden Markov Models” (HMMs)
A (right) stochastic matrix has non-negative entries with rows summing to 1:

$$A_{ij} = P(\text{transition from } i \text{ to } j), \qquad \sum_j A_{ij} = 1$$

The population vector advances by the transpose, because of the way “from/to” are defined:

$$s_{t+1} = A^T s_t$$
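As a rough sketch (not NR3 library code; it assumes NR3’s MatDoub and VecDoub types used elsewhere in this unit), one population-update step looks like this:

// One step of s_{t+1} = A^T s_t, where a[i][j] = P(transition from i to j)
// and each row of a sums to 1.
VecDoub markovstep(const MatDoub &a, const VecDoub &s) {
    Int n = a.nrows();
    VecDoub snew(n, 0.);
    for (Int j=0; j<n; j++)
        for (Int i=0; i<n; i++)
            snew[j] += a[i][j]*s[i];   // transpose: sum over the "from" index i
    return snew;
}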
Note the two different ways of drawing the same Markov model:
directed graph with loops
adding the time dimension,
directed graph, no loops (can’t
go backward in time)
Every Markov model has at least one stationary (equilibrium) state
$$A^T s = s, \qquad\text{i.e.}\qquad (A^T - \mathbf{1})\, s = 0$$

Is there a nullspace? Yes, because the columns (of $A^T - \mathbf{1}$) all sum to zero, hence are linearly dependent.

But how do we know that this s has nonnegative components?

Define a vector $s^+$ with components $s^+_i \equiv |s_i|$. Then, for each $j$,

$$\sum_i A_{ij}\, s^+_i = \sum_i |A_{ij}|\,|s_i| \;\ge\; \Bigl|\sum_i A_{ij}\, s_i\Bigr| = |s_j| = s^+_j$$

Summing over $j$, and using $\sum_j A_{ij} = 1$,

$$\sum_j \sum_i A_{ij}\, s^+_i = \sum_i s^+_i \;\ge\; \sum_j s^+_j$$

so the $\ge$ must be an $=$ in the sum, and hence also for each term, implying

$$\sum_i A_{ij}\, s^+_i = s^+_j$$

Therefore s+ is the desired (positive) stationary state, qed.
(This is actually a simple special case of the Perron-Frobenius theorem.)
Does every Markov model eventually reach a unique equilibrium from all
starting distributions (ergodicity)?
Not necessarily. Two things can go wrong:
1. More than one equilibrium distribution. (Fails test of “irreducibility”.)
2. Limit cycles. (Fails test of “aperiodicity”.)
The theorem is: Irreducibility and aperiodicity imply ergodicity.
Easy to diagnose a particular Markov model numerically by taking a high power of its transition matrix (by successive squaring).
Say, take A^T to the power 2^32, which requires 2 × 32 M^3 operations (just 32 successive squarings).
If the columns of the result are all identical, then it’s ergodic. (Done.)
Otherwise:
1. Zero rows are states that become unpopulated. Ignore their corresponding
columns.
2. See if remaining columns are self-reproducing under A^T (eigenvectors of unit eigenvalue). When yes, they are equilibria.
3. When no, they are part of limit cycles.
So, this example has an unpopulated state
and no equilibria.
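As a rough sketch of the successive-squaring diagnostic (again assuming NR3’s MatDoub; this is not code from the HMM class):

// Multiply two square matrices.
MatDoub matmul(const MatDoub &x, const MatDoub &y) {
    Int n = x.nrows();
    MatDoub z(n, n, 0.);
    for (Int i=0; i<n; i++)
        for (Int k=0; k<n; k++)
            for (Int j=0; j<n; j++)
                z[i][j] += x[i][k]*y[k][j];
    return z;
}
// Square nsq times: 32 squarings raise A^T to the power 2^32.
MatDoub highpower(MatDoub at, Int nsq) {
    while (nsq-- > 0) at = matmul(at, at);
    return at;   // if all columns of the result are (nearly) identical, the model is ergodic
}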
In a Hidden Markov Model, we don’t get to observe the states, but instead we
see a “symbol” that each state probabilistically emits when it is entered
(Diagram: a sequence of (hidden) states, each emitting one of a sequence of (observed) symbols.)
What can we say about the sequence of states, given a sequence of observations?
The Forward-Backward algorithm.
Let’s try to estimate the probability of being in a certain state at a certain time
Define the “forward estimate” α_t(i) as the probability of state i at time t given (only) the data up to and including t. Written out directly, it is a huge sum over all possible paths ending in state i, each term being the likelihood (or Bayes probability with uniform prior) of that exact path and the exact observed data.
As written, this is computationally infeasible.
But it satisfies an easy recurrence!
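In standard HMM notation (a sketch of the formulas described above; the slides’ own symbols may differ slightly), with A_ij the transition probabilities, b_i(y) the symbol probabilities, and π_i the (uniform) prior on the initial state:

$$\alpha_t(i) = P(y_0,\ldots,y_t,\ S_t = i), \qquad\text{so the forward estimate is } \alpha_t(i)\Big/\sum_j \alpha_t(j)$$

$$\alpha_0(i) = \pi_i\, b_i(y_0), \qquad \alpha_{t+1}(j) = \Bigl[\sum_i \alpha_t(i)\, A_{ij}\Bigr]\, b_j(y_{t+1})$$

Each step costs only O(M^2) operations for M states, instead of a sum over all M^t paths.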
Define the “backward estimate” β_t(i) as the probability of state i at time t given (only) the data to the future of t. At the last time there is no data to the future of N−1, so start from a uniform prior.
Now, there is a backward recurrence!
And the grand estimate using all the data (the “forward-backward algorithm”) combines the two; its normalizing denominator is the likelihood (or Bayes probability) of the data. Actually, it’s independent of t!
You could use its numerical value to compare different models.
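Again as a sketch in the same notation:

$$\beta_t(i) = P(y_{t+1},\ldots,y_{N-1} \mid S_t = i), \qquad \beta_{N-1}(i) = 1$$

$$\beta_t(i) = \sum_j A_{ij}\, b_j(y_{t+1})\, \beta_{t+1}(j)$$

$$P_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{L}, \qquad L = \sum_i \alpha_t(i)\,\beta_t(i) \quad\text{(the same value for every } t)$$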
Worried about multiplying the α’s and β’s as independent probabilities?
Markov guarantees that they are conditionally independent given i, and P_t(i) ∝ P_t(data | i).
Let’s work a biologically motivated example
In the galaxy Zyzyx, the Qiqiqi lifeform has a genome
consisting of a linear sequence of amino acids, each
chosen from 26 chemical possibilities, denoted A-Z.
Genes alternate with intergenic regions.
In intergenic regions, all A-Z are equiprobable.
In genes, the vowels AEIOU are more frequent.
Genes always end with Z.
The length distribution of genes and intergenic
regions is known (has been measured).
On Earth, it’s 20 amino acids,
with the additional
complication of a genetic code
mapping codons of three base-4
letters (ACGT) into one a.a. Our
example thus simplifies by
having no ambiguity on
reading frame, and also no
ambiguity of strand.
Can we find the genes? (On the slide, the gene and intergene stretches of the sequence below are highlighted.)
qnkekxkdlscovjfehvesmdeelnzlzjeknvjgetyuhgvxlvjnvqlmcermojkrtuczg
rbmpwrjtynonxveblrjuqiydehpzujdogaensduoermiadaaustihpialkxicilgk
tottxxwawjvenowzsuacnppiharwpqviuammkpzwwjboofvmrjwrtmzmcxdkclvky
vkizmckmpvwfoorbvvrnvuzfwszqithlkubjruoyyxgwvfgxzlzbkuwmkmzgmnsyb
(pvowel = 0.45)
The model (three states: intergene, gene, and gene-end Z):
The code looks like this (NR3 C++ fragment)
Int i,j,mstat=3;                          // states: 0 = intergene (I), 1 = gene (G), 2 = gene-end (Z)
MatDoub b(mstat,26,0.), a(mstat,mstat,0.);
a[0][0] = 1.-1./250.;                     // intergene self-loop: mean length 250
a[0][1] = 1.-a[0][0];                     // intergene -> gene
a[1][1] = 1.-1./50.;                      // gene self-loop: mean length 50
a[1][2] = 1.-a[1][1];                     // gene -> gene-end
a[2][0] = 1.;                             // gene-end -> intergene, always
for (i=0;i<26;i++) b[0][i] = 1./26.;            // intergene: all symbols equiprobable
for (i=0;i<5;i++) b[1][i] = pvowel/5.;          // gene: vowels AEIOU share pvowel (0.45 here, set in the driver)
for (i=5;i<26;i++) b[1][i] = (1.-pvowel)/21.;   // gene: consonants share the rest
b[2][25] = 1.;                                  // gene-end always emits Z
HMM hmm(a,b,seq);                         // seq is the observed symbol sequence, coded as integers 0..25
hmm.forwardbackward();
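A rough usage sketch (with one assumption: that hmm.pstate, which the next slide returns to Matlab, holds the posterior state probabilities indexed as [position][state], as the pstate(:,2) and pstate(:,3) plots there suggest):

// Call a position "gene" when states G (column 1) and gene-end Z (column 2)
// together have posterior probability > 0.5.  The column indexing is an assumption.
for (Int t=0; t<hmm.pstate.nrows(); t++) {
    bool ingene = (hmm.pstate[t][1] + hmm.pstate[t][2]) > 0.5;
    // ... record or plot the call for position t ...
}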
Embed in a mex-file and return hmm.pstate to Matlab.
Matlab has its own HMM functions, but I haven’t mastered them. (Could someone do this and
show this example?)
[pstate pcorrect loglike merit] = hmmmex(0.45,1);
plot(1:260,pstate(1:260,2),'b')
hold on
plot(1:260,pstate(1:260,3),'r')
The forward-backward results on the previous data are:
All the C++ code for this
example is on the course
website as “hmmmex.cpp” –
but beware, it’s not cleaned
up or made pretty!
[Plot: posterior probabilities of state G (blue) and state Z (red) along the first 260 positions. Callouts mark the actual gene start, the right “z” at the gene end, another “z”, and enough chance excess vowels to make the call not completely sure!]
Bayesian re-estimation of the transition and output matrices
(Baum-Welch re-estimation)
Given the data, we can re-estimate A as follows. (In the derivation, the backward recurrence is what lets the sum over next states collapse back to β_t(i), so the two expressions are equal.) Estimating as an average over the data,

    Â_ij ≈ (number of i → j transitions) / (number of i states)    (note that L cancels)
Similarly, re-estimate b:

    b̂_i(k) ≈ (number of i states emitting k) / (number of i states)
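In standard notation (a sketch of the updates just described in words, using the α’s and β’s from the forward-backward slides), the re-estimates are

$$\hat A_{ij} = \frac{\sum_t \alpha_t(i)\, A_{ij}\, b_j(y_{t+1})\, \beta_{t+1}(j)}{\sum_t \alpha_t(i)\, \beta_t(i)}
\qquad\text{(expected number of } i \to j \text{ transitions over expected visits to } i)$$

$$\hat b_i(k) = \frac{\sum_{t:\,y_t = k} \alpha_t(i)\, \beta_t(i)}{\sum_t \alpha_t(i)\, \beta_t(i)}
\qquad\text{(expected number of } i \text{ states emitting } k \text{ over expected number of } i \text{ states)}$$

The overall likelihood L appears in both numerator and denominator and cancels.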
Hatted A and b are improved estimates of the hidden parameters.
With them, you can go back, redo the forward-backward pass, and re-estimate A and b again. And so forth.
Does this remind you of the EM method?
It should! It is another special case. Can prove by the same kind of
convexity method as previously that Baum-Welch re-estimation always
increases the overall likelihood L, iteratively to an (as usual possibly only
local) maximum.
Notice that re-estimation doesn’t require any additional information, or
any training data. It is pure “unsupervised learning”.
Before (previous result) vs. after re-estimation (data size N = 10^5): parsing (forward-backward) can work on even small fragments, but re-estimation takes a lot of data. [Also shown: how the log-likelihood increases with iteration number.]
On many problems, re-estimation can hill-climb to the “right” answer from
amazingly crude initial guesses
a[0][0] = 1.-1./100.;
a[0][1] = 1.-a[0][0];
a[1][1] = 1.-1./100.;
a[1][2] = 1.-a[1][1];
a[2][0] = 1.;
for (i=0;i<26;i++) {
b[0][i] = 1./26.;
b[1][i] = 1./26.;
b[2][i] = 1./26.;
}
The crude initial guess encodes only that:
– genes and intergenes alternate and are each about 100 long
– there is a one-symbol gene end marker
– but we don’t know anything about which symbols are preferred in genes, end-genes, or intergenes

[Plots: the log-likelihood increases monotonically with iteration number (a period of stagnation is not unusual), and the accuracy (in this example we know the “right” answers!). The accuracy at the 1st step is an artifact: it is calling all states as 0, which happens to be true ~90% of the time.]
Final estimates of the transition and symbol probability matrices:
A (transition matrix, rows = “from”, columns = “to”):

             state I   state G   state Z
  state I    0.99397   0.00603   0.00000
  state G    0.00000   0.96991   0.03009
  state Z    1.00000   0.00000   0.00000

1/0.00603 = 166 and 1/0.03009 = 33.2 (why these values?)

b (symbol probabilities for each state):

         state I   state G   state Z
  A      0.03746   0.09039   0.00245
  E      0.03935   0.09196   0.00218
  I      0.03684   0.08903   0.00266
  O      0.03875   0.08992   0.00109
  U      0.03740   0.09214   0.00144
  B      0.03772   0.02590   0.00096
  C      0.03891   0.02716   0.00686
  D      0.03945   0.02792   0.00140
  F      0.03862   0.02515   0.00037
  G      0.03888   0.02505   0.00057
  H      0.03884   0.02874   0.00116
  J      0.03652   0.02926   0.00188
  K      0.03838   0.02777   0.00069
  L      0.03836   0.02673   0.00113
  M      0.03823   0.02822   0.00035
  N      0.03885   0.02639   0.00005
  P      0.03888   0.02677   0.00493
  Q      0.03880   0.02743   0.00572
  R      0.04055   0.02844   0.00115
  S      0.03933   0.02862   0.00769
  T      0.03923   0.02393   0.00053
  V      0.03924   0.02772   0.00057
  W      0.03862   0.02826   0.00308
  X      0.03820   0.02945   0.00039
  Y      0.03871   0.02759   0.00045
  Z      0.03586   0.00006   0.95026

Yes, it discovered all the vowels.
It discovered Z, but didn’t quite figure out that state 3 always emits Z.
An obvious flaw in the model: Self-loops in Markov models must always
give (discrete approximation of) exponentially distributed residence times
(Diagram: a state with a self-loop and an exit; the waiting time to an event in a Poisson process is exponentially distributed.)
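Concretely, a state with self-loop probability p has a geometric residence time,

$$P(\tau = t) = (1-p)\, p^{\,t-1}, \qquad \langle\tau\rangle = \frac{1}{1-p},$$

the discrete analog of the exponential. So a[0][0] = 1.-1./250. above fixes the mean intergene residence at 250, but the shape of the distribution is stuck at (discrete) exponential.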
But Qiqiqi genes and intergenes are roughly gamma-law distributed in length
(In fact, they’re exactly gamma-law,
because that’s how I constructed the
genome – not from a Markov model!)
mug = 50.;                              // gene length: mean 50
sigg = 10.;                             //              s.d. 10
mui = 250.;                             // intergene length: mean 250
ag = SQR(mug/sigg);                     // Gamma shape parameter alpha = (mean/s.d.)^2 = 25
bg = mug/SQR(sigg);                     // Gamma rate parameter beta = mean/variance = 0.5
Gammadev rani(2.,2./mui,ran.int32());   // intergene length deviates, Gamma(2, 2/250), mean 250
Gammadev rang(ag,bg,ran.int32());       // gene length deviates, Gamma(25, 0.5), mean 50
Can we make the results more accurate
by somehow incorporating length info?
Generalized Hidden Markov Model (GHMM)
also called Hidden Semi-Markov Model (HSMM)
the idea is to impose (or learn by re-estimation) an arbitrary probability
distribution for the residency time t in each state
can be thought of as an ordinary HMM where every state gets expanded into
a “timer” cluster
output symbol probabilities identical for all states in a timer
(equal those of the state before it was expanded)
arbitrary distribution with τ ≤ n:
$$\tau \sim [\,p_1,\ (1-p_1)p_2,\ (1-p_1)(1-p_2)p_3,\ \ldots\,]$$
Gamma-law distribution:
$$\tau \sim \mathrm{Gamma}(\alpha, p), \qquad \langle\tau\rangle = \frac{\alpha}{p}$$
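As a check on the Gamma-law timer (a sketch, using the initialization shown on the next slide, where all n timer states share one self-loop probability q): the total residence time is a sum of n geometric waits,

$$P(\tau = t) = \binom{t-1}{n-1}\,(1-q)^n\, q^{\,t-n}, \qquad \langle\tau\rangle = \frac{n}{1-q},$$

the negative-binomial (discrete Gamma-law) distribution with shape n. Choosing q = 1 − n/⟨τ⟩, as in a[i][i] = 1.-n0/len0 below, therefore gives the desired mean length.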
So, our intergene-gene-Z example becomes:
initialize the model like this:

Int i,j,n01=n0+n1,mstat=n01+1,niter=40;   // n0 and n1 are input values (we'll try various choices)
Doub len0=250.,len1=50.;                  // initial guess for lengths (need not be this perfect)
MatDoub b(mstat,26,0.), a(mstat,mstat,0.);
for (i=0;i<n0;i++) {                      // Gamma-law timer of n0 states for the intergene
a[i][i] = 1.-n0/len0;
a[i][i+1] = 1.-a[i][i];
}
for (i=n0;i<n01;i++) {                    // Gamma-law timer of n1 states for the gene
a[i][i] = 1.-n1/len1;
a[i][i+1] = 1.-a[i][i];
}
a[n01][0] = 1.;
for (j=0;j<n01;j++) for (i=0;i<26;i++) b[j][i] = 1./26.;
b[n01][25] = 1.;                          // tell it about Z, but not about vowels
We can use NR3’s HMM class for GHMMs by the kludge of averaging the
output probabilities after each Baum-Welch re-estimation
HMM hmm(a,b,seq);
hmm.forwardbackward();
for (i=1;i<niter;i++) {
hmm.baumwelch();
collapse_to_ghmm(hmm,n0,n1);
hmm.forwardbackward();
}
with
void collapse_to_ghmm(HMM &hmm, Int n0, Int n1) {
Int i,j,n01=n0+n1;
Doub sum;
for (j=0;j<26;j++) {
for (sum=0.,i=0;i<n0;i++) sum += hmm.b[i][j];
sum /= n0;
for (i=0;i<n0;i++) hmm.b[i][j] = sum;     // share the average over the intergene timer states
for (sum=0.,i=n0;i<n01;i++) sum += hmm.b[i][j];
sum /= n1;
for (i=n0;i<n01;i++) hmm.b[i][j] = sum;   // share the average over the gene timer states
}
}
See hmmmex.cpp. Actually this is not
quite right, because it should be a
weighted average by the number of times
each state is occupied. The right way to
do this would be to overload
hmm.baumwelch with a slightly modified
version that does the average properly.
In this example the effect would be
negligible.
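For reference, one way to write the properly weighted average (a sketch, not code from hmmmex.cpp): using the forward-backward state probabilities P_t(i), the shared symbol probability for a timer would be

$$\hat b_{\mathrm{timer}}(k) = \frac{\sum_{i\in\mathrm{timer}}\ \sum_{t:\,y_t = k} P_t(i)}{\sum_{i\in\mathrm{timer}}\ \sum_{t} P_t(i)},$$

i.e. each timer state's estimate weighted by how often that state is occupied.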
So how well do we do?
Accuracy wrt genes is shown as a 2×2 table of fractions [TP FP; FN TN], with sensitivity = TP/(TP+FN) and specificity = TN/(FP+TN).

n0 = n1 = 1 (previous HMM):
accuracy = 0.9629
table =
    TP 0.1466    FP 0.0144
    FN 0.0227    TN 0.8163

n0 = 2, n1 = 5:
accuracy = 0.9690
table =
    TP 0.1498    FP 0.0115
    FN 0.0195    TN 0.8192

n0 = 3, n1 = 8:
accuracy = 0.9726
table =
    TP 0.1518    FP 0.0099
    FN 0.0175    TN 0.8208

For whole genes (length ~50), the sensitivity and specificity are basically 1.0000, because, with pvowel=0.45, the gene is highly statistically significant. What the HMM or GHMM does well is to call the boundaries as exactly as possible. Typically, for this example, it's starting a gene ~5 too late or ~3 too early.

Obviously there's a theoretical bound on the achievable accuracy, given that the exact sequence of an FP or FN might also occur as a TP or TN. Can you calculate or estimate the bound?