Hidden Markov Models

Fundamentals and applications to
bioinformatics.
Markov Chains
• Given a finite discrete set S of possible states, a
Markov chain process occupies one of these
states at each unit of time.
• The process either stays in the same state or
moves to some other state in S.
• This occurs in a stochastic way, rather than in a
deterministic one.
• The process is memoryless and time-homogeneous: the next state depends only on the current state, and the transition probabilities do not change over time.
Transition Matrix
• Let S={S1, S2, S3}. A Markov Chain is
described by a table of transition
probabilities such as the following:
        S1    S2    S3
  S1     0     1     0
  S2    1/3   2/3    0
  S3    1/2   1/3   1/6
(Rows give the current state, columns the next state; each row sums to 1.)
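The following minimal Python sketch shows how a chain with this transition table can be simulated (the dictionary layout, function name, and starting state are our own choices, not part of the slides):

```python
import random

# Transition table from the slide above: rows = current state, columns = next state.
P = {
    "S1": {"S1": 0.0, "S2": 1.0, "S3": 0.0},
    "S2": {"S1": 1/3, "S2": 2/3, "S3": 0.0},
    "S3": {"S1": 1/2, "S2": 1/3, "S3": 1/6},
}

def simulate(start, steps, rng=random):
    """Run the chain for `steps` transitions and return the sequence of visited states."""
    path = [start]
    for _ in range(steps):
        row = P[path[-1]]
        # Draw the next state with the probabilities in the current state's row.
        path.append(rng.choices(list(row), weights=list(row.values()))[0])
    return path

print(simulate("S3", 10))
```

Each draw depends only on the current state, which is exactly the memoryless property described above.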
A simple example
• Consider a 3-state Markov model of the weather. We assume that once a day the weather is observed as being one of the following: rainy or snowy (state 1), cloudy (state 2), sunny (state 3).
• We postulate that on day t the weather is characterized by a single one of the three states above, and give ourselves a transition probability matrix A given by:
 0.4 0.3 0.3 


 0.2 0.6 0.2 
 0.1 0.1 0.8 


• Given that the weather on day 1 is sunny, what is the probability that the weather for the next 7 days will be “sun-sun-rain-rain-sun-cloudy-sun”?
$$P(S_3, S_3, S_3, S_1, S_1, S_3, S_2, S_3 \mid \text{Model})$$
$$= P[S_3]\,P[S_3 \mid S_3]^2\,P[S_1 \mid S_3]\,P[S_1 \mid S_1]\,P[S_3 \mid S_1]\,P[S_2 \mid S_3]\,P[S_3 \mid S_2]$$
$$= 1 \cdot (0.8)^2 (0.1)(0.4)(0.3)(0.1)(0.2)$$
$$= 1.536 \times 10^{-4}$$
• Given that the model is in a known state, what is the probability that it stays in that state for exactly d days?
• The answer is
$$p_i(d) = (a_{ii})^{d-1}(1 - a_{ii})$$
• Thus the expected number of consecutive days in the same state is
$$\bar{d}_i = \frac{1}{1 - a_{ii}}$$
• So the expected number of consecutive sunny days, according to the model, is 1/(1 - 0.8) = 5.
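The same arithmetic can be carried out for all three states of the weather model; a small sketch reusing the matrix A above (variable names are ours):

```python
# Expected number of consecutive days spent in each state: 1 / (1 - a_ii).
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

for i, row in enumerate(A, start=1):
    a_ii = row[i - 1]
    print(f"state {i}: expected run length = {1 / (1 - a_ii):.2f} days")
# State 3 (sunny) gives 1 / (1 - 0.8) = 5 days, as stated above.
```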
Hidden?
• What if each state does not correspond to
an observable (physical) event? What if
the observation is a probabilistic function
of the state?
• To clarify, let us analyze another simple
example, before formally defining Hidden
Markov Models, or simply HMMs.
• The Urn and Ball Model
Elements of an HMM
An HMM is characterized by the following:
1) N, the number of states in the model.
2) M, the number of distinct observation symbols per
state.
3) the state transition probability distribution A = {aij}, where aij = P(qj | qi) is the probability of moving to state Sj given that the model is in state Si.
4) the observation symbol probability distribution in state Sj, B = {bj(k)}, where bj(k) is the probability that the k-th observation symbol is emitted at time t, given that the model is in state Sj.
5) the initial state distribution π = {πi}, where πi = P(q1 = Si).
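These five ingredients map directly onto a small data structure. Below is a minimal Python sketch (the class name, field names, and toy numbers are our own illustrations, not part of the slides):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HMM:
    A:  List[List[float]]   # N x N state transition probabilities a_ij
    B:  List[List[float]]   # N x M observation (emission) probabilities b_i(k)
    pi: List[float]         # length-N initial state distribution

    @property
    def N(self):            # number of states
        return len(self.A)

    @property
    def M(self):            # number of distinct observation symbols
        return len(self.B[0])

# A toy 2-state, 3-symbol model with arbitrary numbers, used again in later sketches.
toy = HMM(A=[[0.7, 0.3], [0.4, 0.6]],
          B=[[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
          pi=[0.6, 0.4])
print(toy.N, toy.M)  # 2 3
```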
Three Basic Problems for
HMMs
1) Given the observation sequence O = O1O2…OT and a model m = (A, B, π), how do we efficiently compute P(O | m)?
2) Given the observation sequence O and a model m, how do we choose a corresponding state sequence Q = q1q2…qT which is optimal in some meaningful sense?
3) How do we adjust the model parameters to maximize P(O | m)?
Solution to Problem (1)
• Given an observed output sequence O, we have that
$$P[O] = \sum_{Q} P[O \mid Q]\, P[Q]$$
• This calculation involves a sum over the N^T possible state sequences, each term being a product of 2T factors. The total number of operations is therefore on the order of 2T·N^T.
• Fortunately, there is a much more efficient
algorithm, called the forward algorithm.
The Forward Algorithm
• It focuses on the calculation of the quantity
$$\alpha(t, i) = P[O_1 \cdots O_t,\; q_t = S_i],$$
which is the joint probability that the sequence of observations seen up to and including time t is O1,…,Ot, and that the state of the HMM at time t is Si. Once these quantities are known,
$$P(O) = \sum_{i=1}^{N} \alpha(T, i).$$
…continuation
• The calculation of the α(t, i)'s is by induction on t, starting from
$$\alpha(1, i) = \pi_i\, b_i(O_1).$$
From the formula
$$\alpha(t+1, i) = \sum_{j=1}^{N} P[O_1 \cdots O_{t+1},\; q_{t+1} = S_i,\; q_t = S_j]$$
we get
$$\alpha(t+1, i) = \left[ \sum_{j=1}^{N} \alpha(t, j)\, a_{ji} \right] b_i(O_{t+1}).$$
Backward Algorithm
• Another approach is the backward
algorithm.
• Specifically, we calculate β(t, i) by the formula
$$\beta(t, i) = P[O_{t+1} \cdots O_T \mid q_t = S_i]$$
• Again, the β(t, i)'s are found by induction, starting from β(T, i) = 1 for every state, then computing the values for t = T - 1, then t = T - 2, and so on, eventually working back to t = 1.
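The backward pass can be sketched in the same unscaled style (0-based indices in the code; the function name is ours):

```python
def backward(A, B, obs):
    """Return beta, where beta[t][i] is the probability of O_{t+2}..O_T
    given that the state at time t+1 is i (0-based indices)."""
    N, T = len(A), len(obs)
    beta = [[1.0] * N]                      # beta(T, i) = 1 for every state
    for t in range(T - 2, -1, -1):          # times T-1, T-2, ..., 1 in the slides' terms
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta
```

As a consistency check, Σi πi bi(O1) β(1, i) equals the same P(O) produced by the forward pass.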
Solution to Problem (2)
• Given an observed sequence O = O1,…,OT of
outputs, we want to compute efficiently a state
sequence Q = q1,…,qT that has the highest
conditional probability given O.
• In other words, we want to find a Q that makes
P[Q | O] maximal.
• There may be many Q’s that make P[Q | O]
maximal. We give an algorithm to find one of
them.
The Viterbi Algorithm
• It is divided in two steps. First it finds
maxQ P[Q | O], and then it backtracks to
find a Q that realizes this maximum.
• First define, for arbitrary t and i, δ(t, i) to be the maximum probability over all ways of ending in state Si at time t having observed the sequence O1O2…Ot.
• Then maxQ P[Q and O] = maxi δ(T, i).
• But
$$\max_Q P[Q \mid O] = \max_Q \frac{P[Q \text{ and } O]}{P[O]}$$
• Since the denominator on the RHS does not depend on Q, we have
$$\arg\max_Q P[Q \mid O] = \arg\max_Q P[Q \text{ and } O]$$
• We calculate the δ(t, i)'s inductively:
$$\delta(1, i) = \pi_i\, b_i(O_1)$$
$$\delta(t, j) = \max_{1 \le i \le N} \delta(t-1, i)\, a_{ij}\, b_j(O_t)$$
• Finally, we recover the qt's as follows. Define
$$\psi(T) = \arg\max_{1 \le i \le N} \delta(T, i)$$
and put qT = Sψ(T). This is the last state in the desired state sequence. The remaining qt for t < T are found recursively by defining
$$\psi(t) = \arg\max_{1 \le i \le N} \delta(t, i)\, a_{i\,\psi(t+1)}$$
and putting qt = Sψ(t).
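Putting the two steps together, here is a compact Python sketch of the Viterbi algorithm. It is unscaled (log probabilities would be used in practice) and it stores back-pointers during the forward sweep rather than recomputing ψ(t) afterwards, which yields a path realizing the same maximum:

```python
def viterbi(A, B, pi, obs):
    """Return (best_prob, best_path), where best_path maximizes P(Q and O)."""
    N, T = len(A), len(obs)
    delta = [[pi[i] * B[i][obs[0]] for i in range(N)]]   # delta(1, i)
    back  = [[0] * N]                                    # back-pointer bookkeeping
    for t in range(1, T):
        d_t, b_t = [], []
        for j in range(N):
            # delta(t, j) = max_i delta(t-1, i) * a_ij * b_j(O_t)
            cand = [delta[-1][i] * A[i][j] for i in range(N)]
            i_best = max(range(N), key=lambda i: cand[i])
            d_t.append(cand[i_best] * B[j][obs[t]])
            b_t.append(i_best)
        delta.append(d_t)
        back.append(b_t)
    # Backtrack: start from argmax_i delta(T, i), then follow the stored pointers.
    q = [max(range(N), key=lambda i: delta[-1][i])]
    for t in range(T - 1, 0, -1):
        q.insert(0, back[t][q[0]])
    return max(delta[-1]), q
```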
Solution to Problem (3)
• We are given a set of observed data from an
HMM for which the topology is known. We wish
to estimate the parameters in that HMM.
• We briefly describe the intuition behind the
Baum-Welch method of parameter estimation.
• Assume that the observation alphabet (of size M) and the number of states N are fixed at the outset.
• The data we use to estimate the parameters
constitute a set of observed sequences {O(d)}.
The Baum-Welch Algorithm
• We start by setting the parameters πi, aij, bi(k) at some initial values.
• We then calculate, using these initial parameter values:
1) πi* = the expected proportion of times in state Si at the first time point, given {O(d)}.
2)
$$a_{ij}^{*} = \frac{E[N_{ij} \mid \{O^{(d)}\}]}{E[N_i \mid \{O^{(d)}\}]}$$
3)
$$b_i^{*}(k) = \frac{E[N_i(k) \mid \{O^{(d)}\}]}{E[N_i \mid \{O^{(d)}\}]}$$
where Nij is the random number of times qt(d) =Si
and qt+1(d) = Sj for some d and t; Ni is the random
number of times qt(d) = Si for some d and t; and
Ni(k) equals the random number of times qt(d) =
Si and it emits symbol k, for some d and t.
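One re-estimation pass over the data can be sketched as follows, reusing the unscaled forward and backward functions from the earlier sketches. This is only our illustration of the expected-count formulas above (no scaling, and no guards against zero counts), not the exact procedure from the slides:

```python
def baum_welch_step(A, B, pi, sequences):
    """One Baum-Welch re-estimation step over a set of observed sequences {O^(d)}."""
    N, M = len(A), len(B[0])
    pi_num = [0.0] * N                       # expected visits to S_i at the first time point
    a_num  = [[0.0] * N for _ in range(N)]   # E[N_ij | {O^(d)}]
    a_den  = [0.0] * N                       # expected visits to S_i over t = 1..T-1
    b_num  = [[0.0] * M for _ in range(N)]   # E[N_i(k) | {O^(d)}]
    for obs in sequences:
        T = len(obs)
        alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
        p_obs = sum(alpha[-1])               # P(O^(d))
        for t in range(T):
            for i in range(N):
                gamma = alpha[t][i] * beta[t][i] / p_obs   # P(q_t = S_i | O^(d))
                if t == 0:
                    pi_num[i] += gamma
                b_num[i][obs[t]] += gamma
                if t < T - 1:
                    a_den[i] += gamma
                    for j in range(N):
                        # P(q_t = S_i, q_{t+1} = S_j | O^(d))
                        a_num[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                                        * beta[t + 1][j] / p_obs)
    D = len(sequences)
    new_pi = [x / D for x in pi_num]
    new_A  = [[a_num[i][j] / a_den[i] for j in range(N)] for i in range(N)]
    new_B  = [[b_num[i][k] / sum(b_num[i]) for k in range(M)] for i in range(N)]
    return new_A, new_B, new_pi
```

Iterating this step gives the monotone improvement described on the next slide.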
Upshot
• It can be shown that if m = (πi, aij, bi(k)) is replaced by m* = (πi*, aij*, bi*(k)), then P[{O(d)} | m*] ≥ P[{O(d)} | m], with equality holding if and only if m* = m.
• Thus successive iterations continually
increase the probability of the data, given
the model. Iterations continue until a local
maximum of the probability is reached.
Applications
Some preliminary remarks
• Sequence alignment is useful for discovering
functional, structural, and evolutionary
information in biological research.
• Different metrics (or notions of distance) could
be defined to compare sequences.
• Mathematician Peter Sellers (1974) showed that
if a sequence alignment is formulated in terms of
distances instead of similarity, a biologically
more appealing interpretation of gaps is possible.
• The latter is an evolution-motivated definition,
relying on the concept of ancestry.
Modeling Protein Families
• The states of our HMM will be divided into
match states, insert states and delete states.
• It is useful to include an initial state and a final
one, and we assume that no match or delete state
is visited more than once.
• The alphabet consists of the twenty amino acids together with one dummy symbol δ representing “delete”. Delete states output δ only.
• Each insert and match state has its own distribution over the 20 amino acids, and does not emit the symbol δ.
• If the emission probabilities for the match and
insert states are uniform over the 20 amino acids,
the model will produce random sequences not
having much in common.
• If each state emits one specific amino acid with
probability 1, then the model will always produce
the same sequence.
• Somewhere in between these two extremes, the
parameters can be set so that the model is
interesting.
• Each choice of parameters produces a different
family of sequences.
• This family can be rather “tight”, or it can be
rather “loose”.
• It is possible that the tightness occurs locally.
• Allowing gap penalties and substitution
probabilities to vary along the sequences reflects
biological reality better.
• Dynamic programming and BLAST are essential
for certain applications, but HMMs are more
efficient for modeling large families of
sequences.
• The HMM model is sufficiently flexible to model
the varying features of a protein along its length.
• The model described has proven in practice to
provide a good compromise between flexibility
and tractability. Such HMMs are called profile
HMMs.
• All applications start with training.
• This estimation procedure uses the Baum-Welch algorithm.
• The model is chosen to have length equal
to the average length of a sequence in the
training set, and all parameters are
initialized by using uniform distributions.
Multiple Sequence Alignment
• The msa of a set of sequences may be viewed as
an evolutionary history of the sequences.
• HMMs often provide an msa as good as, if not better than, that produced by other methods.
• The approach is well grounded in probability theory.
• No sequence ordering is required.
• Insertion/deletion penalties are not needed.
• Experimentally derived information may be
incorporated.
Description
• In this section we describe how to use the theory of the previous section to compute an msa for a set of sequences.
• The sequences to be aligned are used as the
training data, to train the parameters of the
model.
• For each sequence, the Viterbi algorithm is then
used to determine a path most likely to have
produced that sequence.
• Consider the sequences CAEFDDH and CDAEFPDDH.
• Suppose the model has length 10 and their most likely paths through the model are
m0 m1 m2 m3 m4 d5 d6 m7 m8 m9 m10 and
m0 m1 i1 m2 m3 m4 d5 m6 m7 m8 m9 m10,
where m0 and m10 denote the initial and final states of the model.
• The alignment induced is found by aligning positions that were generated by the same match state. This leads to the alignment
C-AEF-DDH
CDAEFPDDH
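The following sketch shows how such an alignment can be read off from the paths. The list encoding of the paths, the function name, and the reading of m0/m10 as silent initial/final states (so that m1-m9 are the emitting match positions) are our own assumptions for illustration:

```python
def profile_alignment(seqs, paths, n_match):
    """Induce an alignment from per-sequence paths through a profile HMM.

    Each path is a list of labels such as "m1", "i1", "d5" (silent begin/end states
    omitted); match ("m") and insert ("i") states emit one residue, delete ("d")
    states emit nothing. At most one insert per position per sequence is handled."""
    # Insert positions used by any path become extra alignment columns.
    ins_cols = {int(s[1:]) for p in paths for s in p if s[0] == "i"}
    # Column order: m1, (i1), m2, (i2), ..., m_n; insert column k follows match column k.
    columns = []
    for k in range(1, n_match + 1):
        columns.append(("m", k))
        if k in ins_cols:
            columns.append(("i", k))
    rows = []
    for seq, path in zip(seqs, paths):
        emitted, pos = {}, 0
        for state in path:
            kind, k = state[0], int(state[1:])
            if kind in ("m", "i"):           # match/insert states consume one residue
                emitted[(kind, k)] = seq[pos]
                pos += 1
            # delete states consume nothing, leaving a gap in that match column
        rows.append("".join(emitted.get(col, "-") for col in columns))
    # Drop columns that are gaps in every sequence (e.g. a position all paths delete).
    keep = [j for j in range(len(columns)) if any(r[j] != "-" for r in rows)]
    return ["".join(row[j] for j in keep) for row in rows]

seqs  = ["CAEFDDH", "CDAEFPDDH"]
paths = [["m1", "m2", "m3", "m4", "d5", "d6", "m7", "m8", "m9"],
         ["m1", "i1", "m2", "m3", "m4", "d5", "m6", "m7", "m8", "m9"]]
for row in profile_alignment(seqs, paths, 9):
    print(row)   # prints the two rows of the alignment above, with "-" for gaps
```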
Pfam
• Pfam is a web-based resource maintained by the
Sanger Center http://www.sanger.ac.uk/Pfam
• Pfam uses the basic theory described above to
determine protein domains in a query sequence.
• Suppose that a new protein is obtained for which
no information is available except the raw
sequence.
• We wish to “annotate” this sequence.
• The typical starting point is a BLAST search.
• Pfam returns a family of protein domains, which
enriches the information obtained by a BLAST
search.
• The domains in Pfam are determined based on
expert knowledge, sequence similarity, and other
protein family databases.
• Currently, Pfam contains more than 2000
domains.
• For each domain a set of examples of this domain
is selected.
• The sequences representing each domain are put
into an alignment, and the alignments themselves
are used to set the parameters.
• Recall that an alignment implies for each
sequence in the alignment a path through the
HMM, as in the previous sections.
• The proportion of times these paths take a given transition, and the proportion of times a given symbol is emitted in a given state, are used to estimate the transition and emission probabilities.
• Given the HMMs for all the domains, a query sequence is then run through each one using the forward algorithm.
• When a portion of the query sequence has a probability of having been produced by an HMM that exceeds a certain cutoff, the corresponding domain is reported.
Gene Finding
• Currently, a popular and successful gene finder for human DNA sequences is GENSCAN (Burge et al. 1997).
• It is based on a generalization of HMMs, called
Semihidden Markov Models.
• The algorithms involved in this model are an
order of magnitude more complex than for a
regular HMM.
• The gene-finding application requires a
generalization of the Viterbi algorithm.
• Burge (1997) observed that if the lengths of the long intergenic regions can be taken as having geometric distributions, and if the sequence within these regions is generated in a relatively i.i.d. fashion, then the algorithm can be adjusted so that practical running times can be obtained.
Final Remarks
• HMMs have been used to model alignments of
three-dimensional structure in proteins (Stultz et
al. 1993; Hubbard and Park 1995; Di Francesco
et al. 1997, 1999; FORREST Web server at
http://absapha.dcrt.nih.gov:8008/)
• In one example of this approach, the models are
trained on patterns of α helices, β strands, tight
turns, and loops in specific structural classes,
which then may be used to provide the most
probable structure and structural class of a
protein.
Well… those weren’t the final
remarks
• A version of GeneMark (Borodovsky and McIninch 1993) called GeneMark.HMM uses a
particular type of HMM (called a fifth-order
Markov Model) to search for E. coli genes
(Lukashin and Borodovsky 1998).
• The success of the HMM method depends on
having appropriate initial or prior conditions, i.e.,
a good prior model for the sequences and a
sufficient number of sequences to train the
model.
Finally
• Another consideration in using HMMs is the
number of sequences. If a good prior model is
used, it should be possible to train the HMM with
as few as 20 sequences.
• In general, the smaller the number of sequences, the more important the prior conditions become.
• HMMs are more effective if methods to inject
statistical noise into the model are used during
the training procedure.