Hidden Markov Models
Hidden Markov Models
Fundamentals and applications to
bioinformatics.
Markov Chains
• Given a finite discrete set S of possible states, a
Markov chain process occupies one of these
states at each unit of time.
• The process either stays in the same state or
moves to some other state in S.
• This occurs in a stochastic way, rather than in a
deterministic one.
• The process is memoryless and time-homogeneous: the next state depends only on the current state, and the transition probabilities do not change over time.
Transition Matrix
• Let S={S1, S2, S3}. A Markov Chain is
described by a table of transition
probabilities such as the following:
        S1     S2     S3
  S1    0      1      0
  S2    1/3    2/3    0
  S3    1/2    1/3    1/6
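As a quick illustration (not part of the original slides), the following minimal Python sketch simulates such a chain; the state names and probabilities simply copy the table above.

```python
import random

# Transition table from the slide: each row gives P(next state | current state).
STATES = ["S1", "S2", "S3"]
P = {
    "S1": [0.0, 1.0, 0.0],
    "S2": [1/3, 2/3, 0.0],
    "S3": [1/2, 1/3, 1/6],
}

def simulate(start, steps):
    """Simulate `steps` transitions of the Markov chain, starting from `start`."""
    path = [start]
    for _ in range(steps):
        current_row = P[path[-1]]
        # random.choices picks the next state with the probabilities in the current row.
        path.append(random.choices(STATES, weights=current_row)[0])
    return path

print(simulate("S3", 10))   # e.g. ['S3', 'S1', 'S2', 'S2', ...]
```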
A simple example
• Consider a 3-state Markov model of the weather. We assume that once a day the weather is observed as being one of the following: rainy or snowy (state S1), cloudy (state S2), or sunny (state S3).
• We postulate that on day t the weather is characterized by a single one of the three states above, and give ourselves a transition probability matrix A given by:
        S1     S2     S3
  S1    0.4    0.3    0.3
  S2    0.2    0.6    0.2
  S3    0.1    0.1    0.8
• Given that the weather on day 1 is sunny, what is the probability that the weather for the next 7 days will be "sun-sun-rain-rain-sun-cloudy-sun"?

P(S3, S3, S3, S1, S1, S3, S2, S3 | Model)
  = P[S3] · P[S3 | S3]^2 · P[S1 | S3] · P[S1 | S1] · P[S3 | S1] · P[S2 | S3] · P[S3 | S2]
  = 1 · (0.8)^2 · (0.1) · (0.4) · (0.3) · (0.1) · (0.2)
  = 1.536 × 10^-4
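As a quick arithmetic check (my addition; states are indexed 0 = rain/snow, 1 = cloudy, 2 = sunny), the product above can be computed directly:

```python
# Weather transition matrix A from the slide, with 0-based state indices.
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

path = [2, 2, 2, 0, 0, 2, 1, 2]   # sunny, sun, sun, rain, rain, sun, cloudy, sun
prob = 1.0                        # P[day 1 is sunny] = 1, since it is given
for prev, cur in zip(path, path[1:]):
    prob *= A[prev][cur]          # multiply in the transition probability for each day
print(prob)                       # 0.0001536, i.e. 1.536 x 10^-4
```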
• Given that the model is in a known state, what is the probability that it stays in that state for exactly d days?
• The answer is p_i(d) = (a_ii)^(d-1) · (1 − a_ii).
• Thus the expected number of consecutive days in the same state is
  d_i = 1 / (1 − a_ii).
• So the expected number of consecutive sunny days, according to the model, is 1 / (1 − 0.8) = 5.
Hidden?
• What if each state does not correspond to
an observable (physical) event? What if
the observation is a probabilistic function
of the state?
• To clarify, let us analyze another simple
example, before formally defining Hidden
Markov Models, or simply HMMs.
• The Urn and Ball Model
Elements of an HMM
An HMM is characterized by the following:
1) N, the number of states in the model.
2) M, the number of distinct observation symbols per
state.
3) the state transition probability distribution A = {a_ij}, where a_ij = P(q_{t+1} = S_j | q_t = S_i)
4) the observation symbol probability distribution in state S_j, B = {b_j(k)}, where b_j(k) is the probability that the k-th observation symbol is emitted at time t, given that the model is in state S_j
5) the initial state distribution π = {π_i}
Three Basic Problems for
HMMs
1) Given the observation sequence O = O1O2…OT and a model m = (A, B, π), how do we efficiently compute P(O | m)?
2) Given the observation sequence O and a model m, how do we choose a corresponding state sequence Q = q1q2…qT that is optimal in some meaningful sense?
3) How do we adjust the model parameters to maximize P(O | m)?
Solution to Problem (1)
• Given an observed output sequence O, we have that
  P[O] = Σ_Q P[O | Q] · P[Q],
  where the sum is taken over all possible state sequences Q.
• This calculation involves a sum of N^T terms, each being a product of 2T factors, so the total number of operations is on the order of 2T · N^T.
• Fortunately, there is a much more efficient
algorithm, called the forward algorithm.
The Forward Algorithm
• It focuses on the calculation of the quantity
  α(t, i) = P[O_1 … O_t, q_t = S_i],
  which is the joint probability that the sequence of observations seen up to and including time t is O_1, …, O_t, and that the state of the HMM at time t is S_i. Once these quantities are known,
  P(O) = Σ_{i=1}^{N} α(T, i).
…continuation
• The calculation of the α(t, i)'s is by induction on t, starting from
  α(1, i) = π_i b_i(O_1).
• From the formula
  α(t+1, i) = Σ_{j=1}^{N} P[O_1 … O_{t+1}, q_{t+1} = S_i and q_t = S_j]
  we get
  α(t+1, i) = [Σ_{j=1}^{N} α(t, j) a_ji] · b_i(O_{t+1}).
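The recursion above translates almost line for line into code. The sketch below is my own illustration in plain Python (the argument names and data layout are assumptions, and a real implementation would rescale or work in log space to avoid underflow):

```python
def forward(obs, pi, A, B):
    """Forward algorithm: alpha[t][i] = P(O_1..O_t, q_t = S_i).

    obs : list of observation-symbol indices O_1..O_T
    pi  : initial state distribution, pi[i]
    A   : transition probabilities, A[i][j] = P(S_j at t+1 | S_i at t)
    B   : emission probabilities, B[i][k] = b_i(k)
    Returns (alpha, P(O)).
    """
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):                       # base case: alpha(1, i) = pi_i * b_i(O_1)
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):                    # induction on t
        for i in range(N):
            alpha[t][i] = sum(alpha[t - 1][j] * A[j][i] for j in range(N)) * B[i][obs[t]]
    return alpha, sum(alpha[T - 1])          # P(O) = sum_i alpha(T, i)
```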
Backward Algorithm
• Another approach is the backward
algorithm.
• Specifically, we calculate β(t, i) by the formula
  β(t, i) = P[O_{t+1} … O_T | q_t = S_i].
• Again, by induction one can find the β(t, i)'s, starting with the value t = T − 1, then the value t = T − 2, and so on, eventually working back to t = 1 (with the convention β(T, i) = 1).
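A matching sketch of the backward recursion, under the same assumed representation as the forward() example above:

```python
def backward(obs, A, B):
    """Backward algorithm: beta[t][i] = P(O_{t+1} ... O_T | q_t = S_i)."""
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]     # beta(T, i) = 1 by convention
    for t in range(T - 2, -1, -1):           # work back from t = T-1 down to t = 1
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta
```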
Solution to Problem (2)
• Given an observed sequence O = O1,…,OT of
outputs, we want to compute efficiently a state
sequence Q = q1,…,qT that has the highest
conditional probability given O.
• In other words, we want to find a Q that makes
P[Q | O] maximal.
• There may be many Q’s that make P[Q | O]
maximal. We give an algorithm to find one of
them.
The Viterbi Algorithm
• It is divided in two steps. First it finds
maxQ P[Q | O], and then it backtracks to
find a Q that realizes this maximum.
• First define, for arbitrary t and i, δ(t, i) to be the maximum probability of all ways to end in state S_i at time t and have observed sequence O_1 O_2 … O_t.
• Then max_Q P[Q and O] = max_i δ(T, i).
• But
  max_Q P[Q | O] = max_Q P[Q and O] / P[O].
• Since the denominator on the RHS does not depend on Q, we have
  argmax_Q P[Q | O] = argmax_Q P[Q and O].
• We calculate the δ(t, i)'s inductively:
  δ(1, i) = π_i b_i(O_1)
  δ(t, j) = max_{1≤i≤N} [δ(t−1, i) a_ij] · b_j(O_t).
• Finally, we recover the q_t's as follows. Define
  ψ(T) = argmax_{1≤i≤N} δ(T, i)
  and put q_T = S_{ψ(T)}. This is the last state in the desired state sequence. The remaining q_t for t < T are found recursively by defining
  ψ(t) = argmax_{1≤i≤N} δ(t, i) · a_{i, ψ(t+1)}
  and putting q_t = S_{ψ(t)}.
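For concreteness, here is a compact Python sketch of the two steps (the recursion for δ and the backtracking through the argmax indices). The function name and representation are mine, matching the forward() sketch above; log probabilities would be used in practice:

```python
def viterbi(obs, pi, A, B):
    """Return (best_prob, best_path), where best_path maximizes P(Q and O)."""
    N, T = len(pi), len(obs)
    delta = [[0.0] * N for _ in range(T)]    # delta(t, i): best prob. ending in S_i at t
    psi = [[0] * N for _ in range(T)]        # psi(t, i): argmax backpointer
    for i in range(N):
        delta[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            psi[t][j] = best_i
            delta[t][j] = delta[t - 1][best_i] * A[best_i][j] * B[j][obs[t]]
    # Backtrack from the best final state to recover one optimal state sequence.
    last = max(range(N), key=lambda i: delta[T - 1][i])
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return delta[T - 1][last], path
```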
Solution to Problem (3)
• We are given a set of observed data from an
HMM for which the topology is known. We wish
to estimate the parameters in that HMM.
• We briefly describe the intuition behind the
Baum-Welch method of parameter estimation.
• Assume that the alphabet size M and the number of states N are fixed at the outset.
• The data we use to estimate the parameters
constitute a set of observed sequences {O(d)}.
The Baum-Welch Algorithm
• We start by setting the parameters π_i, a_ij, b_i(k) at some initial values.
• We then calculate, using these initial
parameter values:
1) π_i* = the expected proportion of times in state S_i at the first time point, given {O^(d)}.
2) a_ij* = E[N_ij | {O^(d)}] / E[N_i | {O^(d)}]
3) b_i*(k) = E[N_i(k) | {O^(d)}] / E[N_i | {O^(d)}]
where N_ij is the random number of times q_t^(d) = S_i and q_{t+1}^(d) = S_j for some d and t; N_i is the random number of times q_t^(d) = S_i for some d and t; and N_i(k) equals the random number of times q_t^(d) = S_i and it emits symbol k, for some d and t.
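The sketch below (my own, for a single observed sequence and without the rescaling a real implementation needs) shows how these expected counts are obtained from the forward and backward quantities, assuming the forward() and backward() functions sketched earlier:

```python
def baum_welch_step(obs, pi, A, B):
    """One re-estimation step (single sequence, no scaling): returns (pi*, A*, B*)."""
    N, T, M = len(pi), len(obs), len(B[0])
    alpha, prob = forward(obs, pi, A, B)
    beta = backward(obs, A, B)

    # gamma[t][i] = P(q_t = S_i | O); xi[t][i][j] = P(q_t = S_i, q_{t+1} = S_j | O)
    gamma = [[alpha[t][i] * beta[t][i] / prob for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / prob
            for j in range(N)] for i in range(N)] for t in range(T - 1)]

    new_pi = gamma[0][:]                                  # expected time in S_i at t = 1
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /    # E[N_ij] / E[N_i]
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /   # E[N_i(k)] / E[N_i]
              sum(gamma[t][i] for t in range(T))
              for k in range(M)] for i in range(N)]
    return new_pi, new_A, new_B
```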
Upshot
• It can be shown that if m = (π_i, a_ij, b_i(k)) is replaced by m* = (π_i*, a_ij*, b_i*(k)), then P[{O^(d)} | m*] ≥ P[{O^(d)} | m], with equality holding if and only if m* = m.
• Thus successive iterations continually
increase the probability of the data, given
the model. Iterations continue until a local
maximum of the probability is reached.
Applications
Some preliminary remarks
• Sequence alignment is useful for discovering
functional, structural, and evolutionary
information in biological research.
• Different metrics (or notions of distance) could
be defined to compare sequences.
• Mathematician Peter Sellers (1974) showed that
if a sequence alignment is formulated in terms of
distances instead of similarity, a biologically
more appealing interpretation of gaps is possible.
• The latter is an evolution-motivated definition,
relying on the concept of ancestry.
Modeling Protein Families
• The states of our HMM will be divided into
match states, insert states and delete states.
• It is useful to include an initial state and a final
one, and we assume that no match or delete state
is visited more than once.
• The alphabet consists of the twenty amino acids together with one dummy symbol representing "delete". Delete states output only this dummy symbol.
• Each insert and match state has its own distribution over the 20 amino acids, and does not emit the dummy symbol.
• If the emission probabilities for the match and
insert states are uniform over the 20 amino acids,
the model will produce random sequences not
having much in common.
• If each state emits one specific amino acid with
probability 1, then the model will always produce
the same sequence.
• Somewhere in between these two extremes, the
parameters can be set so that the model is
interesting.
• Each choice of parameters produces a different
family of sequences.
• This family can be rather “tight”, or it can be
rather “loose”.
• It is possible that the tightness occurs locally.
• Allowing gap penalties and substitution
probabilities to vary along the sequences reflects
biological reality better.
• Dynamic programming and BLAST are essential
for certain applications, but HMMs are more
efficient for modeling large families of
sequences.
• The HMM model is sufficiently flexible to model
the varying features of a protein along its length.
• The model described has proven in practice to
provide a good compromise between flexibility
and tractability. Such HMMs are called profile
HMMs.
• All applications start with training.
• This estimation procedure uses the Baum-Welch algorithm.
• The model is chosen to have length equal
to the average length of a sequence in the
training set, and all parameters are
initialized by using uniform distributions.
Multiple Sequence Alignment
• The msa of a set of sequences may be viewed as
an evolutionary history of the sequences.
• HMMs often provide a msa as good as, if not
better than, other methods.
• The approach is well grounded in probability
theory
• No sequence ordering is required.
• Insertion/deletion penalties are not needed.
• Experimentally derived information may be
incorporated.
Description
• In this section we describe how to use the theory
of the previous section to compute msa for a set
of sequences.
• The sequences to be aligned are used as the
training data, to train the parameters of the
model.
• For each sequence, the Viterbi algorithm is then
used to determine a path most likely to have
produced that sequence.
• Consider the sequences CAEFDDH and
CDAEFPDDH
• Suppose the model has length 10 and their most
likely paths through the model are
m0m1m2m3m4d5d6m7m8m9m10 and
m0m1i1m2m3m4d5m6m7m8m9m10.
• The alignment induced is found by aligning positions that were generated by the same match state. This leads to the alignment

  C-AEF-DDH
  CDAEFPDDH
Pfam
• Pfam is a web-based resource maintained by the
Sanger Center http://www.sanger.ac.uk/Pfam
• Pfam uses the basic theory described above to
determine protein domains in a query sequence.
• Suppose that a new protein is obtained for which
no information is available except the raw
sequence.
• We wish to “annotate” this sequence.
• The typical starting point is a BLAST search.
• Pfam returns a family of protein domains, which
enriches the information obtained by a BLAST
search.
• The domains in Pfam are determined based on
expert knowledge, sequence similarity, and other
protein family databases.
• Currently, Pfam contains more than 2000
domains.
• For each domain, a set of example sequences containing that domain is selected.
• The sequences representing each domain are put
into an alignment, and the alignments themselves
are used to set the parameters.
• Recall that an alignment implies for each
sequence in the alignment a path through the
HMM, as in the previous sections.
• The proportion of times these paths take a given
transition is used to estimate the transition and
the emission probabilities.
• Given the HMMs for all the domains, a query
sequence is then run past each one using a
forward algorithm.
• When a portion of the query sequence has a probability of having been produced by one of the domain HMMs that exceeds a certain cutoff, the corresponding domain is reported.
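Very loosely, the idea is the following (Pfam's real pipeline uses profile HMMs with calibrated score cutoffs; the sketch below, with invented names, just scores a query against each domain model using the forward() function from earlier and reports those above a threshold):

```python
import math

def scan_domains(query_obs, domain_hmms, log_cutoff):
    """Report domains whose model gives the query a log-probability above `log_cutoff`.

    query_obs   : the query encoded as observation-symbol indices
    domain_hmms : dict mapping domain name -> (pi, A, B), as used by forward()
    """
    hits = []
    for name, (pi, A, B) in domain_hmms.items():
        _, prob = forward(query_obs, pi, A, B)      # P(query | domain model)
        score = math.log(prob) if prob > 0 else float("-inf")
        if score >= log_cutoff:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: -hit[1])    # best-scoring domains first
```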
Gene Finding
• Currently, a popular and successful gene finder for human DNA sequences is GENSCAN (Burge et al. 1997).
• It is based on a generalization of HMMs, called
Semihidden Markov Models.
• The algorithms involved in this model are an
order of magnitude more complex than for a
regular HMM.
• The gene-finding application requires a
generalization of the Viterbi algorithm.
• Burge (1997) observed that if the lengths of the long intergenic regions can be taken as having geometric distributions, and if these lengths generate sequences in a relatively i.i.d. fashion, then the algorithm can be adjusted so that practical running times can be obtained.
Final Remarks
• HMMs have been used to model alignments of
three-dimensional structure in proteins (Stultz et
al. 1993; Hubbard and Park 1995; Di Francesco
et al. 1997, 1999; FORREST Web server at
http://absapha.dcrt.nih.gov:8008/)
• In one example of this approach, the models are
trained on patterns of helices, strands, tight
turns, and loops in specific structural classes,
which then may be used to provide the most
probable structure and structural class of a
protein.
Well… those weren’t the final
remarks
• A version of GeneMark (Borodovsky and
McIninch 1993) called GeneMark.HMM uses a
particular type of HMM (called a fifth-order
Markov Model) to search for E. coli genes
(Lukashin and Borodovsky 1998).
• The success of the HMM method depends on
having appropriate initial or prior conditions, i.e.,
a good prior model for the sequences and a
sufficient number of sequences to train the
model.
Finally
• Another consideration in using HMMs is the
number of sequences. If a good prior model is
used, it should be possible to train the HMM with
as few as 20 sequences.
• In general, the smaller the sequence number, the
more important the prior conditions.
• HMMs are more effective if methods to inject
statistical noise into the model are used during
the training procedure.