Transcript bioinfo5a

Hidden Markov Models
Fundamentals and applications to bioinformatics.
Markov Chains
• Given a finite discrete set S of possible states, a Markov chain process occupies one of these states at each unit of time.
• The process either stays in the same state or moves to some other state in S.
• This occurs in a stochastic way, rather than in a deterministic one.
• The process is memoryless and time-homogeneous: the next state depends only on the current state, and the transition probabilities do not change over time.
Transition Matrix
• Let S = {S1, S2, S3}. A Markov chain is described by a table of transition probabilities such as the following:
        S1    S2    S3
  S1     0     1     0
  S2    1/3   2/3    0
  S3    1/2   1/3   1/6
(Rows give the current state, columns the next state; each row sums to 1.)
(State transition diagram: nodes S1, S2, S3 joined by arrows labeled with the probabilities from the table above.)
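To make the table concrete, here is a minimal Python/NumPy sketch (not part of the original lecture; the names states, A, and step are illustrative choices) that stores this transition matrix and simulates a short run of the chain.

import numpy as np

# Transition matrix from the table above; rows are the current state,
# columns are the next state, and each row sums to 1.
states = ["S1", "S2", "S3"]
A = np.array([
    [0.0, 1.0, 0.0],    # from S1
    [1/3, 2/3, 0.0],    # from S2
    [1/2, 1/3, 1/6],    # from S3
])

rng = np.random.default_rng(0)

def step(i):
    # Sample the index of the next state given the current state index i.
    return rng.choice(len(states), p=A[i])

# Simulate ten transitions starting from S3.
i = 2
path = [states[i]]
for _ in range(10):
    i = step(i)
    path.append(states[i])
print(" -> ".join(path))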
A simple example
• Consider a 3-state Markov model of the weather. We assume that once a day the weather is observed as being one of the following: rainy or snowy (state S1), cloudy (state S2), sunny (state S3).
• We postulate that on day t the weather is characterized by exactly one of the three states above, and give ourselves a transition probability matrix A given by:

        | 0.4  0.3  0.3 |
    A = | 0.2  0.6  0.2 |
        | 0.1  0.1  0.8 |
• Given that the weather on day 1 is sunny, what is the probability that the weather for the next 7 days will be "sun-sun-rain-rain-sun-cloudy-sun"?

    P(S3, S3, S3, S1, S1, S3, S2, S3 | Model)
      = P[S3] · P[S3 | S3] · P[S3 | S3] · P[S1 | S3] · P[S1 | S1] · P[S3 | S1] · P[S2 | S3] · P[S3 | S2]
      = 1 · (0.8)² · (0.1) · (0.4) · (0.3) · (0.1) · (0.2)
      = 1.536 × 10⁻⁴
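As a quick check of this arithmetic, the following Python sketch (illustrative only, not from the slides) multiplies the transition probabilities along the observed path; the array A and the index convention 0 = rain, 1 = cloudy, 2 = sun are assumed mappings of S1, S2, S3.

import numpy as np

# Weather model from the slide above.
A = np.array([
    [0.4, 0.3, 0.3],
    [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.8],
])

# Day 1 is sunny, then sun-sun-rain-rain-sun-cloudy-sun.
path = [2, 2, 2, 0, 0, 2, 1, 2]

prob = 1.0  # P[S3] = 1, since day 1 is known to be sunny
for prev, nxt in zip(path, path[1:]):
    prob *= A[prev, nxt]
print(prob)  # 1.536e-04, matching the hand calculation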
• Given that the model is in a known state, what is the probability that it stays in that state for exactly d days?
• The answer is
    p_i(d) = (a_ii)^(d−1) · (1 − a_ii)
• Thus the expected number of consecutive days in the same state is
    d̄_i = 1 / (1 − a_ii)
• So the expected number of consecutive sunny days, according to the model, is 1 / (1 − 0.8) = 5.
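A short numerical check of this formula, written as an illustrative Python sketch (the variable names are mine):

# Expected run length in state i is 1 / (1 - a_ii); for the sunny state
# a_33 = 0.8, so the model predicts 1 / 0.2 = 5 consecutive sunny days.
a_ii = 0.8
print(1 / (1 - a_ii))  # 5.0

# Numerical check against the duration distribution p_i(d) = a_ii**(d-1) * (1 - a_ii):
print(sum(d * a_ii ** (d - 1) * (1 - a_ii) for d in range(1, 5000)))  # ~5.0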
Elements of an HMM
• What if each state does not correspond to an observable (physical) event? What if the observation is a probabilistic function of the state?
• An HMM is characterized by the following:
1) N, the number of states in the model.
2) M, the number of distinct observation symbols per state.
3) The state transition probability distribution A = {a_ij}, where a_ij = P[q_j | q_i] is the probability of moving from state S_i to state S_j.
4) The observation symbol probability distribution B = {b_j(k)}, where b_j(k) is the probability that the k-th observation symbol is emitted at time t, given that the model is in state S_j.
5) The initial state distribution π = {p_i}, where p_i is the probability that the model starts in state S_i.
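The five elements above can be packaged as a small container; the following Python sketch is an illustrative choice (the class name HMM and the field names are mine, not from the lecture).

from dataclasses import dataclass
import numpy as np

@dataclass
class HMM:
    """A, B, and pi determine N and M implicitly."""
    A: np.ndarray   # N x N, a_ij = P[next state S_j | current state S_i]
    B: np.ndarray   # N x M, b_j(k) = P[observe symbol k | state S_j]
    pi: np.ndarray  # length N, p_i = P[initial state is S_i]

    @property
    def N(self):
        return self.A.shape[0]

    @property
    def M(self):
        return self.B.shape[1]

# Toy 2-state, 3-symbol model.
model = HMM(A=np.array([[0.7, 0.3], [0.4, 0.6]]),
            B=np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]),
            pi=np.array([0.6, 0.4]))
print(model.N, model.M)  # 2 3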
Three Basic Problems for HMMs
1) Given the observation sequence O = O_1 O_2 … O_T and a model λ = (A, B, π), how do we efficiently compute P(O | λ)?
2) Given the observation sequence O and a model λ, how do we choose a corresponding state sequence Q = q_1 q_2 … q_T which is optimal in some meaningful sense?
3) How do we adjust the model parameters to maximize P(O | λ)?
Solution to Problem (1)
• Given an observed output sequence O, we have that
    P(O) = Σ_Q P[O | Q] · P[Q]
  where the sum runs over all possible state sequences Q.
• This calculation involves a sum over the N^T possible state sequences, each term being a product of about 2T factors. The total number of operations is therefore on the order of 2T · N^T.
• Fortunately, there is a much more efficient algorithm, called the forward algorithm.
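For comparison, here is a minimal Python sketch of the brute-force sum above (illustrative; the function name brute_force_likelihood and argument names are mine). It is exponential in T and is shown only to mirror the formula before the forward algorithm replaces it.

import itertools
import numpy as np

def brute_force_likelihood(O, A, B, pi):
    """Sum P[O | Q] * P[Q] over all N**T state sequences Q
    (O is a list of observation symbol indices)."""
    N, T = A.shape[0], len(O)
    total = 0.0
    for Q in itertools.product(range(N), repeat=T):
        p = pi[Q[0]] * B[Q[0], O[0]]
        for t in range(1, T):
            p *= A[Q[t - 1], Q[t]] * B[Q[t], O[t]]
        total += p
    return total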
The Forward Algorithm
• It focuses on the calculation of the quantity
    α(t, i) = P[O_1, …, O_t, q_t = S_i]
  which is the joint probability that the sequence of observations seen up to and including time t is O_1, …, O_t, and that the state of the HMM at time t is S_i. Once these quantities are known,
    P(O) = Σ_{i=1}^N α(T, i)
…continuation
• The calculation of the α(t, i)'s is by induction on t, starting from
    α(1, i) = p_i · b_i(O_1)
• From the formula
    α(t+1, i) = Σ_{j=1}^N P[O_1 … O_{t+1}, q_{t+1} = S_i and q_t = S_j]
  we get
    α(t+1, i) = [ Σ_{j=1}^N α(t, j) · a_ji ] · b_i(O_{t+1})
Backward Algorithm
• Another approach is the backward algorithm.
• Specifically, we calculate β(t, i) by the formula
    β(t, i) = P[O_{t+1}, …, O_T | q_t = S_i]
• Again, by induction one can find the β(t, i)'s, starting with the value t = T − 1, then the value t = T − 2, and so on, eventually working back to t = 1.
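The slides give the definition of β but not the induction step; the sketch below assumes the standard recursion β(T, i) = 1 and β(t, i) = Σ_j a_ij · b_j(O_{t+1}) · β(t+1, j), and the function name backward is mine.

import numpy as np

def backward(O, A, B):
    """beta[t, i] = P[O_(t+2) .. O_T | q_(t+1) = S_i] with 0-based t."""
    N, T = A.shape[0], len(O)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                                 # base case at t = T
    for t in range(T - 2, -1, -1):
        # beta(t, i) = sum_j a_ij b_j(O_(t+1)) beta(t+1, j)
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return beta

# The same likelihood P(O) can be recovered as sum_i p_i * b_i(O_1) * beta(1, i).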
Solution to Problem (2)
• Given an observed sequence O = O_1, …, O_T of outputs, we want to compute efficiently a state sequence Q = q_1, …, q_T that has the highest conditional probability given O.
• In other words, we want to find a Q that makes P[Q | O] maximal.
• There may be many Q's that make P[Q | O] maximal. We give an algorithm to find one of them.
The Viterbi Algorithm
• It is divided in two steps. First it finds max_Q P[Q | O], and then it backtracks to find a Q that realizes this maximum.
• First define, for arbitrary t and i, δ(t, i) to be the maximum probability over all ways to end in state S_i at time t having observed the sequence O_1 O_2 … O_t.
• Then max_Q P[Q and O] = max_i δ(T, i).
• But
    max_Q P[Q | O] = max_Q ( P[Q and O] / P[O] )
• Since the denominator on the RHS does not depend on Q, we have
    argmax_Q P[Q | O] = argmax_Q P[Q and O]
• We calculate the δ(t, i)'s inductively:
    δ(1, i) = p_i · b_i(O_1)
    δ(t, j) = max_{1≤i≤N} [ δ(t−1, i) · a_ij ] · b_j(O_t)
• Finally, we recover the q_t's as follows. Define
    Ψ(T) = argmax_{1≤i≤N} δ(T, i)
  and put
    q_T = S_{Ψ(T)}
• This is the last state in the desired state sequence. The remaining q_t for t < T are found recursively by defining
    Ψ(t) = argmax_{1≤i≤N} δ(t, i) · a_{i, Ψ(t+1)}
  and putting
    q_t = S_{Ψ(t)}
Solution to Problem (3)
• We are given a set of observed data from an HMM for which the topology is known. We wish to estimate the parameters of that HMM.
• We briefly describe the intuition behind the Baum-Welch method of parameter estimation.
• Assume that the alphabet of M observation symbols and the number of states N are fixed at the outset.
• The data we use to estimate the parameters constitute a set of observed sequences {O^(d)}.
The Baum-Welch Algorithm
• We start by setting the parameters p_i, a_ij, b_i(k) at some initial values.
• We then calculate, using these initial parameter values:
1) p_i* = the expected proportion of times in state S_i at the first time point, given {O^(d)}.
2) a_ij* = E[N_ij | {O^(d)}] / E[N_i | {O^(d)}]
3) b_i*(k) = E[N_i(k) | {O^(d)}] / E[N_i | {O^(d)}]
where N_ij is the random number of times q_t^(d) = S_i and q_{t+1}^(d) = S_j for some d and t; N_i is the random number of times q_t^(d) = S_i for some d and t; and N_i(k) is the random number of times q_t^(d) = S_i and the model emits symbol k, for some d and t.
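The slides state the re-estimation formulas in terms of expected counts; the sketch below is one illustrative way to compute them, using the posterior probabilities obtained from the forward and backward recursions sketched earlier. The function name baum_welch_step and all variable names are mine, not the lecturer's, and this shows a single re-estimation pass rather than a full training loop.

import numpy as np

def baum_welch_step(sequences, A, B, pi):
    """One re-estimation pass: accumulate expected counts of initial states,
    transitions, and emissions over all observed sequences, then normalize."""
    N, M = B.shape
    A_num = np.zeros((N, N)); A_den = np.zeros(N)
    B_num = np.zeros((N, M)); B_den = np.zeros(N)
    pi_new = np.zeros(N)
    for O in sequences:
        T = len(O)
        # Forward and backward passes (same recursions as sketched earlier).
        alpha = np.zeros((T, N)); beta = np.zeros((T, N))
        alpha[0] = pi * B[:, O[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
        PO = alpha[-1].sum()
        gamma = alpha * beta / PO          # gamma[t, i] = P[q_t = S_i | O]
        pi_new += gamma[0]                 # contribution to expected times in S_i at t = 1
        for t in range(T - 1):
            # xi[i, j] = P[q_t = S_i, q_(t+1) = S_j | O]
            xi = alpha[t][:, None] * A * (B[:, O[t + 1]] * beta[t + 1]) / PO
            A_num += xi                    # contribution to E[N_ij]
            A_den += gamma[t]              # contribution to E[N_i] over t = 1 .. T-1
        for t in range(T):
            B_num[:, O[t]] += gamma[t]     # contribution to E[N_i(k)]
            B_den += gamma[t]              # contribution to E[N_i] over all t
    return pi_new / len(sequences), A_num / A_den[:, None], B_num / B_den[:, None]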
Upshot
• It can be shown that if λ = (p_i, a_ij, b_i(k)) is replaced by λ* = (p_i*, a_ij*, b_i*(k)), then P[{O^(d)} | λ*] ≥ P[{O^(d)} | λ], with equality holding if and only if λ* = λ.
• Thus successive iterations continually increase the probability of the data, given the model. Iterations continue until a local maximum of the probability is reached.