Transcript bioinfo5a
Hidden Markov Models
Fundamentals and applications to
bioinformatics.
Markov Chains
Given a finite discrete set S of possible states, a
Markov chain process occupies one of these states
at each unit of time.
The process either stays in the same state or moves
to some other state in S.
This occurs in a stochastic way, rather than in a
deterministic one.
The process is memoryless (the next state depends only on the current state) and time-homogeneous (the transition probabilities do not change over time).
Transition Matrix
Let S={S1, S2, S3}. A Markov Chain is
described by a table of transition probabilities
such as the following:
        S1    S2    S3
  S1     0     1     0
  S2    1/3   2/3    0
  S3    1/2   1/3   1/6

(Rows give the current state, columns the next state; each row sums to 1.)
[State-transition diagram of the chain, with arrows labeled by the probabilities in the table above.]
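As a quick sanity check, the transition matrix above can be written out and verified to be stochastic; this is a minimal sketch (the `step` helper is illustrative, not from the transcript):

```python
# Hypothetical sketch: the 3-state transition matrix from the table,
# checked for stochasticity, with one step of the chain computed.
from fractions import Fraction as F

# Rows/columns ordered S1, S2, S3, matching the table.
P = [
    [F(0), F(1), F(0)],           # from S1
    [F(1, 3), F(2, 3), F(0)],     # from S2
    [F(1, 2), F(1, 3), F(1, 6)],  # from S3
]

# Every row of a transition matrix must sum to 1.
for row in P:
    assert sum(row) == 1

def step(dist, P):
    """One step of the chain: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

# Starting in S3, the distribution after one step is row 3 of P.
dist = step([F(0), F(0), F(1)], P)
print([str(x) for x in dist])  # ['1/2', '1/3', '1/6']
```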
A simple example
Consider a 3-state Markov model of the weather. We assume that once a day the weather is observed as being one of the following: rain or snow (state S1), cloudy (state S2), or sunny (state S3).
We postulate that on day t the weather is characterized by a single one of the three states above, and give ourselves a transition probability matrix A given by:

        S1    S2    S3
  S1    0.4   0.3   0.3
  S2    0.2   0.6   0.2
  S3    0.1   0.1   0.8
Given that the weather on day 1 is sunny, what is the probability that the weather for the next 7 days will be "sun-sun-rain-rain-sun-cloudy-sun"?

This observation corresponds to the state sequence

  O = (S3, S3, S3, S1, S1, S3, S2, S3)

so

  P(O | Model) = P[S3] P[S3|S3] P[S3|S3] P[S1|S3] P[S1|S1] P[S3|S1] P[S2|S3] P[S3|S2]
               = 1 · (0.8)^2 · (0.1)(0.4)(0.3)(0.1)(0.2)
               = 1.536 × 10^-4
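The product above can be checked by multiplying the transition probabilities along the path; a minimal sketch:

```python
# Check of the computation above: multiply transition probabilities
# along the path S3 S3 S3 S1 S1 S3 S2 S3 (0-based indices: S1=0, S2=1, S3=2).
A = [
    [0.4, 0.3, 0.3],  # from S1 (rain/snow)
    [0.2, 0.6, 0.2],  # from S2 (cloudy)
    [0.1, 0.1, 0.8],  # from S3 (sunny)
]

path = [2, 2, 2, 0, 0, 2, 1, 2]  # S3 S3 S3 S1 S1 S3 S2 S3

p = 1.0  # P[q1 = S3] = 1, since day 1 is known to be sunny
for prev, cur in zip(path, path[1:]):
    p *= A[prev][cur]

print(p)  # ≈ 1.536e-04
```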
Given that the model is in a known state S_i, what is the probability it stays in that state for exactly d days? The answer is

  p_i(d) = (a_ii)^(d-1) (1 - a_ii)

Thus the expected number of consecutive days in the same state is

  d̄_i = 1 / (1 - a_ii)

So the expected number of consecutive sunny days, according to the model, is 1 / (1 - 0.8) = 5.
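The duration distribution above is geometric, which makes both formulas easy to check numerically; a small sketch:

```python
# Sketch of the duration formulas: p_i(d) = a_ii^(d-1) (1 - a_ii) is a
# geometric distribution with mean 1 / (1 - a_ii).
def expected_duration(a_ii):
    return 1.0 / (1.0 - a_ii)

# Sunny self-transition probability from the weather matrix is 0.8.
print(round(expected_duration(0.8), 6))  # 5.0 expected consecutive sunny days

# The geometric probabilities sum to 1, as they must:
total = sum((0.8 ** (d - 1)) * (1 - 0.8) for d in range(1, 200))
print(round(total, 6))  # ≈ 1.0
```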
Elements of an HMM
What if each state does not correspond to an observable
(physical) event? What if the observation is a probabilistic
function of the state?
An HMM is characterized by the following:

1) N, the number of states in the model.

2) M, the number of distinct observation symbols per state.

3) The state transition probability distribution A = {a_ij}, where

   a_ij = P[q_{t+1} = S_j | q_t = S_i]

4) The observation symbol probability distribution B = {b_i(k)}, where b_i(k) is the probability that the k-th observation symbol is emitted at time t, given that the model is in state S_i.

5) The initial state distribution π = {π_i}, where π_i = P[q_1 = S_i].
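The five elements can be bundled into a small container; this is a hypothetical sketch (the class and field names are mine, not from the transcript):

```python
# Hypothetical container for the elements of an HMM: A, B, pi,
# with N and M derived from the shapes of A and B.
from dataclasses import dataclass

@dataclass
class HMM:
    A: list   # N x N transition probabilities, A[i][j] = P[S_j | S_i]
    B: list   # N x M emission probabilities, B[i][k] = b_i(k)
    pi: list  # length-N initial state distribution

    @property
    def N(self):  # number of states
        return len(self.A)

    @property
    def M(self):  # number of distinct observation symbols
        return len(self.B[0])

# A toy 2-state, 2-symbol model, used for illustration only.
hmm = HMM(A=[[0.7, 0.3], [0.4, 0.6]],
          B=[[0.9, 0.1], [0.2, 0.8]],
          pi=[0.5, 0.5])
print(hmm.N, hmm.M)  # 2 2
```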
Three Basic Problems for HMMs
1) Given the observation sequence O = O1 O2 … OT and a model λ = (A, B, π), how do we efficiently compute P(O | λ)?

2) Given the observation sequence O and a model λ, how do we choose a corresponding state sequence Q = q1 q2 … qT which is optimal in some meaningful sense?

3) How do we adjust the model parameters λ to maximize P(O | λ)?
Solution to Problem (1)
Given an observed output sequence O, we have that

  P(O) = Σ_Q P[O | Q] P[Q]

where the sum runs over all possible state sequences Q. This sum has N^T terms, each a product of roughly 2T factors, so the total number of operations is on the order of 2T · N^T.
Fortunately, there is a much more efficient
algorithm, called the forward algorithm.
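For a toy model the exponential-cost sum can be written out directly; this naive version is exactly what the forward algorithm avoids (the 2-state model below is illustrative, not from the transcript):

```python
# Naive evaluation of P(O) = sum_Q P[O|Q] P[Q], enumerating all N^T
# state sequences of a toy 2-state, 2-symbol model.
from itertools import product

A  = [[0.7, 0.3], [0.4, 0.6]]   # transition probabilities
B  = [[0.9, 0.1], [0.2, 0.8]]   # emission probabilities b_i(k)
pi = [0.5, 0.5]                 # initial distribution
O  = [0, 1, 0]                  # observed symbol indices

N, T = len(A), len(O)
total = 0.0
for Q in product(range(N), repeat=T):   # all N^T state sequences
    p = pi[Q[0]] * B[Q[0]][O[0]]        # P[q1] * P[O1 | q1]
    for t in range(1, T):
        p *= A[Q[t-1]][Q[t]] * B[Q[t]][O[t]]
    total += p

print(round(total, 6))  # 0.099375
```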
The Forward Algorithm
It focuses on the calculation of the quantity

  α(t, i) = P[O1 … Ot, q_t = S_i]

which is the joint probability that the sequence of observations seen up to and including time t is O1,…,Ot, and that the state of the HMM at time t is S_i. Once these quantities are known,

  P(O) = Σ_{i=1}^{N} α(T, i)
The calculation of the α(t, i)'s is by induction on t. The base case is

  α(1, i) = π_i b_i(O1)

From the formula

  α(t+1, i) = Σ_{j=1}^{N} P[O1 … O_{t+1}, q_{t+1} = S_i and q_t = S_j]

we get

  α(t+1, i) = [ Σ_{j=1}^{N} α(t, j) a_ji ] b_i(O_{t+1})
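The recursion above can be sketched directly; the 2-state model below is a toy example, not from the transcript:

```python
# Sketch of the forward algorithm: alpha(t, i) = P[O_1..O_t, q_t = S_i].
def forward(A, B, pi, O):
    N, T = len(A), len(O)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):                   # base case: alpha(1, i)
        alpha[0][i] = pi[i] * B[i][O[0]]
    for t in range(1, T):                # induction on t
        for i in range(N):
            s = sum(alpha[t-1][j] * A[j][i] for j in range(N))
            alpha[t][i] = s * B[i][O[t]]
    return alpha

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
alpha = forward(A, B, pi, [0, 1, 0])
# P(O) = sum_i alpha(T, i):
print(round(sum(alpha[-1]), 6))  # 0.099375
```

Note the cost: O(N^2 T) operations instead of the 2T · N^T of the naive sum.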
Backward Algorithm
Another approach is the backward algorithm. Specifically, we calculate β(t, i) defined by

  β(t, i) = P[O_{t+1} … O_T | q_t = S_i]

Again, by induction one can find the β(t, i)'s starting with the value t = T – 1, then for the value t = T – 2, and so on, eventually working back to t = 1.
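The backward recursion can be sketched the same way (toy model again; the convention β(T, i) = 1 is the standard base case):

```python
# Sketch of the backward algorithm: beta(t, i) = P[O_{t+1}..O_T | q_t = S_i].
def backward(A, B, O):
    N, T = len(A), len(O)
    beta = [[1.0] * N for _ in range(T)]     # base case: beta(T, i) = 1
    for t in range(T - 2, -1, -1):           # t = T-1 down to 1
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j]
                             for j in range(N))
    return beta

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
O  = [0, 1, 0]
beta = backward(A, B, O)
# P(O) can be recovered as sum_i pi_i b_i(O_1) beta(1, i):
p = sum(pi[i] * B[i][O[0]] * beta[0][i] for i in range(2))
print(round(p, 6))  # 0.099375
```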
Solution to Problem (2)
Given an observed sequence O = O1,…,OT of
outputs, we want to compute efficiently a state
sequence Q = q1,…,qT that has the highest
conditional probability given O.
In other words, we want to find a Q that makes P[Q |
O] maximal.
There may be many Q’s that make P[Q | O] maximal.
We give an algorithm to find one of them.
The Viterbi Algorithm
It is divided in two steps. First it finds max_Q P[Q | O], and then it backtracks to find a Q that realizes this maximum.
First define, for arbitrary t and i, δ(t, i) to be the maximum probability over all ways to end in state S_i at time t having observed the sequence O1 O2 … Ot. Then

  max_Q P[Q and O] = max_i δ(T, i)
But

  max_Q P[Q | O] = max_Q P[Q and O] / P[O]

Since the denominator on the RHS does not depend on Q, we have

  argmax_Q P[Q | O] = argmax_Q P[Q and O]

We calculate the δ(t, i)'s inductively:

  δ(1, i) = π_i b_i(O1)

  δ(t, j) = [ max_{1≤i≤N} δ(t-1, i) a_ij ] b_j(O_t)
Finally, we recover the q_t's as follows. Define

  γ(T) = argmax_{1≤i≤N} δ(T, i)

and put q_T = S_{γ(T)}. This is the last state in the desired state sequence. The remaining q_t for t < T are found recursively by defining

  γ(t) = argmax_{1≤i≤N} δ(t, i) a_{i,γ(t+1)}

and putting q_t = S_{γ(t)}.
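The two steps, the δ recursion and the backtracking, can be sketched as follows (toy 2-state model; names illustrative):

```python
# Sketch of the Viterbi algorithm: delta recursion plus backtracking.
def viterbi(A, B, pi, O):
    N, T = len(A), len(O)
    delta = [[0.0] * N for _ in range(T)]
    back  = [[0] * N for _ in range(T)]      # argmax pointers for backtracking
    for i in range(N):
        delta[0][i] = pi[i] * B[i][O[0]]     # delta(1, i) = pi_i b_i(O_1)
    for t in range(1, T):
        for j in range(N):
            # delta(t, j) = [max_i delta(t-1, i) a_ij] b_j(O_t)
            best_i = max(range(N), key=lambda i: delta[t-1][i] * A[i][j])
            delta[t][j] = delta[t-1][best_i] * A[best_i][j] * B[j][O[t]]
            back[t][j] = best_i
    # Backtrack from the best final state.
    q = [0] * T
    q[-1] = max(range(N), key=lambda i: delta[-1][i])
    for t in range(T - 2, -1, -1):
        q[t] = back[t+1][q[t+1]]
    return q, max(delta[-1])

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
path, p = viterbi(A, B, pi, [0, 1, 0])
print(path, round(p, 6))  # [0, 1, 0] 0.03888
```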
Solution to Problem (3)
We are given a set of observed data from an HMM
for which the topology is known. We wish to estimate
the parameters in that HMM.
We briefly describe the intuition behind the Baum-Welch method of parameter estimation.
Assume that the observation alphabet size M and the number of states N are fixed at the outset.
The data we use to estimate the parameters
constitute a set of observed sequences {O(d)}.
The Baum-Welch Algorithm
We start by setting the parameters pi, aij,
bi(k) at some initial values.
We then calculate, using these initial
parameter values:
1) pi* = the expected proportion of times in
state Si at the first time point, given {O(d)}.
2) a*_ij = E[N_ij | {O^(d)}] / E[N_i | {O^(d)}]

3) b*_i(k) = E[N_i(k) | {O^(d)}] / E[N_i | {O^(d)}]

where N_ij is the random number of times q_t^(d) = S_i and q_{t+1}^(d) = S_j for some d and t; N_i is the random number of times q_t^(d) = S_i for some d and t; and N_i(k) equals the random number of times q_t^(d) = S_i and it emits symbol k, for some d and t.
Upshot
It can be shown that if λ = (π_i, a_ij, b_i(k)) is substituted by λ* = (π_i*, a_ij*, b_i*(k)), then P[{O^(d)} | λ*] ≥ P[{O^(d)} | λ], with equality holding if and only if λ* = λ.
Thus successive iterations continually
increase the probability of the data, given the
model. Iterations continue until a local
maximum of the probability is reached.
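One re-estimation step can be sketched for a single observed sequence (the general case averages the expected counts over all sequences O^(d)); the model below is a toy example and all names are illustrative:

```python
# Minimal sketch of one Baum-Welch re-estimation step, using the
# forward and backward tables to compute expected counts.

def forward(A, B, pi, O):
    N, T = len(A), len(O)
    a = [[0.0] * N for _ in range(T)]
    for i in range(N):
        a[0][i] = pi[i] * B[i][O[0]]
    for t in range(1, T):
        for i in range(N):
            a[t][i] = sum(a[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
    return a

def backward(A, B, O):
    N, T = len(A), len(O)
    b = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            b[t][i] = sum(A[i][j] * B[j][O[t+1]] * b[t+1][j] for j in range(N))
    return b

def baum_welch_step(A, B, pi, O):
    N, T = len(A), len(O)
    al, be = forward(A, B, pi, O), backward(A, B, O)
    PO = sum(al[-1])
    # gamma[t][i]: probability of being in S_i at time t, given O
    gamma = [[al[t][i] * be[t][i] / PO for i in range(N)] for t in range(T)]
    # xi[t][i][j]: probability of the transition S_i -> S_j at time t, given O
    xi = [[[al[t][i] * A[i][j] * B[j][O[t+1]] * be[t+1][j] / PO
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    pi2 = gamma[0][:]                                # pi_i*
    A2 = [[sum(xi[t][i][j] for t in range(T - 1)) /
           sum(gamma[t][i] for t in range(T - 1))    # E[N_ij] / E[N_i]
           for j in range(N)] for i in range(N)]
    M = len(B[0])
    B2 = [[sum(gamma[t][i] for t in range(T) if O[t] == k) /
           sum(gamma[t][i] for t in range(T))        # E[N_i(k)] / E[N_i]
           for k in range(M)] for i in range(N)]
    return A2, B2, pi2

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
O  = [0, 1, 0, 0, 1]
A2, B2, pi2 = baum_welch_step(A, B, pi, O)
# The likelihood never decreases after a re-estimation step:
p_old = sum(forward(A, B, pi, O)[-1])
p_new = sum(forward(A2, B2, pi2, O)[-1])
print(p_new >= p_old)  # True
```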