
Parameter Estimation For HMM
Background Readings: Chapter 3.3 in the book Biological Sequence Analysis, Durbin et al., 2001.
Reminder: Hidden Markov Model
[Figure: HMM chain with hidden states S1, S2, ..., SL-1, SL and emitted symbols x1, x2, ..., xL-1, xL]
Markov Chain transition probabilities: p(Si+1= t|Si = s) = ast
Emission probabilities: p(Xi = b| Si = s) = es(b)
p(s, x) = p(s_1,...,s_L; x_1,...,x_L) = \prod_{i=1}^{L} a_{s_{i-1} s_i}\, e_{s_i}(x_i)
Reminder: Most Probable state path
[Figure: HMM chain with hidden states S1, S2, ..., SL-1, SL and emitted symbols x1, x2, ..., xL-1, xL]
Given an output sequence x = (x1,…,xL),
A most probable path s*= (s*1,…,s*L) is one which
maximizes p(s|x).
s^* = (s_1^*,...,s_L^*) = \arg\max_{(s_1,...,s_L)} p(s_1,...,s_L | x_1,...,x_L)
Reminder: Viterbi’s algorithm for most
probable path
[Figure: HMM chain with states s1, s2, ..., si, ..., sL-1, sL, emissions X1, ..., XL, and the special initial state 0]
We add the special initial state 0.
Initialization: v0(0) = 1 , vk(0) = 0 for k > 0
For i=1 to L do for each state l :
vl(i) = el(xi) MAXk {vk(i-1)akl }
ptri(l)=argmaxk{vk(i-1)akl} [storing previous state for reconstructing
the path]
Termination: the probability of the most probable path is

p(s_1^*,...,s_L^*;\, x_1,...,x_L) = \max_k v_k(L)
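To make the recursion concrete, here is a minimal Python sketch of Viterbi's algorithm as described above. The dictionary-of-dictionaries representation of a and e, the state labels, and the use of log probabilities are illustrative choices of this sketch, not something the slides prescribe.

```python
import math

def _log(p):
    """log that tolerates zero probabilities."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(x, states, a, e):
    """Most probable state path for the observed sequence x.

    a[k][l]: transition probability from state k to state l; state 0 is the
             special initial state, so a[0][l] must be supplied.
    e[l][b]: emission probability of symbol b from state l.
    Works in log space to avoid numerical underflow.
    """
    L = len(x)
    v = [{l: float("-inf") for l in states} for _ in range(L + 1)]
    ptr = [dict() for _ in range(L + 1)]
    prev = {0: 0.0}                      # v_0(0) = 1, i.e. log 1 = 0
    for i in range(1, L + 1):
        for l in states:
            best_k = max(prev, key=lambda k: prev[k] + _log(a[k][l]))
            v[i][l] = prev[best_k] + _log(a[best_k][l]) + _log(e[l][x[i - 1]])
            ptr[i][l] = best_k           # remember where the maximum came from
        prev = v[i]
    # Termination: pick the best final state, then trace the pointers back.
    last = max(states, key=lambda l: v[L][l])
    path = [last]
    for i in range(L, 1, -1):
        path.append(ptr[i][path[-1]])
    path.reverse()
    return path, v[L][last]              # the path and its log probability
```

For the CpG example on the next slide, `states` would be the eight labels A+, ..., T- and `x` a DNA string.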
Predicting CpG islands via most probable path:
[Figure: CpG island model with the eight states A+, C+, G+, T+ and A-, C-, G-, T-]
Output symbols: A, C, G, T (4 letters).
Markov Chain states: 4 “-” states and 4 “+” states, two for each
letter (8 states).
The transition probabilities ast and the emission probabilities ek(b) will be discussed soon.
The most probable path found by Viterbi’s algorithm predicts CpG
islands.
The experiment in Durbin et al. (pp. 60-61) shows that the predicted islands are shorter than the assumed ones. In addition, quite a few “false negatives” are found.
Reminder: finding most probable state
[Figure: HMM chain with states s1, s2, ..., si, ..., sL-1, sL and emissions X1, X2, ..., Xi, ..., XL-1, XL]
fl(i) = p(x1,…,xi, si=l), the probability of a path which emits (x1,..,xi) and in which state si=l.
bl(i) = p(xi+1,…,xL | si=l), the probability of emitting (xi+1,..,xL) given that state si=l.
1. The forward algorithm finds {fk(i) = P(x1,…,xi, si=k): k = 1,...,m}.
2. The backward algorithm finds {bk(i) = P(xi+1,…,xL | si=k): k = 1,...,m}.
3. Return {p(si=k|x) = fk(i) bk(i) / p(x) : k=1,...,m}.
To compute this for every i, simply run the forward and backward algorithms once, and form {fk(i) bk(i)} for every i, k.
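A sketch of the forward and backward recursions and of the posterior p(s_i = k | x) they yield, following the outline above. The data structures mirror the Viterbi sketch; for long sequences a real implementation would rescale (or work in log space) to avoid underflow, which is omitted here.

```python
def forward(x, states, a, e):
    """f[i][k] = P(x_1..x_i, s_i = k); state 0 is the special initial state."""
    L = len(x)
    f = [dict() for _ in range(L + 1)]
    f[0] = {0: 1.0}
    for i in range(1, L + 1):
        for l in states:
            f[i][l] = e[l][x[i - 1]] * sum(f[i - 1][k] * a[k][l] for k in f[i - 1])
    return f

def backward(x, states, a, e):
    """b[i][k] = P(x_{i+1}..x_L | s_i = k)."""
    L = len(x)
    b = [dict() for _ in range(L + 1)]
    b[L] = {k: 1.0 for k in states}
    for i in range(L - 1, 0, -1):
        for k in states:
            b[i][k] = sum(a[k][l] * e[l][x[i]] * b[i + 1][l] for l in states)
    return b

def posterior(x, states, a, e):
    """post[i-1][k] = P(s_i = k | x) = f_k(i) b_k(i) / P(x)."""
    f, b = forward(x, states, a, e), backward(x, states, a, e)
    L = len(x)
    px = sum(f[L][k] for k in states)        # P(x), read off the forward values
    post = [{k: f[i][k] * b[i][k] / px for k in states} for i in range(1, L + 1)]
    return post, px
```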
Finding the probability that a letter is in a CpG
island via the algorithm for most probable state:
[Figure: CpG island model with the eight states A+, C+, G+, T+ and A-, C-, G-, T-]
The probability that an occurrence of G (at position i) is in a CpG island (“+” state) is:

\sum_{s^+} p(S_i = s^+ | x) = \frac{1}{p(x)} \sum_{s^+} f_{s^+}(i)\, b_{s^+}(i)

where the summation is formally over the 4 “+” states, but in fact only the state G+ needs to be considered (why?).
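Using the posterior sketch above, the probability that position i lies in a CpG island is just the posterior mass of the “+” states at that position; the state labels below are illustrative.

```python
PLUS_STATES = ["A+", "C+", "G+", "T+"]       # illustrative state labels

def p_in_island(post, i):
    """P(position i, 1-based, is in a CpG island) = sum of '+' posteriors.
    Only the '+' state matching the emitted letter actually contributes,
    since the other three cannot emit it."""
    return sum(post[i - 1].get(s, 0.0) for s in PLUS_STATES)
```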
Parameter Estimation for HMM
[Figure: HMM chain with states s1, ..., sL and emissions X1, ..., XL]
An HMM model is defined by the parameters: akl and ek(b), for all
states k,l and all symbols b.
Let θ denote the collection of these parameters.
[Figure: a transition from state k to state l with probability akl, and an emission of symbol b from state k with probability ek(b)]
Parameter Estimation for HMM
[Figure: HMM chain with states s1, ..., sL and emissions X1, ..., XL]
To determine the values of (the parameters in) θ, use a
training set = {x1,...,xn}, where each xj is a sequence which
is assumed to fit the model.
Given the parameters θ, each sequence xj has an assigned
probability p(xj|θ) (or p(xj| θ,HMM)).
ML Parameter Estimation for HMM
The elements of the training set {x1,...,xn} are assumed to be independent, so
p(x1,..., xn|θ) = ∏_j p(xj|θ).
ML parameter estimation looks for θ which
maximizes the above.
The exact method for finding or approximating
this θ depends on the nature of the training set
used.
Data for HMM
Possible properties of (the sequences in) the training set:
1. For each xj, what is our information on the states si? (The symbols xi are usually known.)
2. The size (number of sequences) of the training set.
[Figure: HMM chain with states S1, ..., SL and emissions x1, ..., xL]
Case 1: Sequences are fully known
[Figure: HMM chain with states s1, ..., sL and emissions X1, ..., XL]
We know the complete structure of each sequence in the
training set {x1,...,xn}. We wish to estimate akl and ek(b) for
all pairs of states k, l and symbols b.
By the ML method, we look for parameters θ* which
maximize the probability of the sample set:
p(x1,...,xn| θ*) =MAXθ p(x1,...,xn| θ).
Case 1: ML method
[Figure: HMM chain with states s1, ..., sL and emissions X1, ..., XL]
For each xj we have:
p(x^j | \theta) = \prod_{i=1}^{L_j} a_{s_{i-1} s_i}\, e_{s_i}(x_i)
Let m_{kl} = |\{i : s_{i-1}=k,\, s_i=l\}|  (in x^j), and  m_k(b) = |\{i : s_i=k,\, x_i=b\}|  (in x^j).
Then:

p(x^j | \theta) = \prod_{(k,l)} a_{kl}^{m_{kl}} \prod_{(k,b)} [e_k(b)]^{m_k(b)}
Case 1 (cont)
[Figure: HMM chain with states s1, ..., sL and emissions X1, ..., XL]
By the independence of the xj’s, p(x1,...,xn| θ) = ∏_j p(xj|θ).
Thus, if Akl = #(transitions from k to l) in the training set,
and Ek(b) = #(emissions of symbol b from state k) in the
training set, we have:
p(x^1,...,x^n | \theta) = \prod_{(k,l)} a_{kl}^{A_{kl}} \prod_{(k,b)} [e_k(b)]^{E_k(b)}
Case 1 (cont)
So we need to find akl’s and ek(b)’s which maximize:
\prod_{(k,l)} a_{kl}^{A_{kl}} \prod_{(k,b)} [e_k(b)]^{E_k(b)}

Subject to: for all states k,  \sum_l a_{kl} = 1  and  \sum_b e_k(b) = 1  [with a_{kl}, e_k(b) \ge 0].
Case 1 (cont)
Rewriting, we need to maximize:
F = \prod_{(k,l)} a_{kl}^{A_{kl}} \prod_{(k,b)} [e_k(b)]^{E_k(b)}
  = \prod_k \left[ \prod_l a_{kl}^{A_{kl}} \right] \left[ \prod_b [e_k(b)]^{E_k(b)} \right]

Subject to: for all k,  \sum_l a_{kl} = 1  and  \sum_b e_k(b) = 1.
Case 1 (cont)
If we maximize, for each k,

\prod_l a_{kl}^{A_{kl}}  subject to  \sum_l a_{kl} = 1,

and also

\prod_b [e_k(b)]^{E_k(b)}  subject to  \sum_b e_k(b) = 1,

then we also maximize F.
Each of the above is a simpler ML problem, which is treated next.
A simpler case:
ML parameter estimation for a die
Let X be a random variable with 6 values x1,…,x6 denoting the
six outcomes of a die. Here the parameters are
θ = {θ1, θ2, θ3, θ4, θ5, θ6}, with ∑_i θi = 1.
Assume that the data is one sequence:
Data = (x6,x1,x1,x3,x2,x2,x3,x4,x5,x2,x6)
So we have to maximize

P(Data | \theta) = \theta_1^{2}\, \theta_2^{3}\, \theta_3^{2}\, \theta_4\, \theta_5\, \theta_6^{2}

Subject to: \theta_1 + \theta_2 + \theta_3 + \theta_4 + \theta_5 + \theta_6 = 1  [and \theta_i \ge 0],

i.e.,  P(Data | \theta) = \theta_1^{2}\, \theta_2^{3}\, \theta_3^{2}\, \theta_4\, \theta_5 \left(1 - \sum_{i=1}^{5} \theta_i\right)^{2}
Side comment: Sufficient Statistics
To compute the probability of the data in the die example we only need to record the number Ni of times the die fell on side i (namely, N1, N2,…,N6). We do not need to recall the entire sequence of outcomes:
P(Data | \theta) = \theta_1^{N_1}\, \theta_2^{N_2}\, \theta_3^{N_3}\, \theta_4^{N_4}\, \theta_5^{N_5} \left(1 - \sum_{i=1}^{5} \theta_i\right)^{N_6}
{Ni | i = 1,…,6} is called a sufficient statistic for multinomial sampling.
Sufficient Statistics
A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
Formally, s(Data) is a sufficient statistic if for any two datasets Data and Data':

s(Data) = s(Data')  ⟹  P(Data|θ) = P(Data'|θ)
Exercise: Define a “sufficient statistic” for the HMM model.
Maximum Likelihood Estimate
By the ML approach, we look for parameters that maximize the probability of the data (i.e., the likelihood function).
Usually one maximizes the log-likelihood function, which is easier to do and gives an identical answer:
\log P(Data | \theta) = \log\left[ \theta_1^{N_1} \theta_2^{N_2} \theta_3^{N_3} \theta_4^{N_4} \theta_5^{N_5} \left(1 - \sum_{i=1}^{5} \theta_i\right)^{N_6} \right]
= \sum_{i=1}^{5} N_i \log \theta_i + N_6 \log\left(1 - \sum_{i=1}^{5} \theta_i\right)

A necessary condition for a maximum is:

\frac{\partial \log P(Data | \theta)}{\partial \theta_j} = \frac{N_j}{\theta_j} - \frac{N_6}{1 - \sum_{i=1}^{5} \theta_i} = 0
Finding the Maximum
Rearranging terms:

\frac{N_j}{\theta_j} = \frac{N_6}{1 - \sum_{i=1}^{5} \theta_i} = \frac{N_6}{\theta_6}

Divide the jth equation by the ith equation:

\frac{\theta_j}{\theta_i} = \frac{N_j}{N_i}, \quad i.e. \quad \theta_j = \theta_i \frac{N_j}{N_i}

Sum from j = 1 to 6:

1 = \sum_{j=1}^{6} \theta_j = \frac{\theta_i}{N_i} \sum_{j=1}^{6} N_j = \theta_i \frac{N}{N_i}, \quad where N = \sum_{j=1}^{6} N_j

Hence the MLE is given by:

\theta_i = \frac{N_i}{N}, \quad i = 1,...,6
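As a quick numerical check of θi = Ni/N, here is the die example from the earlier slide worked out in Python (a worked illustration, not part of the slides).

```python
from collections import Counter

data = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]   # the Data sequence from the die slide
N = Counter(data)                           # sufficient statistics N_1,...,N_6
n = len(data)                               # n = 11
theta = {i: N[i] / n for i in range(1, 7)}
# theta = {1: 2/11, 2: 3/11, 3: 2/11, 4: 1/11, 5: 1/11, 6: 2/11}
```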
Generalization to a distribution with
any number k of outcomes
Let X be a random variable with k values x1,…,xk denoting the k
outcomes of iid experiments, with parameters
θ = {θ1, θ2,..., θk} (θi is the probability of xi).
Again, the data is one sequence of length n:
Data = (xi1,xi2,...,xin)
Then we have to maximize

P(Data | \theta) = \theta_1^{n_1}\, \theta_2^{n_2} \cdots \theta_k^{n_k}, \quad (n_1 + ... + n_k = n)

Subject to: \theta_1 + \theta_2 + ... + \theta_k = 1,

i.e.,  P(Data | \theta) = \theta_1^{n_1} \cdots \theta_{k-1}^{n_{k-1}} \left(1 - \sum_{i=1}^{k-1} \theta_i\right)^{n_k}
Generalization for k outcomes (cont)
By treatment identical to the die case, the maximum is obtained
when for all i:
\frac{n_i}{\theta_i} = \frac{n_k}{\theta_k}

Hence the MLE is given by:

\theta_i = \frac{n_i}{n}, \quad i = 1,...,k
Fractional Exponents
Some models allow ni’s which are not integers (e.g.,
when we are uncertain of a die outcome, and consider it
“6” with 20% confidence and “5” with 80% confidence):
We can still write

P(Data | \theta) = \theta_1^{n_1}\, \theta_2^{n_2} \cdots \theta_k^{n_k}, \quad (n_1 + ... + n_k = n)

and the same analysis yields:

\theta_i = \frac{n_i}{n}, \quad i = 1,...,k
Apply the ML method to HMM
[Figure: HMM chain with states s1, ..., sL and emissions X1, ..., XL]
Let Akl = #(transitions from k to l) in the training set.
Ek(b) = #(emissions of symbol b from state k) in the
training set. We need to:
Maximize

\prod_{(k,l)} a_{kl}^{A_{kl}} \prod_{(k,b)} [e_k(b)]^{E_k(b)}

Subject to: for all states k,  \sum_l a_{kl} = 1  and  \sum_b e_k(b) = 1.
Apply to HMM (cont.)
[Figure: HMM chain with states s1, ..., sL and emissions X1, ..., XL]
We apply the previous technique to get, for each k, the parameters {a_kl | l = 1,..,m} and {e_k(b) | b \in \Sigma}:

a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}

which gives the optimal ML parameters.
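A sketch of Case 1 estimation in Python: count the transitions A_kl and emissions E_k(b) over fully labelled training sequences and normalize per state. The input format (a list of (state path, symbol sequence) pairs) is an assumption of this sketch.

```python
from collections import defaultdict

def ml_estimate(training_set):
    """training_set: list of (s, x) pairs, where s is the known state path
    and x the emitted sequence of the same length (Case 1)."""
    A = defaultdict(float)                    # A[(k, l)] = #(transitions k -> l)
    E = defaultdict(float)                    # E[(k, b)] = #(emissions of b from k)
    for s, x in training_set:
        for i in range(1, len(s)):
            A[(s[i - 1], s[i])] += 1
        for k, b in zip(s, x):
            E[(k, b)] += 1
    # Normalize per state: a_kl = A_kl / sum_l' A_kl',  e_k(b) = E_k(b) / sum_b' E_k(b')
    a_tot, e_tot = defaultdict(float), defaultdict(float)
    for (k, l), c in A.items():
        a_tot[k] += c
    for (k, b), c in E.items():
        e_tot[k] += c
    a = {(k, l): c / a_tot[k] for (k, l), c in A.items()}
    e = {(k, b): c / e_tot[k] for (k, b), c in E.items()}
    return a, e
```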
Adding pseudo counts in HMM
[Figure: HMM chain with states s1, ..., sL and emissions X1, ..., XL]
If the sample set is too small, we may get a biased result.
In this case we modify the actual count by our prior
knowledge/belief:
rkl reflects our prior belief on transitions from k to l.
rk(b) reflects our prior belief on emissions of b from state k.
Then

a_{kl} = \frac{A_{kl} + r_{kl}}{\sum_{l'} (A_{kl'} + r_{kl'})}, \qquad e_k(b) = \frac{E_k(b) + r_k(b)}{\sum_{b'} (E_k(b') + r_k(b'))}
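A small variant of the previous sketch: add the pseudocounts r_kl and r_k(b) to the raw counts before the per-state normalization (the helper name and dictionary format are my own).

```python
def smoothed_counts(counts, pseudo):
    """counts, pseudo: dicts keyed by (k, l) for transitions or (k, b) for
    emissions; returns A_kl + r_kl (resp. E_k(b) + r_k(b)).
    Keys present only in `pseudo` still receive their prior mass."""
    keys = set(counts) | set(pseudo)
    return {key: counts.get(key, 0.0) + pseudo.get(key, 0.0) for key in keys}

# Feeding the smoothed counts through the same per-state normalization as in
# ml_estimate() gives a_kl = (A_kl + r_kl) / sum_l' (A_kl' + r_kl'), etc.
```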
Summary of Case 1:
Sequences are fully known
[Figure: HMM chain with states s1, ..., sL and emissions X1, ..., XL]
We know the complete structure of each sequence in the
training set {x1,...,xn}. We wish to estimate akl and ek(b) for
all pairs of states k, l and symbols b.
We just showed a method which finds the (unique)
parameters θ* that maximize
p(x1,...,xn| θ*) = MAXθ p(x1,...,xn| θ).
Case 2: State paths are unknown:
[Figure: HMM chain with states s1, ..., sL and emissions X1, ..., XL]
In this case only the values of the xi’s of the input
sequences are known.
This is an ML problem with “missing data”.
We wish to find θ* so that p(x|θ*)=MAXθ{p(x|θ)}.
For each sequence x,
p(x|θ)=∑s p(x,s|θ),
taken over all state paths s.
Case 2: State paths are unknown
[Figure: HMM chain with states s1, ..., sL and emissions X1, ..., XL]
So we need to maximize p(x|θ) = ∑s p(x,s|θ), where
the summation is over all state paths s which
produce the output sequence x.
Finding θ* which maximizes ∑s p(x,s|θ) is hard.
[Unlike finding θ* which maximizes p(x,s|θ) for a
single sequence (x,s).]
ML Parameter Estimation for HMM
The general process for finding θ in this case is
1. Start with an initial value of θ.
2. Find θ’ so that p(x1,..., xn|θ’) > p(x1,..., xn|θ)
3. set θ = θ’.
4. Repeat until some convergence criterion is met.
A general algorithm of this type is the Expectation
Maximization (EM) algorithm, which we will meet later.
For the specific case of HMM, it is the Baum-Welch training.
Baum-Welch training
[Figure: HMM chain with states s1, ..., sL and emissions X1, ..., XL]
We start with some values of akl and ek(b), which define
prior values of θ.
Baum-Welch training is an iterative algorithm which
attempts to replace θ by a θ* s.t.
p(x|θ*) > p(x|θ)
Each iteration consists of a few steps:
Baum-Welch: step 1
[Figure: HMM chain highlighting consecutive states si-1, si and their emissions Xi-1, Xi]
Count the expected number of state transitions: for each
sequence xj, for each i and for each pair k,l, compute the
posterior state transition probabilities:

P(si-1=k, si=l | xj, θ)
Baum-Welch training
[Figure: HMM chain highlighting consecutive states si-1, si and their emissions Xi-1, Xi]
Claim:

p(s_{i-1}=k, s_i=l | x^j, \theta) = \frac{f_k^j(i-1)\, a_{kl}\, e_l(x_i)\, b_l^j(i)}{p(x^j | \theta)}

where f_k^j and b_l^j are computed by the forward and backward algorithms for x^j under θ.
Step 1: Computing P(si-1=k, si=l | xj,θ)
[Figure: HMM chain with states s1, ..., si-1, si, ..., sL and emissions X1, ..., Xi-1, Xi, ..., XL]
P(x_1,...,x_L, s_{i-1}=k, s_i=l) = P(x_1,...,x_{i-1}, s_{i-1}=k) \cdot a_{kl}\, e_l(x_i) \cdot P(x_{i+1},...,x_L | s_i=l)
= f_k(i-1) \cdot a_{kl}\, e_l(x_i) \cdot b_l(i)

where the first factor is computed by the forward algorithm and the last by the backward algorithm. Hence

p(s_{i-1}=k, s_i=l | x^j) = \frac{f_k(i-1)\, a_{kl}\, e_l(x_i)\, b_l(i)}{\sum_{k'} \sum_{l'} f_{k'}(i-1)\, a_{k'l'}\, e_{l'}(x_i)\, b_{l'}(i)}
Step 1 (end)
For each pair (k,l), compute the expected number of state transitions from k to l:

A_{kl} = \sum_{j=1}^{n} \frac{1}{p(x^j)} \sum_{i=1}^{L} p(s_{i-1}=k, s_i=l, x^j | \theta)
= \sum_{j=1}^{n} \frac{1}{p(x^j)} \sum_{i=1}^{L} f_k^j(i-1)\, a_{kl}\, e_l(x_i)\, b_l^j(i)
Baum-Welch: Step 2
For each state k and each symbol b, compute the expected number of emissions of b from k:

E_k(b) = \sum_{j=1}^{n} \frac{1}{p(x^j)} \sum_{i:\, x_i^j = b} f_k^j(i)\, b_k^j(i)
Baum-Welch: step 3
Use the Akl’s, Ek(b)’s to compute the new values of akl
and ek(b). These values define θ*.
a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}
It can be shown that

p(x1,..., xn|θ*) > p(x1,..., xn|θ),

i.e., θ* increases the probability of the data.
This procedure is iterated until some convergence criterion is met.
41
Case 2: State paths are unknown:
Viterbi training
[Figure: HMM chain with states s1, ..., sL and emissions X1, ..., XL]
We also start from given values of akl and ek(b), which
define prior values of θ.
Viterbi training attempts to maximize the probability of a
most probable path; i.e., maximize
p((s(x1),..,s(xn)) |θ, x1,..,xn )
Where s(xj) is the most probable (under θ) path for xj.
Case 2: State paths are unknown:
Viterbi training
[Figure: HMM chain with states s1, ..., sL and emissions X1, ..., XL]
Each iteration:
1. Find a set {s(xj)} of most probable paths, which
maximize
p(s(x1),..,s(xn) |θ, x1,..,xn )
2. Find θ*, which maximizes
p(s(x1),..,s(xn) | θ*, x1,..,xn )
Note: In 1. the maximizing arguments are the paths, in 2. it is θ*.
3. Set θ = θ*, and repeat. Stop when the paths no longer change.
Case 2: State paths are unknown:
Viterbi training
[Figure: HMM chain with states s1, ..., sL and emissions X1, ..., XL]
p(s(x1),..,s(xn) | θ*, x1,..,xn) can be expressed in closed
form (since we are using a single path for each xj), so this
time convergence is achieved when the optimal paths no longer change.
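A sketch of the Viterbi-training loop, combining the Viterbi and Case 1 sketches from above; as the slide says, convergence is declared when the decoded paths stop changing. In practice zero counts would be smoothed with the pseudocounts of the earlier slide.

```python
def viterbi_training(xs, states, a, e, max_iter=100):
    """xs: list of observed sequences.  Iterate: decode each sequence with
    Viterbi, re-estimate a and e by Case-1 counting on the decoded paths,
    stop when the paths no longer change."""
    alphabet = set().union(*xs)                 # symbols seen in the data
    a0 = a[0]                                   # initial-state transitions, kept fixed here
    paths = None
    for _ in range(max_iter):
        new_paths = [viterbi(x, states, a, e)[0] for x in xs]
        if new_paths == paths:                  # optimal paths unchanged: converged
            break
        paths = new_paths
        a_flat, e_flat = ml_estimate(list(zip(paths, xs)))
        # Repack the flat (k, l) / (k, b) dictionaries into the nested form
        # used by viterbi(); unseen pairs get probability 0 in this sketch.
        a = {k: {l: a_flat.get((k, l), 0.0) for l in states} for k in states}
        a[0] = a0
        e = {k: {c: e_flat.get((k, c), 0.0) for c in alphabet} for k in states}
    return a, e, paths
```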