HMM II: Parameter Estimation

Reminder: Hidden Markov Model
[Figure: HMM chain S1 → S2 → … → SL-1 → SL with emitted symbols x1, x2, …, xL-1, xL.]
Markov Chain transition probabilities: p(Si+1= t|Si = s) = ast
Emission probabilities: p(Xi = b| Si = s) = es(b)
For each integer L > 0, this defines a probability space in which the simple events are HMM sequences of length L:
p(s, x) = p(s1, …, sL; x1, …, xL) = ∏_{i=1}^{L} a_{s_{i-1} s_i} · e_{s_i}(x_i)
Hidden Markov Model for CpG Islands
The states: Domain(Si) = {+, −} × {A, C, G, T} (8 values).
[Figure: the eight states A+, C+, G+, T+ and A−, C−, G−, T−, each emitting the corresponding letter A, C, G, T.]
In this representation P(xi| si) = 0 or 1 depending on whether xi is
consistent with si . E.g. xi= G is consistent with si=(+,G) and with
si=(-,G) but not with any other state of si.
Most Probable state path
Given an observation sequence x = (x1,…,xL), a most probable path s* = (s*1,…,s*L) is one which maximizes p(s|x):
s* = (s*1,…,s*L) = argmax_{(s1,…,sL)} p(s1,…,sL | x1,…,xL)
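Such a path can be found by dynamic programming (Viterbi's algorithm, used on the next slide). Below is a minimal, illustrative Python sketch in log space, assuming the same dictionary-based start/a/e parameters as in the earlier sketch; it is not the lecture's own code. Since p(x) does not depend on s, maximizing p(s, x | θ) also maximizes p(s | x).

import math

# Illustrative Viterbi sketch: argmax_s p(s, x | theta) over all state paths s.
def viterbi(x, start, a, e):
    states = list(start)
    # V[i][k] = best log-probability of any path ending in state k after emitting x[0..i]
    V = [{k: math.log(start[k]) + math.log(e[k][x[0]]) for k in states}]
    back = []                                   # back[i-1][l] = best predecessor of l at position i
    for i in range(1, len(x)):
        V.append({})
        back.append({})
        for l in states:
            best_k = max(states, key=lambda k: V[i - 1][k] + math.log(a[k][l]))
            back[-1][l] = best_k
            V[i][l] = V[i - 1][best_k] + math.log(a[best_k][l]) + math.log(e[l][x[i]])
    path = [max(states, key=lambda k: V[-1][k])]
    for ptr in reversed(back):                  # trace the backpointers from the end
        path.append(ptr[path[-1]])
    return list(reversed(path))

For the CpG model, the '+' / '−' labels along the returned path mark the predicted islands.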
Predicting CpG islands via most
probable path:
Output symbols: A, C, G, T (4 letters).
Markov Chain states: 4 “-” states and 4 “+” states, two for each
letter (8 states total).
The most probable path (found by Viterbi’s algorithm) predicts CpG
islands.
An experiment (Durbin et al., pp. 60-61) shows that the predicted islands are shorter than the assumed ones. In addition, quite a few "false negatives" are found.
Most probable state
Given an output sequence x = (x1,…,xL), si is a
most probable state (at location i) if:
si=argmaxk p(Si=k |x).
p(Si = si | x1,…,xL) = p(Si = si, x1,…,xL) / p(x1,…,xL)
Finding the probability that a letter
is in a CpG island via the algorithm
for most probable state:
The probability that the occurrence of G in the i-th location is in a
CpG island (+ state) is:
∑_{s+} p(Si = s+ | x) = ∑_{s+} F(Si = s+) B(Si = s+) / p(x)
where the summation is formally over the 4 "+" states, but in fact only the state G+ needs to be considered (why?).
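The forward values F and backward values B used above can be computed as follows. This is an illustrative, unscaled Python sketch assuming the dictionary-based parameters from the earlier sketches; real sequences need scaling or log arithmetic to avoid numerical underflow.

# Illustrative (unscaled) forward/backward sketch; p(S_i = k | x) = F_k(i) B_k(i) / p(x).
def forward(x, start, a, e):
    states = list(start)
    F = [{k: start[k] * e[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        F.append({l: e[l][x[i]] * sum(F[i - 1][k] * a[k][l] for k in states) for l in states})
    return F

def backward(x, start, a, e):
    states = list(start)
    B = [{k: 1.0 for k in states} for _ in x]
    for i in range(len(x) - 2, -1, -1):
        for k in states:
            B[i][k] = sum(a[k][l] * e[l][x[i + 1]] * B[i + 1][l] for l in states)
    return B

def posterior(x, start, a, e):
    F, B = forward(x, start, a, e), backward(x, start, a, e)
    px = sum(F[-1][k] for k in F[-1])                       # p(x | theta)
    return [{k: F[i][k] * B[i][k] / px for k in F[i]} for i in range(len(x))]

For the CpG question above, summing posterior[i][k] over the '+' states k gives the probability that position i lies in an island.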
Parameter Estimation for HMM
An HMM model is defined by the parameters: akl and ek(b), for all
states k,l and all symbols b.
Let θ denote the collection of these parameters.
[Diagram: a state k emits symbol b with probability ek(b) and moves to state l with probability akl.]
Parameter Estimation for HMM
To determine the values of (the parameters in) θ, we use a training set {x1,...,xn}, where each xj is a sequence which is assumed to fit the model.
Given the parameters θ, each sequence xj has an assigned
probability p(xj|θ).
Maximum Likelihood Parameter
Estimation for HMM
The elements of the training set {x1,...,xn} are assumed to be independent, so
p(x1,..., xn | θ) = ∏_j p(xj | θ).
ML parameter estimation looks for θ which
maximizes the above.
The exact method for finding or approximating
this θ depends on the nature of the training set
used.
Data for HMM
Possible properties of (the sequences in) the training set:
1. For each xj, whether the corresponding state path (sj1,…,sjL) is known.
2. The size (number of sequences) of the training set.
Case 1: ML when state paths are fully
known
s1
s2
si
sL-1
sL
X1
X2
Xi
XL-1
XL
We know the complete structure of each sequence in the
training set {x1,...,xn}. We wish to estimate akl and ek(b) for
all pairs of states k, l and symbols b.
By the ML method, we look for parameters θ* which maximize the probability of the sample set:
p(x1,...,xn | θ*) = max_θ p(x1,...,xn | θ).
Case 1: State paths are fully known
For each xj we have:
p(xj | θ) = ∏_{i=1}^{L} a_{s_{i-1} s_i} · e_{s_i}(x_i)
Let m_{kl} = |{i : s_{i-1} = k, s_i = l}| (in xj) and m_k(b) = |{i : s_i = k, x_i = b}| (in xj). Then:
p(xj | θ) = ∏_{(k,l)} a_{kl}^{m_{kl}} · ∏_{(k,b)} [e_k(b)]^{m_k(b)}
Case 1
By the independence of the xj’s, p(x1,...,xn| θ)=∏jp(xj|θ).
Thus, if Akl = #(transitions from k to l) in the training set,
and Ek(b) = #(emissions of symbol b from state k) in the
training set, we have:
p(x1,..,xn | θ) = ∏_{(k,l)} a_{kl}^{A_{kl}} · ∏_{(k,b)} [e_k(b)]^{E_k(b)}
Case 1 (cont)
So we need to find akl’s and ek(b)’s which maximize:
∏_{(k,l)} a_{kl}^{A_{kl}} · ∏_{(k,b)} [e_k(b)]^{E_k(b)}
subject to: for all states k, ∑_l a_{kl} = 1 and ∑_b e_k(b) = 1 (with a_{kl}, e_k(b) ≥ 0).
Case 1
Rewriting, we need to maximize:
F = ∏_{(k,l)} a_{kl}^{A_{kl}} · ∏_{(k,b)} [e_k(b)]^{E_k(b)} = ∏_k [ ∏_l a_{kl}^{A_{kl}} ] · ∏_k [ ∏_b [e_k(b)]^{E_k(b)} ]
subject to: for all k, ∑_l a_{kl} = 1 and ∑_b e_k(b) = 1.
Case 1 (cont)
If we maximize for each k :
∏_l a_{kl}^{A_{kl}}   subject to   ∑_l a_{kl} = 1,
and also
∏_b [e_k(b)]^{E_k(b)}   subject to   ∑_b e_k(b) = 1,
then we will also maximize F.
Each of the above is a simpler ML problem, similar to ML parameter estimation for a die, treated next.
MLE for n outcomes
The MLE is given by the relative frequencies:
θ_i = n_i / n,   i = 1,…,k
where n_i is the number of times outcome i occurred and n = n_1 + … + n_k is the total number of observations.
Apply the ML method to HMM
Let Akl = #(transitions from k to l) in the training set.
Ek(b) = #(emissions of symbol b from state k) in the
training set. We need to:
Maximize
∏_{(k,l)} a_{kl}^{A_{kl}} · ∏_{(k,b)} [e_k(b)]^{E_k(b)}
subject to: for all states k, ∑_l a_{kl} = 1, ∑_b e_k(b) = 1, and a_{kl}, e_k(b) ≥ 0.
Apply to HMM (cont.)
We apply the previous technique to get, for each k, the parameters {akl | l = 1,..,m} and {ek(b) | b ∈ Σ}:
a_{kl} = A_{kl} / ∑_{l'} A_{kl'},   and   e_k(b) = E_k(b) / ∑_{b'} E_k(b'),
which gives the optimal ML parameters.
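A minimal Python sketch of this Case 1 estimation, assuming the training set is given as a list of (state path, symbol sequence) pairs; the function name and data layout are illustrative, not the lecture's.

from collections import defaultdict

# Case 1 sketch: state paths are known, so a_kl and e_k(b) are just normalized counts.
def estimate_from_labeled(training):
    A = defaultdict(lambda: defaultdict(int))   # A[k][l] = #(k -> l transitions)
    E = defaultdict(lambda: defaultdict(int))   # E[k][b] = #(emissions of b from state k)
    for states, symbols in training:
        for i, (k, b) in enumerate(zip(states, symbols)):
            E[k][b] += 1
            if i > 0:
                A[states[i - 1]][k] += 1
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in A[k]} for k in A}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in E[k]} for k in E}
    return a, e

Transitions or emissions never seen in the training set are simply absent here (effectively probability 0), which motivates the pseudocounts discussed below.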
Summary of Case 1:
State paths are fully known
We know the complete structure of each sequence in the
training set {x1,...,xn}. We wish to estimate akl and ek(b) for
all pairs of states k, l and symbols b.
Summary: when everything is known, we can find the (unique set of) parameters θ* which maximize the probability of the sample set:
p(x1,...,xn | θ*) = max_θ p(x1,...,xn | θ).
Adding pseudo counts in HMM
If the sample set is too small, we may get a biased result.
In this case we modify the actual count by our prior
knowledge/belief:
rkl is our prior belief on transitions from k to l.
rk(b) is our prior belief on emissions of b from state k.
Then:
a_{kl} = (A_{kl} + r_{kl}) / ∑_{l'} (A_{kl'} + r_{kl'}),   and   e_k(b) = (E_k(b) + r_k(b)) / ∑_{b'} (E_k(b') + r_k(b')).
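A small illustrative sketch of this correction: add the pseudocounts before normalizing. Using a single uniform pseudocount r for every entry, as below, is only one possible choice of prior.

# Pseudocount sketch: add prior counts before normalizing, so nothing gets probability 0.
def normalize_with_pseudocounts(counts, outcomes, r=1.0):
    total = sum(counts.get(o, 0) + r for o in outcomes)
    return {o: (counts.get(o, 0) + r) / total for o in outcomes}

# e.g. transitions out of some state observed as {'+': 3, '-': 0}:
print(normalize_with_pseudocounts({'+': 3, '-': 0}, ['+', '-']))   # {'+': 0.8, '-': 0.2}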
Case 2: State paths are unknown
For a given θ we have:
p(x1,..., xn | θ) = p(x1 | θ) · · · p(xn | θ)
(since the xj are independent).
For each sequence x,
p(x | θ) = ∑_s p(x, s | θ),
where the sum is taken over all state paths s which emit x.
Case 2: State paths are unknown
Thus, for the n sequences (x1,..., xn) we have:
p(x1,..., xn | θ) = ∑_{(s1,..., sn)} p(x1,..., xn, s1,..., sn | θ),
where the summation is taken over all tuples of n state paths (s1,..., sn) which generate (x1,..., xn).
For simplicity, we will assume that n=1.
Case 2: State paths are unknown
So we need to maximize p(x | θ) = ∑_s p(x, s | θ), where the summation is over all state paths s which produce the output sequence x.
Finding θ* which maximizes ∑_s p(x, s | θ) is hard.
[Unlike finding θ* which maximizes p(x, s | θ) for a single given pair (x, s).]
ML Parameter Estimation for HMM
The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find θ' so that p(x|θ') > p(x|θ).
3. Set θ = θ'.
4. Repeat until some convergence criterion is met.
A general algorithm of this type is the Expectation Maximization (EM) algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training algorithm.
Baum Welch training
We start with some values of akl and ek(b), which define an initial value of θ.
Then we use an iterative algorithm which attempts to
replace θ by a θ* s.t.
p(x|θ*) > p(x|θ)
This is done by “imitating” the algorithm for Case 1,
where all states are known:
…
Baum Welch training
[Diagram: two adjacent positions with states Si-1 = k, Si = l and emitted symbols xi-1 = b, xi = c.]
In case 1 we computed the optimal values of akl and ek(b),
(for the optimal θ) by simply counting the number Akl of
transitions from state k to state l, and the number Ek(b) of
emissions of symbol b from state k, in the training set.
This was possible since we knew all the states.
…
Baum Welch training
[Diagram: the same two positions with unknown states Si-1 = ?, Si = ? and emitted symbols xi-1 = b, xi = c.]
When the states are unknown, the counting process is replaced by an averaging process:
For each edge (si-1, si) we compute the expected number of "k to l" transitions, for all possible pairs (k, l), over this edge. Then, for each k and l, we take Akl to be the sum over all edges.
Baum Welch training
[Diagram: a position with unknown state Si = ? and its emitted symbol b.]
Similarly, for each edge (si, xi) with xi = b and each state k, we compute the expected number of times that si = k, which is the expected number of "k emits b" events on this edge. Then we take Ek(b) to be the sum over all such edges.
These expected values are computed by assuming the current parameters θ:
Akl and Ek(b) when states are unknown
Akl and Ek(b) are computed according to the current distribution θ, that is:
A_{kl} = ∑_s A^s_{kl} · p(s | x, θ), where A^s_{kl} is the number of k to l transitions in the state path s.
E_k(b) = ∑_s E^s_k(b) · p(s | x, θ), where E^s_k(b) is the number of times k emits b in the state path s with output x.
Baum Welch: step 1a
Count expected number of state
transitions
For each i, k, l, compute the state transition probabilities under the current θ:
P(si-1 = k, si = l | x, θ)
For this, we use the forward and backward algorithms.
Baum Welch: Step 1a (cont)
Claim: by the probability distribution of the HMM,
p(si-1 = k, si = l | x, θ) = F_k(i-1) · a_{kl} · e_l(x_i) · B_l(i) / p(x | θ)
(a_{kl} and e_l(x_i) are the parameters defined by θ, and F_k(i-1), B_l(i) are the values computed by the forward and backward algorithms).
Step 1a: Computing P(si-1=k, si=l | x,θ)
P(x1,…,xL, si-1 = k, si = l | θ) = P(x1,…,xi-1, si-1 = k | θ) · a_{kl} e_l(x_i) · P(xi+1,…,xL | si = l, θ)
= F_k(i-1) · a_{kl} e_l(x_i) · B_l(i)
(the first factor is computed by the forward algorithm, the last by the backward algorithm). Hence:
p(si-1 = k, si = l | x, θ) = F_k(i-1) a_{kl} e_l(x_i) B_l(i) / p(x | θ)
Step 1a
For each pair (k, l), compute the expected number of state transitions from k to l as the sum of the expected number of k to l transitions over all L edges:
A_{kl} = (1 / p(x | θ)) · ∑_{i=1}^{L} p(si-1 = k, si = l, x | θ)
= (1 / p(x | θ)) · ∑_{i=1}^{L} F_k(i-1) a_{kl} e_l(x_i) B_l(i)
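An illustrative Python sketch of this computation, reusing the forward() and backward() helpers from the earlier posterior-decoding sketch (so it is not self-contained on its own). With 0-based indexing, the slide's term F_k(i-1) a_{kl} e_l(x_i) B_l(i) becomes F[i-1][k] * a[k][l] * e[l][x[i]] * B[i][l].

# Step 1a sketch: expected number of k -> l transitions under the current parameters.
# Assumes forward() and backward() from the earlier sketch are in scope.
def expected_transitions(x, start, a, e):
    F, B = forward(x, start, a, e), backward(x, start, a, e)
    px = sum(F[-1][k] for k in F[-1])                       # p(x | theta)
    A = {k: {l: 0.0 for l in a[k]} for k in a}
    for i in range(1, len(x)):                              # sum over the edges of the chain
        for k in a:
            for l in a[k]:
                A[k][l] += F[i - 1][k] * a[k][l] * e[l][x[i]] * B[i][l] / px
    return A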
Step 1a for many sequences:
When we have n independent input sequences (x1,..., xn ),
then Akl is given by:
A_{kl} = ∑_{j=1}^{n} (1 / p(xj)) · ∑_{i=1}^{L} p(si-1 = k, si = l, xj | θ)
= ∑_{j=1}^{n} (1 / p(xj)) · ∑_{i=1}^{L} f^j_k(i-1) a_{kl} e_l(x^j_i) b^j_l(i),
where f^j_k and b^j_l are the forward and backward values computed for xj under θ.
Baum-Welch: Step 1b
count expected number of symbol emissions:
for each state k and each symbol b, for each i where xi = b, compute the expected number of times that si = k.
p(si = k | x1,…,xL) = p(x1,…,xL, si = k) / p(x1,…,xL) = F_k(i) B_k(i) / p(x1,…,xL)
Baum-Welch: Step 1b
For each state k and each symbol b, compute the
expected number of emissions of b from k as the sum
of the expected number of times that si = k, over all i’s
for which xi = b.
E_k(b) = (1 / p(x | θ)) · ∑_{i: xi = b} F_k(i) B_k(i)
Step 1b for many sequences
Exercise: when we have n sequences (x1,..., xn ), the
expected number of emissions of b from k is given by:
E_k(b) = ∑_{j=1}^{n} (1 / p(xj)) · ∑_{i: x^j_i = b} F^j_k(i) B^j_k(i)
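Correspondingly, here is an illustrative sketch of the single-sequence version of Step 1b, again assuming the forward() and backward() helpers from the earlier sketch are in scope.

# Step 1b sketch: expected number of emissions of each symbol b from each state k.
def expected_emissions(x, start, a, e):
    F, B = forward(x, start, a, e), backward(x, start, a, e)
    px = sum(F[-1][k] for k in F[-1])                       # p(x | theta)
    E = {k: {} for k in e}
    for i, b in enumerate(x):
        for k in e:
            E[k][b] = E[k].get(b, 0.0) + F[i][k] * B[i][k] / px
    return E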
Summary of Steps 1a and 1b:
the E part of the Baum Welch
training
These steps compute the expected numbers Akl of k to l transitions, for all pairs of states k and l, and the expected numbers Ek(b) of emissions of symbol b from state k, for all states k and symbols b.
The next step is the M step, which is identical to the
computation of optimal ML parameters when all states are
known.
Baum-Welch: step 2
Use the Akl’s, Ek(b)’s to compute the new values of akl
and ek(b). These values define θ*.
a_{kl} = A_{kl} / ∑_{l'} A_{kl'},   and   e_k(b) = E_k(b) / ∑_{b'} E_k(b')
The correctness of the EM algorithm implies that:
p(x1,..., xn | θ*) ≥ p(x1,..., xn | θ),
i.e., θ* does not decrease the probability of the data.
This procedure is iterated, until some convergence
criterion is met.
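Putting the pieces together, here is an illustrative single-sequence Baum-Welch loop built from the sketches above (forward, expected_transitions, expected_emissions). The start distribution is not re-estimated, and the tolerance and iteration cap are arbitrary choices, not values from the lecture.

# One-sequence Baum-Welch sketch, gluing together the earlier sketches.
def baum_welch(x, start, a, e, max_iter=100, tol=1e-6):
    prev_px = None
    for _ in range(max_iter):
        A = expected_transitions(x, start, a, e)            # E-step: expected A_kl
        E = expected_emissions(x, start, a, e)              #         expected E_k(b)
        # M-step: renormalize the expected counts into new parameters
        # (pseudocounts could be added here to avoid zero totals)
        a = {k: {l: A[k][l] / sum(A[k].values()) for l in A[k]} for k in A}
        e = {k: {b: E[k][b] / sum(E[k].values()) for b in E[k]} for k in E}
        F = forward(x, start, a, e)
        px = sum(F[-1].values())                            # p(x | theta) under the new parameters
        if prev_px is not None and px - prev_px < tol:      # stop once the likelihood stops improving
            break
        prev_px = px
    return a, e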