Hidden Markov Models
part 2
CBB 231 / COMPSCI 261
Recall: Training Unambiguous Models
transitions

    CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC
    011111112222222111111222211111112222111110

                        to state
    from state      0           1           2
        0        0 (0%)     1 (100%)    0 (0%)
        1        1 (4%)     21 (84%)    3 (12%)
        2        0 (0%)     3 (20%)     12 (80%)

    a_{i,j} = \frac{A_{i,j}}{\sum_{h=0}^{|Q|-1} A_{i,h}}
emissions

                        symbol
    in state        A           C           G           T
        1        6 (24%)     7 (28%)     5 (20%)     7 (28%)
        2        3 (20%)     3 (20%)     2 (13%)     7 (47%)

    e_{i,k} = \frac{E_{i,k}}{\sum_{h=0}^{|\Sigma|-1} E_{i,h}}
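For concreteness, here is a minimal Python/NumPy sketch of labeled-sequence training for one (sequence, path) pair. The function name train_labeled, the alphabet argument, and the convention that state 0 is the silent begin/end state are illustrative assumptions, not part of the slides.

```python
import numpy as np

def train_labeled(seq, path, num_states, alphabet="ACGT"):
    """Count transitions and emissions from one labeled sequence and
    normalize them into a_{i,j} and e_{i,k}.  Assumes state 0 is the
    silent begin/end state and that path[k] is the state emitting seq[k]."""
    A = np.zeros((num_states, num_states))      # A[i,j]: observed i -> j transitions
    E = np.zeros((num_states, len(alphabet)))   # E[i,k]: symbol k emitted while in state i

    prev = 0                                    # start in the silent state q0
    for symbol, state in zip(seq, path):
        A[prev, state] += 1
        E[state, alphabet.index(symbol)] += 1
        prev = state
    A[prev, 0] += 1                             # final transition back to q0

    # a_{i,j} = A_{i,j} / sum_h A_{i,h};  e_{i,k} = E_{i,k} / sum_h E_{i,h}
    a_rows = A.sum(axis=1, keepdims=True)
    e_rows = E.sum(axis=1, keepdims=True)
    a = np.divide(A, a_rows, out=np.zeros_like(A), where=a_rows > 0)
    e = np.divide(E, e_rows, out=np.zeros_like(E), where=e_rows > 0)
    return a, e
```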
Training Ambiguous Models
The Problem: We have training sequences, but not the associated paths (state labels).

Two Solutions:

1. Viterbi Training: Start with random HMM parameters. Use Viterbi to find the most probable path for each training sequence, and then label the sequence with that path. Use labeled-sequence training on the resulting set of sequences and paths.

2. Baum-Welch Training: Sum over all possible paths (rather than the single most probable one) to estimate the expected counts A_{i,j} and E_{i,k}; then use the same formulas as for labeled-sequence training on these expected counts:

    a_{i,j} = \frac{A_{i,j}}{\sum_{h=0}^{|Q|-1} A_{i,h}} ,   e_{i,k} = \frac{E_{i,k}}{\sum_{h=0}^{|\Sigma|-1} E_{i,h}}
Viterbi Training
[Flowchart: training features → initial submodel M → Viterbi Decoder → paths {π_i} → Sequence Labeler → labeled features → Labeled Sequence Trainer → new submodel M (repeat n times) → final submodel M*]

Repeat n times, iterating over the training sequences: find the most probable path for each sequence with the Viterbi decoder, label the sequence with that path, and run labeled-sequence training on the resulting labeled set to obtain the new submodel M.
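A minimal sketch of this loop, assuming a viterbi(seq, a, e) decoder as in part 1 and a train_on_pairs() helper that pools counts over all (sequence, path) pairs and applies the labeled-sequence formulas; both helper names are assumptions for illustration.

```python
def viterbi_training(seqs, a, e, n_iterations=10):
    """One possible realization of the flow diagrammed above.
    viterbi(seq, a, e) is assumed to return the most probable state path
    for seq under the current parameters; train_on_pairs(seqs, paths) is
    assumed to pool counts over all pairs and re-estimate a and e, like
    train_labeled() above but for a whole training set."""
    for _ in range(n_iterations):                    # repeat n times
        paths = [viterbi(s, a, e) for s in seqs]     # most probable path per sequence
        a, e = train_on_pairs(seqs, paths)           # retrain on the now-labeled sequences
    return a, e                                      # final submodel M*
```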
Recall: The Forward Algorithm
1
0

0
F (i,k )  
| Q| 1
 F ( j,k  1)Pt (qi | qj )Pe (xk 1 | qi )
j0


for k  0,i  0
for k  0,i  0
for k  0,i  0
for 1 k  | S |,
1 i | Q |
F(i,k) represents the probability P(x0...xk-1, qi) that the machine

emits the subsequence x0...xk-1 by any path ending in state qi—i.e.,
so that symbol xk-1 is emitted by state qi.
| Q| 1
P(S | M ) 
 F (i,| S |)P (q
t
i0
0
| qi )
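A Python/NumPy sketch of this recurrence, under assumed conventions: trans[i, j] = Pt(qj | qi), emit[i, s] = Pe(s | qi), seq is a list of symbol indices, and state 0 is the silent begin/end state.

```python
import numpy as np

def forward(seq, trans, emit):
    """Forward DP matrix F and total probability P(S|M)."""
    Q = trans.shape[0]
    L = len(seq)
    F = np.zeros((Q, L + 1))
    F[0, 0] = 1.0                                  # F(0,0)=1; all other F(i,0), F(0,k) stay 0
    for k in range(1, L + 1):
        for i in range(1, Q):
            # F(i,k) = Pe(x_{k-1}|q_i) * sum_j F(j,k-1) Pt(q_i|q_j)
            F[i, k] = emit[i, seq[k - 1]] * np.dot(F[:, k - 1], trans[:, i])
    # P(S|M) = sum_i F(i,|S|) Pt(q_0|q_i)
    prob_S = np.dot(F[:, L], trans[:, 0])
    return F, prob_S
```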
The Backward Algorithm
| Q| 1
 Pt (qj | qi )Pe (xk | qj )B( j,k  1)
B(i,k )   j1

Pt (q0 | qi )

if k  L,
if k  L.
B(i,k) = probability that the machine M will emit the
subsequence xk...xL-1 and then terminate, given that M is
currently in state qi (which has already emitted xk-1).
THEREFORE:
B(0,0)  P(S)

Backward Algorithm: Pseudocode
| Q| 1
 Pt (qj | qi )Pe (xk | qj )B( j,k  1)
B(i,k )   j1

Pt (q0 | qi )
if k  L,
if k  L.
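A Python/NumPy sketch of the Backward recurrence under the same assumed conventions as the forward() sketch above; as noted on the previous slide, the value B(0,0) it computes equals P(S).

```python
import numpy as np

def backward(seq, trans, emit):
    """Backward DP matrix B, with trans[i,j] = Pt(q_j|q_i) and emit[j,s] = Pe(s|q_j)."""
    Q = trans.shape[0]
    L = len(seq)
    B = np.zeros((Q, L + 1))
    B[:, L] = trans[:, 0]                          # B(i,L) = Pt(q_0 | q_i)
    for k in range(L - 1, -1, -1):
        for i in range(Q):
            # B(i,k) = sum_{j>=1} Pt(q_j|q_i) Pe(x_k|q_j) B(j,k+1)
            B[i, k] = np.sum(trans[i, 1:] * emit[1:, seq[k]] * B[1:, k + 1])
    return B                                       # note: B[0, 0] == P(S)
```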
Baum-Welch: Summing over All Paths
F(i,k) B(i,k) = P(M emits x_0...x_{k-1}...x_{L-1}, with x_{k-1} being emitted by state q_i).

F(i,k) P_t(q_j|q_i) P_e(x_k|q_j) B(j,k+1) = P(M emits x_0...x_{k-1}x_k...x_{L-1} and transitions from state q_i to q_j at time k-1 → k).
Combining Forward & Backward
F(i,k) = P(x_0...x_{k-1}, q_i) = P(M emits x_0...x_{k-1} by any path ending in state q_i, with x_{k-1} emitted by q_i).

B(i,k) = P(x_k...x_{L-1} | q_i) = P(M emits x_k...x_{L-1} and then terminates, given that M is in state q_i, which has emitted x_{k-1}).

F(i,k) B(i,k) = P(x_0...x_{k-1}, q_i) P(x_k...x_{L-1} | q_i) = P(x_0...x_{L-1}, q_i)*

F(i,k) B(i,k) / P(S) = P(q_i emitted x_{k-1} | S)

expectation(E_{q,s}) = \sum_{q_i,k} P(q_i,k \mid S)\, C(q_i,k) = \sum_{q_i,k} \frac{F(i,k+1)\, B(i,k+1)}{P(S)}\, C(q_i,k)

where C(q_i,k) = 1 if q_i = q and x_k = s; otherwise 0.

expectation(A_{i,j}) = \sum_{q_m,q_n,k} \frac{F(m,k)\, P_t(q_n \mid q_m)\, P_e(x_k \mid q_n)\, B(n,k+1)}{P(S)}\, C(q_m,q_n,k)

where C(q_m,q_n,k) = 1 if q_m = q_i and q_n = q_j; otherwise 0.

*assuming P(x_k...x_{L-1}) is conditionally independent of P(x_0...x_{k-1}), given q_i.
Baum-Welch Training
compute Fwd & Bkwd DP matrices
accumulate expected counts for E & A: at each sequence position k,

the increment \frac{F(i,k+1)\, B(i,k+1)}{P(S)} is added to E_{i,x_k}, and

the increment \frac{F(m,k)\, P_t(q_n \mid q_m)\, P_e(x_k \mid q_n)\, B(n,k+1)}{P(S)} is added to A_{m,n}.
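A sketch of one such pass over a single training sequence, using the forward() and backward() sketches above and accumulating the two increments shown; the handling of the final transition back to q0 is an added assumption.

```python
import numpy as np

def baum_welch_counts(seq, trans, emit):
    """Accumulate expected emission counts E and transition counts A for one
    sequence, using F(i,k+1)B(i,k+1)/P(S) and
    F(m,k)Pt(qn|qm)Pe(xk|qn)B(n,k+1)/P(S) as on the slide."""
    Q = trans.shape[0]
    L = len(seq)
    F, prob_S = forward(seq, trans, emit)
    B = backward(seq, trans, emit)

    A = np.zeros((Q, Q))
    E = np.zeros_like(emit)
    for k in range(L):
        for i in range(1, Q):
            E[i, seq[k]] += F[i, k + 1] * B[i, k + 1] / prob_S
        for m in range(Q):
            for n in range(1, Q):
                A[m, n] += (F[m, k] * trans[m, n] * emit[n, seq[k]]
                            * B[n, k + 1] / prob_S)
    A[1:, 0] += F[1:, L] * trans[1:, 0] / prob_S   # expected i -> q0 (end) transitions
    # Re-estimation then applies the labeled-sequence formulas to A and E:
    #   a_{m,n} = A_{m,n} / sum_h A_{m,h},  e_{i,k} = E_{i,k} / sum_h E_{i,h}
    return A, E
```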
Using Logarithms in Forward & Backward
n1 
 n1 log p log p 
0
log pi  log p0  log1  e i

i0 
 i1

(6.43)
In the log-space version of these algorithms, we can replace the
raw probabilities pi with their logarithmic counterparts, log pi,
and apply the Equation (6.43) whenever the probabilities are to
be summed. Evaluation of the elog p -log p term should generally
not result in numerical underflow in practice, since this term
evaluates to pi/p0, which for probabilities of similar events
should not deviate too far from unity.
i
(due to Kingsbury & Rayner, 1971)
0
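A small Python illustration of Equation (6.43). Choosing p_0 to be the largest summand (as in the example below) keeps every exponent non-positive, so the exponentials cannot overflow.

```python
import math

def log_sum(log_ps):
    """Return log(p_0 + ... + p_{n-1}) given the individual log p_i, per Equation (6.43)."""
    log_p0 = log_ps[0]
    return log_p0 + math.log(1.0 + sum(math.exp(lp - log_p0) for lp in log_ps[1:]))

# Probabilities far too small to represent directly can still be summed:
logs = [-1000.0, -1001.0, -1002.0]   # log p_i for p_i = e^-1000, e^-1001, e^-1002
print(log_sum(logs))                  # ~ -999.59; computing the p_i directly would underflow to 0.0
```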
Monotonic Convergence Behavior
Posterior Decoding
[Figure: the probability of a feature is assembled from a Forward-algorithm term for the prefix, a fixed path through the feature of interest, and a Backward-algorithm term for the suffix]
Posterior Decoding
P(E \text{ is an exon} \mid S) = \sum_{\text{all } \pi \text{ containing } E} P(\pi \mid S)

P(E = [i,j] \text{ is an exon} \mid S) = \frac{1}{P(S)} \left( \sum_{y \neq q_{\text{exon}}} F(y,i)\, P_t(q_{\text{exon}} \mid q_y) \right) \left( \prod_{k=i}^{j} P_e(x_k \mid q_{\text{exon}}) \right) P_t(q_{\text{exon}} \mid q_{\text{exon}})^{\,j-i} \left( \sum_{z \neq q_{\text{exon}}} P_t(z \mid q_{\text{exon}})\, P_e(x_{j+1} \mid z)\, B(z, j+2) \right)

(the leading 1/P(S) normalizes the joint probability of S and the exon into a conditional on S, as in the Forward-Backward posterior above)
Summary
• Training of ambiguous HMMs can be accomplished using Viterbi training or the Baum-Welch algorithm.
• Viterbi training performs labeling using the single most probable path.
• Baum-Welch training instead estimates transition & emission events by computing expectations via Forward-Backward, summing over all paths containing a given event.
• Posterior decoding can be used to estimate the probability that a given symbol or substring was generated by a particular state.