Transcript HMM_4

Sequence Alignment via HMM
Background Readings: chapters 3.4, 3.5, 4, in the Durbin et al.
.
HMM model structure:
1. Duration Modeling
Markov chains are rather limited in describing sequences of
symbols with non-random structures. For instance, Markov
chain forces the distribution of segments in which some state is
repeated for k times to be (1-p)pk-1, for some p.
Several ways enable modeling of other distributions. One is
assigning k >1 states to represent the same “real” state. This
may guarantee k repetitions with any desired probability.
A1
A2
A3
A4
2
HMM model structure:
2. Silent states
States which do not emit symbols (as we saw in the abo locus).
 Can be used to model duration distributions.
 Also used to allow arbitrary jumps (may be used to model
deletions)
 Need to generalize the Forward and Backward algorithms for
arbitrary acyclic digraphs to count for the silent states (see
next):

Silent states:
Regular states:
3
Eg, the forwards algorithm should look:
For regular states v which emit symbol x :

f l (v )  el ( x )
( u ,v )E

k

f k (u )akl 

and for silent states z : f l ( z ) 
 f
k (u )a kl
( u , z )E k
z
Silent states
Directed cycles of silence (or
other) states complicate
things, and should be
avoided.
Regular states
symbols
v
x
4
HMM model structure:
3. High Order Markov Chains
Markov chains in which the transition probabilities
depends on the last n states:
P(xi|xi-1,...,x1) = P(xi|xi-1,...,xi-n)
Can be represented by a standard Markov chain with
more states:
AA
AB
BA
BB
5
HMM model structure:
4. Inhomogenous Markov Chains
 An
important task in analyzing DNA sequences is
recognizing the genes which code for proteins.
 A triplet of 3 nucleotides – called codon - codes for
amino acids (see next slide).
 It is known that in parts of DNA which code for genes,
the three codons positions has different statistics.
 Thus a Markov chain model for DNA should represent
not only the Nucleotide (A, C, G or T), but also its
position – the same nucleotide in different position will
have different transition probabilities. This seems to
work better in some cases (see eg Durbin et al, p. 76)
6
Genetic Code
There are 20 amino acids from which proteins are build.
7
Sequence Comparison using HMM
HMM for sequence alignment, which incorporates
affine gap scores.
“Hidden” States
 Match
 Insertion in x
 insertion in y.
Symbols emitted
 Match: {(a,b)| a,b in ∑ }
 Insertion in x: {(a,-)| a in ∑ }.
 Insertion in y: {(-,a)| a in ∑ }.
8
The Transition Probabilities
M
Transitions probabilities
(note the forbidden ones).
δ = probability for 1st gap
ε = probability for tailing
gap.
M
X
Y
X
Y
1-2δ
δ
δ
1- ε
ε
1- ε
ε
 Emission
Probabilities
 Match: (a,b) with pab – only from M states
 Insertion in x: (a,-) with qa – only from X state
 Insertion in y: (-,a).with qa - only from Y state.
9
The Transition Probabilities
Each aligned pair is generated by the above HMM with
certain probability.
Note that the hidden states can be reconstructed from
the alignment. However, for each pair of sequences x
(of length m) and y (of length n), there are many
alignments of x and y, each corresponds to a different
state-path (the length of the paths are between
max{m,n} and m+n).
Our task is to score alignments by using this model. The
score should reflect the probability of the alignment.
10
Most probable alignment
Let vM(i,j) be the probability of the most probable
alignment of x(1..i) and y(1..j), which ends with a
match. Then using a recursive argument, we get:
v M [i, j ]  p xi y j
 (1  2 )v M (i  1, j  1) 


X
max  (1   )v (i  1, j  1) 


Y
 (1   )v (i  1, j  1) 


11
Most probable alignment
Similarly, vX(i,j) and vY(i,j), the probabilities of the most
probable alignment of x(1..i) and y(1..j), which ends
with an insertion to x or y:
M


v
(i  1, j ) 
X
v [i, j ]  qxi max 

  v X (i  1, j ) 


M


v
(i, j  1) 
Y
v [i, j ]  q y j max 

  vY (i, j  1) 


12
Adding termination probabilities
We may want a model which defines a
probability distribution over all possible
sequences.
M
For this, an END state is
added, with transition
probability τ from any
other state to END. This
assume expected sequence
length of 1/ τ.
X
Y
END
M 1-2δ -τ
δ
δ
τ
X 1-ε -τ
ε
Y
END
1-ε -τ
τ
ε
τ
1
13
Full probability of x and y
Rather then computing the probability of the most probable
alignment, we look for the total probability that x and y are
related by our model.
Let fM(i,j) be the sum of the probabilities of all alignments of
x(1..i) and y(1..j), which end with a match. Similarly, fX(i,j)
and fY(i,j) are the sum of these alignments which end with
insertion to x (y resp.). A “forward” type algorithm for
computing these functions.
Initialization: fM(0,0)=1, fX(0,0)= fY(0,0)=0 (we start from M, but
we could select other initial state).
14
Full probability of x and y (cont.)
f M [i, j ]  p xi y j
 (1  2   ) f M (i  1, j  1) 


X
 (1     ) f (i  1, j  1) 


Y
 (1     ) f (i  1, j  1) 


M


f
(i  1, j ) 
X
f [i, j ]  qxi 

  f X (i  1, j ) 


M


f
(i, j  1) 
Y
f [i, j ]  q y j 

  f X (i, j  1) 


The total probability of all alignments is:
P(x,y|model)= fM[m,n] + fX[m,n] + fY[m,n]
15
The log-odds scoring function

We wish to know if the alignment score is above or
below the score of random alignment.
Recall: The log-odds ratio s(a,b) = log (pab / qaqb).
 log (pab/qaqb)>0 iff the probability that a and b are
related by our model is larger than the probability that
they are picked at random.
 To adapt this for the HMM model, we need to model
random sequence by HMM, with end state.

16
The log-odds scoring function
The transition probabilities
for the random model:
(x is the start state)
X
Y
X 1- η
η
1- η
Y
END
η
1
END
The emission probability for a is qa. Thus the probability of
of x (of length n) and y (of length m) being random is:
n
m
i 1
j 1
p( x, y | Random)   2 (1   ) n  m  qxi  q y j
17
Markov Chains for “Random” and “Model”
M
X
Y
END
M 1-2δ -τ
δ
δ
τ
X 1-ε -τ
ε
Y
1-ε -τ
τ
ε
τ
1
END
“Model”
“Random”
X
Y
X 1- η
η
Y
END
1- η
END
η
1
18
The log-odds scoring function
The resulted scoring function s in the HMM model:
d and e are the penalties for a first gap and a tailing gap, resp, which
are subtracted from the score.
pab
(1  2   )
s( a, b)  log
 log
q a qb
(1   ) 2
(assume move from M state)
 (1     )
(" prepayment" when moving to X or Y states)
(1   )(1  2   )

-e  log
1 -
-d  log
The algorithm corrects for extra prepayment, as follows:
19
Log-odds alignment algorithm
Initialization: VM(0,0)=logτ - 2logη.
 V M (i  1, j  1) 


M
X
V [i, j ]  s ( xi , y j )  max  V (i  1, j  1) 
 Y

 V (i  1, j  1) 


M

V
(i  1, j )  d  Y
 V M (i, j  1)  d 
X
V [i, j ]  max 


 V X (i  1, j )  e  V [i, j ]  max  Y

V
(
i
,
j

1)

e




Termination:
V = max{ VM(m,n), VX(m,n)+c, VY(m,n)+c}
Where c= log (1-2δ-τ) – log(1-ε-τ)
20