Transcript lecture5

Advanced Algorithms
and Models for
Computational Biology
-- a machine learning approach
Computational Genomics II:
HMM variants and Comparative
Gene Finding
Eric Xing
Lecture 5, February 1, 2005
Reading: Chap 3, 5 DEKM book
Chap 9, DTW book
Higher-order HMMs


The Genetic Code

3 nucleotides make 1 amino acid

Statistical dependencies in triplets
Question:

Recognize protein-coding
segments with an HMM
Higher-order HMMs

Every state of the HMM emits 1 nucleotide

Transition probabilities:
Probability of a state at one
position, given those of 3
previous positions (triplets):
P(y_i | y_{i-1}, y_{i-2}, y_{i-3})

Emission probabilities:
P(x_i | y_i)

Algorithms extend with small modifications

[Figure: chain of hidden states y_1, ..., y_N (labels shown: i, e), each emitting one nucleotide x_1, ..., x_N (A, C, G, T)]
Inference on Higher-order HMMs

Build a 1st-order HMM on "mega" states:
Q1 = (Y1, Y2, Y3), Q2 = (Y2, Y3, Y4), Q3 = (Y3, Y4, Y5), ...
R1 = (X1, X2, X3), R2 = (X2, X3, X4), R3 = (X3, X4, X5), ...

Use the forward-backward (FB) algorithm as usual

P(Q2 | R) = P(Y2, Y3, Y4 | X)
P(Y3 | X) = Σ_{Y2,Y4} P(Y2, Y3, Y4 | X)

[Figure: original chain y_1, ..., y_N over x_1, ..., x_N, redrawn as a first-order chain of mega states Q1, Q2, Q3, ... over mega observations R1, R2, R3, ...]
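The mega-state construction can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the function names, the two-letter state alphabet, and the transition numbers are all assumptions made for the example.

```python
from itertools import product

def mega_states(alphabet, order=3):
    """All tuples of `order` consecutive base states, e.g. (Y1, Y2, Y3)."""
    return list(product(alphabet, repeat=order))

def mega_transitions(alphabet, third_order):
    """First-order transition table over mega states.

    third_order[(a, b, c)][d] is P(y_i = d | y_{i-3} = a, y_{i-2} = b, y_{i-1} = c).
    The mega state (a, b, c) can only move to (b, c, d), with that probability.
    """
    table = {}
    for q in mega_states(alphabet):
        table[q] = {q[1:] + (d,): third_order[q][d] for d in alphabet}
    return table

def marginal_middle(posterior_q):
    """P(Y3 | X) from P(Y2, Y3, Y4 | X): sum out Y2 and Y4."""
    out = {}
    for (_, y3, _), prob in posterior_q.items():
        out[y3] = out.get(y3, 0.0) + prob
    return out

# Toy example: base states 'i' and 'e'; the probabilities are made up.
alphabet = ("i", "e")
third_order = {q: {"i": 0.7, "e": 0.3} for q in mega_states(alphabet)}
trans = mega_transitions(alphabet, third_order)
```

With k base states there are k^3 mega states, but each one has only k successors, so the first-order transition table stays sparse.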
Modeling the Duration of States

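[Figure: two-state HMM. State X has self-transition probability p and moves to Y with probability 1-p; state Y has self-transition probability q and moves to X with probability 1-q]

Length distribution of region X:
E[l_X] = 1/(1-p)

Geometric distribution, with mean 1/(1-p)

(homework: derive this)

Being restricted to geometric length distributions is a significant disadvantage of HMMs

Several solutions exist for modeling different length distributions

[Figure: Observed Duration Time]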
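A quick numerical check of the geometric duration claim, as a sketch: the value of p and the truncation point of the infinite sum are arbitrary choices for illustration, and the function names are not from the lecture.

```python
def duration_pmf(length, p):
    """P(l_X = length) when X has self-transition probability p (length >= 1)."""
    return p ** (length - 1) * (1 - p)

def mean_duration(p, max_length=10_000):
    """Numerical mean of the duration, truncating the infinite sum."""
    return sum(l * duration_pmf(l, p) for l in range(1, max_length + 1))

p = 0.9                      # illustrative self-transition probability
approx = mean_duration(p)    # truncated sum over l * P(l)
exact = 1 / (1 - p)          # closed form quoted on the slide
```

The pmf is largest at length 1 and decays monotonically, which is why a single self-looping state fits poorly to regions whose typical length is well above 1.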
Poisson Point Process

A counting process that represents the total number of
occurrences of discrete events during a temporal/spatial
interval

the number of occurrences in any interval of length τ is Poisson
distributed with parameter λτ:
p(A(t + τ) - A(t) = n) = e^{-λτ} (λτ)^n / n!

the numbers of occurrences in disjoint intervals are independent

the duration τ of the interval between two consecutive occurrences has
the following distribution:
p(τ < s) = 1 - e^{-λs}

[Figure: Poisson point process]
Truncation is needed at both ends!
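The two Poisson formulas can be written out directly, as in the sketch below. The names `rate` (for λ) and `tau` are illustrative. The last function is one possible reading of the truncation remark, namely restricting a Poisson-shaped duration distribution to a finite range and renormalizing; that reading is an assumption, not something the slide spells out.

```python
import math

def poisson_count_pmf(n, rate, tau):
    """P(A(t + tau) - A(t) = n) = exp(-rate*tau) * (rate*tau)**n / n!"""
    mean = rate * tau
    return math.exp(-mean) * mean ** n / math.factorial(n)

def waiting_time_cdf(s, rate):
    """P(waiting time < s) = 1 - exp(-rate * s)."""
    return 1.0 - math.exp(-rate * s)

def truncated_duration_pmf(mean, d_min, d_max):
    """Poisson(mean) weights restricted to d_min..d_max and renormalized.

    Assumed interpretation of 'truncation at both ends'.
    """
    weights = {d: poisson_count_pmf(d, mean, 1.0) for d in range(d_min, d_max + 1)}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}

# Illustrative numbers only.
counts = [poisson_count_pmf(n, rate=2.0, tau=1.5) for n in range(60)]
durations = truncated_duration_pmf(mean=5.0, d_min=1, d_max=20)
```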
Generalized HMM
Upon entering a state:
1. Choose duration d, according to probability distribution
2. Generate d letters according to emission probs
3. Take a transition to next state according to transition probs
[Figure: generalized HMM in which each state y_1, ..., y_N has a duration d_1, ..., d_N and emits a block of that many letters of x]
Disadvantage: Increase in complexity:
Time: O(D^2)
Space: O(D)
where D = maximum duration of state
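The three generation steps of a generalized HMM can be sketched as a sampler. Everything concrete here is hypothetical: the state names, the duration tables, and the emission and transition numbers are made up for illustration, and `sample_ghmm` is not a function from any library.

```python
import random

def draw(dist, rng):
    """Sample one key from a dict mapping outcome -> probability."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]

def sample_ghmm(start, durations, emissions, transitions, n_segments, rng):
    """Generate n_segments state visits; returns (state, duration, letters) tuples."""
    state, segments = start, []
    for _ in range(n_segments):
        d = draw(durations[state], rng)                                   # 1. choose duration
        letters = "".join(draw(emissions[state], rng) for _ in range(d))  # 2. emit d letters
        segments.append((state, d, letters))
        state = draw(transitions[state], rng)                             # 3. take a transition
    return segments

# Hypothetical two-state model; every number below is made up.
durations = {"exon": {3: 0.5, 6: 0.5}, "intron": {2: 0.2, 4: 0.8}}
emissions = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}
transitions = {"exon": {"intron": 1.0}, "intron": {"exon": 1.0}}

segments = sample_ghmm("exon", durations, emissions, transitions, 4, random.Random(0))
sequence = "".join(letters for _, _, letters in segments)
```

Because the duration is drawn explicitly, any length distribution with maximum D can be used in place of the geometric one, at the cost in time and space noted above.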
Comparative Genomics
A pairwise comparison between
the human and mouse genomes
Aligning One Locus
Three Pairwise Alignments
Example: a human/mouse
ortholog
Paired HMM
States, with the number of positions each one advances in the two sequences:
M (+1,+1)
I (+1, 0)
J (0, +1)

Alignments correspond
1-to-1 with sequences
of states M, I, J

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
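The 1-to-1 correspondence between alignments and state sequences can be made concrete with a small sketch. It follows the convention of the slide's example, where a gap in the first row is labeled I and a gap in the second row is labeled J. The first row is assumed to end in three gap characters so that both rows have 37 columns, matching the 37-state path; `state_path` is an illustrative name.

```python
def state_path(row1, row2, gap="-"):
    """Map each alignment column to M (both residues), I or J (one gap)."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    path = []
    for a, b in zip(row1, row2):
        if a == gap and b == gap:
            raise ValueError("a column cannot contain two gaps")
        if a == gap:
            path.append("I")   # gap in the first row
        elif b == gap:
            path.append("J")   # gap in the second row
        else:
            path.append("M")   # residue in both rows
    return "".join(path)

row1 = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
row2 = "TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC"
path = state_path(row1, row2)
```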
Let’s score the transitions
[Figure: the M (+1,+1), I (+1, 0), J (0, +1) state diagram with transitions labeled by scores: s(x_i, y_j), -d, and -e]
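Scoring a given alignment along its state path might look like the sketch below. It assumes the usual affine-gap reading of the labels, which the transcript does not spell out: s(x_i, y_j) is added for every M column, -d is charged when a run of I or J begins, and -e for each further column of the same run. A direct switch between I and J is charged as a new gap here, which is also an assumption. The substitution function and the values of d and e are made up.

```python
def score_alignment(row1, row2, s, d, e, gap="-"):
    """Sum transition scores along the M/I/J path of an alignment."""
    total, prev = 0, None
    for a, b in zip(row1, row2):
        if a != gap and b != gap:
            state = "M"
            total += s(a, b)                     # entering M: substitution score
        else:
            state = "I" if a == gap else "J"
            total -= e if prev == state else d   # extend the run, or open a new gap
        prev = state
    return total

def simple_s(a, b):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

example_score = score_alignment("AC--G", "ACTTG", simple_s, d=3, e=1)
```

The example path is M M I I M, so the total is three matches, one gap opening, and one gap extension.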
A Pair HMM for alignments
[Figure: pair HMM state diagram with states BEGIN, M, I, J, and END. M emits a pair with probability P(x_i, y_j), I emits P(x_i), and J emits P(y_j). Transition probabilities label the arcs.]
Gene Finding
Generalized HMM Gene finder
Generalized Pair-HMM gene
finder
Hierarchical state transition in
pHMM
Allowing for inserted exons
Acknowledgments

Serafim Batzoglou: for some of the slides adapted or
modified from his lecture slides at Stanford University

Lior Pachter: for some of the slides modified from his
lectures at UC Berkeley