Transcript file3
Statistical Alignment:
Computational Properties,
Homology Testing and
Goodness-of-Fit
J. Hein, C. Wiuf, B. Knudsen,
M.B. Moller and G. Wibling
Main Objective of the paper
To show how to accelerate the statistical
alignment algorithms several orders of magnitude
using the model of insertion and deletions by
Thorne, Kishino, and Felsenstein in 1991 (TKF91
model).
To propose a new homology test based on the
model.
To describe a goodness-of-fit test that allows
testing the proposed insertion-deletion process
inherent to the model.
Why isn’t statistical alignment popular?
Computationally VERY SLOW
– Authors of the paper accelerated the statistical
alignment algorithms several orders of magnitude
compared with the TKF91 algorithm.
Lack of user-friendly software?
– Usually written in Fortran or C, or the compiled program
only works in UNIX environment, but most biologists
don’t know much about it.
– Authors of the paper have provided a web interface to
the program
parsimony and similarity alignments
• Parsimony strategy: minimizing the distance
For example:
• Similarity strategy: maximizing the similarity score
For example: BLAST
TKF91 model of substitutions
continuous time Markov model on the
state space of nucleotides or amino acids
Rate matrix Q is specified
– Describes the intensity of different substitution
events over an infinitesimal time period.
– Probability that i has changed to j after time t is
P (etQ)
ij
ij
The process is assumed to be time reversible:
P (t) P (t)
i ij
j ji
TKF91 model of the indel process
Can be view as a Markov model with all
sequences as possible states
indel part of the model
–
–
–
–
links connecting the letters of the sequences
each has a mortal link on the right
left end has an immortal link
For example: A G G
• If the type of the nucleotide is ignored, can be
represented as
TKF91 model
– mortal link can give birth to a new mortal link
or die out
– immortal link can also give birth but would not
die
– Therefore, the rates can be written as:
A G G
I0 S1 I1/D1 S2 I2/D2 S3 I3/D3
where I is the birth rate
D is the death rate, D>I
S is the substitution rate
TKF91 model
To calculate the probability of a particular alignment:
s(1): A T s(2): C T G
P(s(1), s(2), alignment) = (p1’’)(AP1 PAC)(T P2 PTT G)
Calculating the probability of two
sequences
Without conditioning on the alignment, it is
necessary to sum over all alignments weighted
with their probabilities according to the TKF91
process.
Confine likelihood calculations to a band close to
the similarity based alignment allows an efficient
numerical optimization algorithm for finding the
maximum likelihood estimate
The recursions originally presented by Thorne,
Kishino and Felsenstein can be simplified.