Multiple String Comparison – The Holy Grail

Download Report

Transcript Multiple String Comparison – The Holy Grail

Multiple Alignment –
Υλικό βασισμένο στο κεφάλαιο 14
του βιβλίου: Dan Gusfield,
Algorithms on Strings, Trees and
Sequences, Cambridge University
Press
Three cοmmοn representations
• There are three common kinds of family
representations that come from multiple string
comparison:
▫ Profile representations
▫ Consensus sequence representations
▫ Signature representations.
Family representations and alignments
with profiles
• Definition: Given a multiple alignment of a set of
strings, a profile for that multiple alignment
specifies for each column the frequency that
each character appears in the column. A profile
is sometimes also called a weight matrix in the
biological literature.
How to optimally align a string to a
profile
• Definition: For a character y and column j, let
p(y,j) be the frequency that character y appears
in column j of the profile, and let S(x,j) denote
 s  x, y  xp  y, j  the score for aligning x with
column j.
• Let V(i,j) denote the value of the optimal
alignment of substring S[1..i] with the first j
columns of C
y
Signature representations οf
families
• The major collections of signatures in protein are the
ΡROSΙTE database and the BLOCKS database derived from it.
• Helicases are proteins that help unwind double-stranded DNΑ
so that the DNA can be read for duplication, transcription,
recombination, οr repair.
• Α large fraction of the available information on the structure
and possible functions of the helicases has been obtained by
computer- assisted comparative analysis of their amino acid
sequences. This approach has led to the delineation of motifs
and patterns that are conserved in different subsets of the
helicases.
Introduction to computing multiple
string alignments
• Definition: Given a set of k>2 strings
S={S1,S2,..,Sk}, a local multiple alignment of S is
obtained by selecting one substring Si’ from each
string S  S and then globally aligning those
substrings
i
How to score multiple alignments
• Definition: Given a multiple alignment M, the induced pairwise alignment of two
strings Si and Sj is obtained from M by removing all rows except the two rows for Si
and Sj. That is, the induced alignment is the multiple alignment M restricted to Si and
Sj. Any two opposing spaces in that induced alignment can be removed if desired.
• Definition: The score of an induced pairwise alignment is determined using any
chosen scoring scheme for two-string alignment in the standard manner.
Multiple alignment with the sum-ofpairs (SP) objective function
• Definition: The sum of pairs (SP) score of a
multiple alignment M is the sum of the scores of
pairwise global alignments induced by M.
• The SΡ alignment problem Compute a global
multiple alignment M with minimum sιm-ofpairs score.
An exact solution to the SP alignment
problem
• Definition: Let S1, S2 and S3 denote three strings
of lengths n1, n2 and n3, respectively, and let
D(i,j,k) be the optimal SP score for aligning
S1[1..i], S2[1..j] and S3[1..k]. The score for a
match, mismatch, or space is specified by the
variables smatch, smis, and sspace, respectively.
Recurrences fοr a nonbοundary
cell(i, j)
For i=1 to n1 do
For j=l to n2 do
For k=l to n3 do
begin
if (S1(i) = S2(j)) then cij = smatch
else cij = smis;
if (S1(i) = S3(k)) then cik= smatch
else cik = smis;
if (S2(j) = S3(k)) then cjk= smatch
else ιjk := smis;
d1 = D(i-1, j-1, k-1) + cij + cik + cjk;
d2 = D(i-1, j-1,k) + cij + 2*sspace;
d3 = D(i- 1, j, k- 1) + cik + 2xsspace;
d4 = D(i, j- 1,k-1) + cjk + 2*sspace;
d5 = D(i-1, j, k) + 2*sspace;
d6 = D(i, j- 1, k) + 2*sspace;
d7 = D(i, j, k- 1) + 2*sspace;
D(i, j, k) :: Min[d1, d2, d3, d4, d5, d6, d7];
end;
A speedup for the exact solution
• Definition: Let d1,2(i,j) be the edit distance
between suffixes S1[l..n] and S2[j..n] of strings S1
and S2. Define d1,3(i,k) and d2,3(j,k) analogously.
• Key idea Recall that D(i, j,k) is the optimal SP
score for aligning S1[1..i], S2[1.. j],and S3[1 ..k).
If D(i, j, k) + d1,2(i, j) + d1,3(i, k) + d2,3( j, k) is
greater than z then node (i, j, k) cannot be on
any optimal path and so (in a forward
computation) D(i, j, k) need not be sent forward
to any cell.