Multiple Sequence Alignment
Download
Report
Transcript Multiple Sequence Alignment
Computational Molecular Biology
Multiple Sequence Alignment
Sequence Alignment
Problem Definition:
Given: 2 DNA or protein sequences
Find: Best match between them
What is an Alignment:
Given: 2 Strings S and S’
Goal: The lengths of S and S’ are the same by inserting
spaces (--; sometimes denote as ∆) into these strings
A
--
T
C
--
A
--
C
T
C
A
A
My T. Thai
[email protected]
2
Matches, Mismatches and Indels
Match: two aligned, identical characters in an
alignment
Mismatch: two aligned, unequal characters
Indel: A character aligned with a space
A A C T A C T -- C C T A A C A C T -- --- -- C T C C T A C C T -- -- T A C T T T
10 matches, 2 mismatches, 7 indels
My T. Thai
[email protected]
3
Basic Algorithmic Problem
Find the alignment of the two strings that:
max m where m = (# matches – mismatches –
indels)
Or min m where m is the SP-score of an alignment
m defines the similarity of the two strings, also
called Optimal Global Alignment
Biologically: a mismatch represents a mutation,
whereas an indel represents a historical
insertion or deletion of a single character
My T. Thai
[email protected]
4
Multiple Sequence Alignment
Problem Definition:
Similar to the sequence alignment problem but the
input has more than 2 strings
Challenges:
NP-hard, MAX-SNP
Guarantee factor: 2 – 2/k where k is the number of
the input sequences.
More work to reduce the time and space complexity
My T. Thai
[email protected]
5
Sum of Pairs Score (SP-Score)
Given a finite alphabet and {} where ∆
denotes a space
Consider k sequences over that we want to
align. After an alignment, each sequence has
length l
A score d is assigned to each pair of letters:
My T. Thai
[email protected]
6
SP-Score
The SP-Score of an alignment A is defined as:
Consider a matrix of l columns and k rows where
the rows represents the sequences and columns
represent the letters
SP-Score is the sum of the scores of all columns:
Score of each column is the sum of the scores of all
distinct unordered pairs of letters in the column
Or we can view as sum of pairwise sequence
alignment values.
Find an (optimal) alignment to minimize the
SP-Score value
My T. Thai
[email protected]
7
Proving MSA with SP-Score that is a Metric
is NP-hard
My T. Thai
[email protected]
8
Some Notations
My T. Thai
[email protected]
9
Some Basic Properties
Lemma 1: Let s1, s2 be two sequences over Σ
such that l1=|s1|, l2=|s2|, l2≥l1 and there are m
symbols of s1 that are not in s2. Then every
alignment of the set {s1,s2} has at least m+l2-l1
mismatches
My T. Thai
[email protected]
10
My T. Thai
[email protected]
11
The construction
Reduce the vertex cover (or node cover) to MSA.
Vertex cover:
Instance: A graph G=(V,E) and an integer k≤|V|
Question: Is there a vertex cover V1 of G of size k or less?
MSA:
Instance: A set S={s1, …, sn} of finite sequences over a fixed
alphabet Σ, an SP-score and an integer C
Question: Is there a multiple alignment of the sequences in S that is
of value C or less?
My T. Thai
[email protected]
12
SP-Score (alphabet of size 6)
My T. Thai
[email protected]
13
The Reduction
So, we have
, T is a set of C2
sequences t and X contains C1 sequences x(k), where C1 and C2 will be determined
later
My T. Thai
[email protected]
14
An Example
My T. Thai
[email protected]
15
Intuition
By the above construction, an optimal alignment A of S is
obtained when A satisfies certain properties (called
standard alignment)
The value of standard alignment is bounded by a given
threshold C only where G has a vertex cover of size k
How to obtain:
Force d’s of the test sequences to be aligned with b’s of the edge
sequences
Only one b of each edge sequence can be aligned to a d
The number of such alignment determines the value of the
alignment
My T. Thai
[email protected]
16
Standard Alignemnt
My T. Thai
[email protected]
17
My T. Thai
[email protected]
18
My T. Thai
[email protected]
19
My T. Thai
[email protected]
20
My T. Thai
[email protected]
21
Let US and US,X denote the upper bounds of
D(AS) and D(AS,X) respectively
By Corollary 8 and Lemma 9, we have the
standard alignment has value not greater than
DSD + US + US,X
where DSD = D(AX) + D(AT) + D(AX,T) + D(AS,T)
over a standard alignment A
Now, let C1 > US and C2 > US + US,X, we can
prove that an optimal alignment must be a
standard one
My T. Thai
[email protected]
22
My T. Thai
[email protected]
23
My T. Thai
[email protected]
24
Show the NP-hardness of any scoring matrix in
a broad class M
Show that there is a scoring matrix M0 such
that MSA for M0 is MAX-SNP hard
My T. Thai
[email protected]
25
Interesting Observation
Via the brute force, optimal MSA contains very
few gaps
Suggesting the study of gap limitations:
Have an upper bound of the number of gaps one can insert during
the alignment
Special case:
Gap-0: No gap allows, but we can shift the strings for an alignment
(insert gaps at the beginning or at the end of a string)
Gap-0-1: a gap-0 alignment such that the gaps at the beginning or at
the end of each string is exactly one space
My T. Thai
[email protected]
26
Problem Definition
Given a finite alphabet {a1 ,, aw}
Scoring matrix M ( w1)( w1) ( si , j )i w, j w
For i, j > 0, si,j represents the penalty for aligning ai
with aj
For i > 0, s0,i and si,0 are called indel penalites
Gap opening penalties (in addition to the indel
penalties) for aligning ai with the first or last ∆ in
the string of ∆’s
My T. Thai
[email protected]
27
Generic Scoring Matrix
Where Σ={A,T}, x, y, x are fixed nonnegative numbers
and u > max{0, vA, vT} holds
• Let M2 be the class of all scoring matrices that contain a generic submatrix M
• Let M1 be the class of all scoring matrices that contain a sub-matrix isomorphic
to a generic matrix M with z > vT.
• Let M be the class of all scoring matrices that contain a submatrix isomorphic
to a generic matrix M with y > u and z > vT.
Theorem 1:
(a) The gap-0-1 multiple alignment problem is NP-hard for every scoring matrix M
in M2.
(b) The gap-0 multiple alignment problem is NP-hard for every M in M1
(c) The multiple alignment problem is NP-hard for every M in M
Note that M is quite broad and covers most scoring schemes used in
biological applications.
My T. Thai
[email protected]
28
Reduction
Reduce the MAX-CUT-B:
Given G=(V,E) where k=|V| and each vertex has a
degree at most B
Find a partition of V into two disjoint sets such that
to maximize the number of edges crossing these
two sets
Given a graph G=(V,E) with k vertices v0, …,
vk-1 and l edges e0, …, el-1. We will construct a
set of k2 sequences t0, …, tk -1 as follows:
2
My T. Thai
[email protected]
29
Reduction
For each vertex vi, construct a sequence ti such
that
for each edge em={vh, vi} incident at vi, h < i, n < k5,
set
where ti,j represents the character at the jth position in ti.
For other j, let ti,j = T
For i ≥ k, set ti = T T T … T with length k12l
My T. Thai
[email protected]
30
An Example
My T. Thai
[email protected]
31
Proof of Theorem 1(a)
We will show that a gap-0-1 alignment will
partition V into two disjoint subsets V0 and V1:
V0: all vertices vi such that ti remains in place (a space appends at
the end)
V1: all vertices vi such that ti shifts to the right
Thus, based on the alignment, we can find the cut.
And vice versa, based on the cut, we can find the
alignment
The left part is: prove that if k is sufficiently large,
the optimal gap-0-1 alignment yields a partion of V
with maximum edge cut.
My T. Thai
[email protected]
32
Proof of Theorem 1(a)
Let c denote the cut based on the alignment A
Consider all the sequences ti after that alignment A:
The total indel penalties is of order O(k4) (appears at the first and
last column in the SP score matrix)
The total number of mismatches before the alignment is 3k5l(k2-1)
To maximally reduce this number:
1 A-A match reduces 2 A-T mismatches
For each edge (vh, vi), if there are in different subsets (of the partition),
then a total of k5 A-A matches between sequences th and ti are created
No other A-T mismatches can be elimiated
Thus the SP-score:
k12lvTk2(k2-1)2+3k5l(u-vT)(k2-1)-ck5(2u-vA-vT)+O(k4)
My T. Thai
[email protected]
33
Theorem 2
Consider the following scoring matrix M0 for the
alphabet ∑0 = {A,T,C}.
(a) The gap-0-1 MSA problem is MAX-SNP-hard
(b) The gap-0 MSA problem in MAX-SNP-hard
(c) The MSA problem in MAX-SNP-hard
My T. Thai
[email protected]
34
MAX-SNP-hard Proof
To prove problem A’ is MAX-SNP-hard, we
need to L-reduce problem A, which is MAXSNP-hard to A’
L-reduce:
There are two polynomial-time algorithms f, g and
constants a, b > 0 such that for each instance I of A:
f produces an instance I’ = f(I) of A’ such that OPT(I’)
≤ aOPT(I)
Given any solution of I’ with cost c’, g produces a
solution of I with cost c such that |c-OPT(I)| ≤ b|c’OPT(I’)|
My T. Thai
[email protected]
35
Proof of Theorem 2
To prove MSA (with M0 and the scoring
matrix mentioned before) MAX-SNP-hard:
L-reduce the MAX-CUT-B to another optimization
problem, called A’, which is L-reduce to a scaled
version of MSA
Problem A’:
Given a graph G=(V,E) with bounded degree B. For
every partition P={V0, V1}, let cp be the size of cut
determined by P.
Find the partition P of V that minimizes dp = 3|E|2cp
My T. Thai
[email protected]
36
Show A’ is MAX-SNP-hard
Let f and g be an identity function
Set a = 3B and b = 2, we can easily prove the
two properties of the L-reduction since:
cp ≥|E|/B and dp = 3|E| - 2 cp ≤ 3 |E|
Any increase of cp by 1 = decrease dp by 2
My T. Thai
[email protected]
37
Show A’ L-reduce to scaled MSA
Similar to the above construction, we have:
My T. Thai
[email protected]
38
Similar to the proof of Theorem 1, we have the
optimal SP-score:
where
If the SP-score is scaled by a factor of k-5/2 for a
MSA of k sequences, then A’ L-reduce to
MSA.
My T. Thai
[email protected]
39
GENETIC ALGORITHMS
How do GAs work?
Create
a population of random solutions
Use natural selection:
crossover and mutation to improve the
solutions
Stop the operation if satisfying some certain criteria such
as:
No improvement on fitness function
The improvement is less than some certain threshold
The number of iteration is more than some certain
threhold
Terms and Definitions
Chromosomes
Potential solutions
Population
Collection of chromosomes
Generations
Successive populations
Terms and Definitions
Crossover
Exchange of genes between two chromosomes
Mutation
Random change of one or more genes in a
chromosome
Elitism
Copy the best solutions without doing crossover
or mutation.
Terms and Definitions
Offspring
New chromosome created by crossover
between two parent chromosomes
Fitness
function
Measures how “good” a chromosome is.
Encoding scheme
How do we represent every
chromosome/gene?
Binary, combination, syntax trees.
Why are GAs attractive?
No need for a particular algorithm to solve
the given problem. Only the fitness function
is required to evaluate the quality of the
solutions.
Implicitly a parallel technique and can be
implement efficiently on powerful parallel
computers for demanding large scale
problems.
Basic Outline of a GA
Initial population composed of random chromosomes,
called first generation
Evaluate the fitness of each chromosome in the
population
Create a new population:
Select two parent chromosomes from a population according
to their fitness
Crossover (with some probability) to form a new offspring
Mutation (with some probability) to mutate new offspring
Place new offspring in a new population
Process is repeated until a satisfactory solution evolves
Operations
Mutation Operation:
• Modify a single parent
• Try to avoid local minima
Let's see some running examples
Minimum of a function:
http://cs.felk.cvut.cz/~xobitko/ga/example_f.html
Elitism:
http://cs.felk.cvut.cz/~xobitko/ga/params.html
The travelling salesman problem:
http://cs.felk.cvut.cz/~xobitko/ga/tspexample.htm
l
Multiple Sequence Alignment
Fitness function is used to compare the
different alignments
Based on the number of matching symbols and the
number and size of gaps
Also called the cost function
Different weights for different types of matches
Gap costs
can be simple and count the total matching symbols
can be complicated and consider the type of
matching symbols, location in the sequence,
neighboring symbols etc.
Approximation Algorithms
My T. Thai
[email protected]
51
Scoring method
Score zero for a match or for two opposing
spaces
Score one for a mismatch or for a character
opposite a space
Assumptions:
Assume that two opposing spaces have a zero
value
Assume other values satisfies triangle
inequality
s(x,z) ≤ s(x,y) + s(y,z)
s(x,z) – cost of transforming character x into
character z
Objective Functions
Two objective functions
SP
The sum of the values of pairwise alignments induced
by an alignment A
TA
Using the topology of the tree, map the strings to the
nodes of the tree
The sum of the selected pairwise alignments is called
tree alignment
Center Star Method
For a set of k strings X
Choose a center string Xc of X which minimizes
Σj≠cD(Xc,Xj)
Let M = min Σj≠cD(Xc,Xj)
Center star is a star tree of k nodes with the center
node labeled Xc and each of the k-1 remaining
nodes labeled by a distinct string in X \ {Xc}
If Xi and Xj are strings labeling adjacent nodes of
tree T, then alignment of Xi and Xj induced by A(T)
has value D(Xi,Xj)
Center Star Method – Alg Ac
Do an optimal alignment for each pair (Xc, Xj)
for all j ≠ c
s0 = max number of spaces placed before the
first char of Xc
sf = max number of spaces placed after the last
char of Xc
si = max number of spaces placed between Xc(i)
and Xc(i+1)
Center Star Method – Alg Ac
For Xc, insert s0, si, and sf spaces at the
beginning, between, and the end of Xc
respectively. Call X’c
Then for each Xj, do the optimal alignment
without modifying X’c
My T. Thai
[email protected]
57
Analysis
d(Xi,Xj) ≥ D(Xi,Xj)
V(Ac) = Σi<jd(Xi,Xj)
V(Ac) is at most twice the value of the optimal
multiple alignment of X
My T. Thai
[email protected]
58
Analysis
Lemma 3.1: For any 2 strings Xi,Xj, we
have:
d(Xi,Xj) ≤ d(Xi,Xc) + d(Xc,Xj)
= D(Xi,Xc) + D(Xc,Xj)
triangle inequality
Analysis
A* be the optimal multiple alignment of k
strings X
Define: V(A*) = Σi<jd*(Xi,Xj)
Analysis
Theorem 3.1
V(Ac) / V(A*) ≤ 2(k-1)/ k < 2
Proof:
Disadvantages
Requires all pairwise alignments
Computationally expensive
Faster, Randomized alignments
Randomly select string Xi
Build multiple alignment with star centered at Xi
Select best multiple alignment A from p such stars
At most (k-1)p pairwise alignments need to be
computed
Randomized Alignments
Theorem 3.2
For any r >1, let e(r) be the expected number of
stars needed to be chosen at random before the
value of best resulting alignment is within a
factor of 2+1/(r-1) of the optimal alignment.
Then e(r) ≤ r.
e(r) is independent of k and the length of the
strings.
Proof of Theorem 3.2
For r = 2, for each string Xi
define M(i) = ΣjD(Xi,Xj) then M(c) = M
From Theorem 3.1,
Σ(i,j)D(Xi,Xj) = ΣjM(i) ≤ 2(k-1)M so the Avg
value of M(i) < 2 M
Since min M(i) = M, then Median M(i) < 3M
Number of centers selected before a selected
M(i) is less than the median = 2
Proof
Suppose median is ∂M for 1 ≤ ∂ ≤ 3
Then Σ(i,j)D(Xi,Xj)≥ kM/2 + k ∂ M/2
Value of the alignment obtained from any
below median star ≤ 2(k-1) ∂ M
Therefore, error ratio for this star ≤ = 2 ∂ / (1/2
+ ∂ /2)
When ∂ = 3, error ratio = 3.
So we have e(2) ≤ 2
Proof
Now generalize this proof for r > 2
At least k/r stars have M(i) less than or equal to
(2r-1)M/(r-1)
Minimum M(i) is M
Mean < 2M
expected number of stars to pick with M(i) < ∂
M is r for 1 ≤ ∂ ≤ (2r-1)/(r-1)
error ratio = 2 ∂ /[1/r + (r-1) ∂ /r]
(2r-1)/(r-1)=2 + 1/(r-1)
Theorem 3.3
Picking p stars at random, the best resulting
alignment will have value within a factor of 2 +
1/(r-1) of the optimal with probability at least
1 – [(r-1)/r]p
Center Star Method
Proof
From theorem 3.2, if Median value was actually
3M
For half the stars M(i) = M and M(i) = 3M for the
other half
Σ(i,j)D(Xi,Xj)=2kM
optimal SP alignment can be obtained from any
center string Xiwith M(i) = M
Probability of selecting such a string is one-half
Tree Alignment Method
Typical approach:
first find multiple alignment and then build a tree
showing the evolutionary derivations
Another approach (called tree alignment):
first choose the typology of the tree and then map
the strings to the nodes of the tree
Alignment is the pairwise alignments of the strings
at the ends of the edges of the tree
Formal Definitions
Let K be an input set of k strings
K’ K be a set of strings containing K
Evolutionary tree TK’ for K is a tree:
with at least k nodes
each string in K’ labels exactly one node & each
node gets exactly one label in K’
The value of TK’ : V(TK’) = ΣD(X,Y)
the problem is to find a set of strings K’ and T(K’)
for K which minimizes V (TK’)
The alignment value D(X,Y ) is interpreted as
the minimum “cost" to transform string X to
string Y
The sum of the alignment values of the edges
gives the evolutionary cost implied by the tree.
Method
Let G be a graph with k nodes labeled with a
distinct string in K
Each edge (X,Y) has a weight D(X,Y)
Find the MST of G. This MST is an
evolutionary tree for K
Analysis
T* denote the optimal evolutionary tree for K.
Prove: V(MST)/V(T*) < 2OPT
Let C be a traversal of edges of T* which
traverses everyy edge exactly once in each
direction
Let C1, …, Ck be the order that C encounters
Let V(C) = D(Ck,C1) + Σi<kD(Ci,Ci+1)
Analysis
My T. Thai
[email protected]
74
Analysis
Corollary 4.1: V(C) ≤ 2V(T*),
Let D(Ci*,Ci*+1) be the largest distance of any
adjacent strings in C traversal
Lemma(4.2)
V(MST) ≤ V(C) – D(Ci*,Ci*+1) ≤ V(C) – V(C)/K
Analysis
Theorem 4.1
For any set K of k strings, we have:
V(MST)/ V(T*k) ≤ 2(k-1)/k < 2
Theorem 4.2
V(MST) / V(T*k) ≤ (k-1)/k V(C)/V(T*k) ≤ 2 (k-1)/k
Corollary 4.2
V(T*k) > kV(MST)/2(k-1)
Constrained MSA
Motivation
General SP MSA problem:
NP-completeness has already been established
Appromixation algorithms have been developed
Heuristics are also avaliable
Constrained MSA:
Biologists often have additional knowledge of data (e.g. active site
residues)
Additional knowledge can specify matches at certain locations
Models allow users to provide additional constraints
Definition of CMSA Problem
Suppose that P = p1p2 . . . pα is a common
subsequence of S1, S2, . . . , SK
The constrained multiple sequence alignment of S
with respect to P is:
an MSA A with the constraints that there are α columns
in A, c1, c2, . . . , cα with c1 < c2 < …< cα, such that the
characters of column ci, 1 ≤ i ≤ α, are all equal to pi.
Optimal CPSA
Dynamic Algorithm
My T. Thai
[email protected]
81
Time and Space Complexities
My T. Thai
[email protected]
82
CMSA
The improvement of CPSA in turn improves the time & space complexity of
Progressive CMSA from O(αkn4) and O(αn4) to O(αk2n2) and O(αn2).
Optimal CMSA
This Optimal CMSA algorithm involves the creation of a matrix with k+1 dimensions.
(Assume δ(x,y) is the distance function and satisfies the triangle inequality.)
Let D(i1, . . . , ik; γ) be the optimal CMSA score matrix for
{S1[1..i1], . . . , Sk[1..ik]} where P[1..γ] is aligned in γ columns.
Then optimal alignment score is D(n1, . . . , nk; α), where ni =|Si|.
Computing D:
D({0}k; 0) = 0
Let εj = 0 or 1 with εjSj[ij] where j = 0 represents a space, and
δ(x1, . . . , xk) = Σ1≤i<j≤kδ(xi, xj).
D(i1, i2, . . . , ik; γ) is the minimum of:
if S1[i1] = . . . = Sk[ik] = P[γ],
D(i1 − 1, . . . , ik − 1; γ − 1) + δ(S1[i1], . . . , Sk[ik])
minε∈{0,1}k (D(i1 − ε1, . . . , ik − εk; γ) + δ(ε1S1[i1], . . . , εkSk[ik])).
These values can be computed using dynamic programming.
CMSA (Center Star)
The Center-Star method proposed for the general
MSA problem can be modified to apply to the CMSA
problem.
Consider each sequence as the center, Sc. Consider each
list position that Sc is aligned with P.
Find the minimum star-sum score Sc.
Create a constrained alignment matrix by merging the
constrained pairwise sequence alignments between Sc
& S j.
CMSA (Center Star)
The recurrence of Thm. 3.1 is
only slightly modified:
Example
My T. Thai
[email protected]
86