Multiple Sequence Alignment

Transcript Multiple Sequence Alignment

Computational Molecular Biology
Multiple Sequence Alignment
Sequence Alignment
 Problem Definition:
 Given: 2 DNA or protein sequences
 Find: Best match between them
 What is an Alignment:
 Given: 2 Strings S and S’
 Goal: The lengths of S and S’ are the same by inserting
spaces (--; sometimes denote as ∆) into these strings
A
--
T
C
--
A
--
C
T
C
A
A
My T. Thai
[email protected]
2
Matches, Mismatches and Indels
 Match: two aligned, identical characters in an
alignment
 Mismatch: two aligned, unequal characters
 Indel: A character aligned with a space
A A C T A C T -- C C T A A C A C T -- --- -- C T C C T A C C T -- -- T A C T T T
10 matches, 2 mismatches, 7 indels
My T. Thai
[email protected]
3
Basic Algorithmic Problem
 Find the alignment of the two strings that:
 max m where m = (# matches – mismatches –
indels)
 Or min m where m is the SP-score of an alignment
 m defines the similarity of the two strings, also
called Optimal Global Alignment
 Biologically: a mismatch represents a mutation,
whereas an indel represents a historical
insertion or deletion of a single character
My T. Thai
[email protected]
4
Multiple Sequence Alignment
 Problem Definition:
 Similar to the sequence alignment problem but the
input has more than 2 strings
 Challenges:
 NP-hard, MAX-SNP
 Guarantee factor: 2 – 2/k where k is the number of
the input sequences.
 More work to reduce the time and space complexity
My T. Thai
[email protected]
5
Sum of Pairs Score (SP-Score)
 Given a finite alphabet  and     {} where ∆
denotes a space
 Consider k sequences over  that we want to
align. After an alignment, each sequence has
length l
 A score d is assigned to each pair of letters:
My T. Thai
[email protected]
6
SP-Score
 The SP-Score of an alignment A is defined as:
 Consider a matrix of l columns and k rows where
the rows represents the sequences and columns
represent the letters
 SP-Score is the sum of the scores of all columns:
 Score of each column is the sum of the scores of all
distinct unordered pairs of letters in the column
 Or we can view as sum of pairwise sequence
alignment values.
 Find an (optimal) alignment to minimize the
SP-Score value
My T. Thai
[email protected]
7
Proving MSA with SP-Score that is a Metric
is NP-hard
My T. Thai
[email protected]
8
Some Notations
My T. Thai
[email protected]
9
Some Basic Properties
 Lemma 1: Let s1, s2 be two sequences over Σ
such that l1=|s1|, l2=|s2|, l2≥l1 and there are m
symbols of s1 that are not in s2. Then every
alignment of the set {s1,s2} has at least m+l2-l1
mismatches
My T. Thai
[email protected]
10
My T. Thai
[email protected]
11
The construction
 Reduce the vertex cover (or node cover) to MSA.
 Vertex cover:
 Instance: A graph G=(V,E) and an integer k≤|V|
 Question: Is there a vertex cover V1 of G of size k or less?
 MSA:
 Instance: A set S={s1, …, sn} of finite sequences over a fixed
alphabet Σ, an SP-score and an integer C
 Question: Is there a multiple alignment of the sequences in S that is
of value C or less?
My T. Thai
[email protected]
12
SP-Score (alphabet of size 6)
My T. Thai
[email protected]
13
The Reduction
So, we have
, T is a set of C2
sequences t and X contains C1 sequences x(k), where C1 and C2 will be determined
later
My T. Thai
[email protected]
14
An Example
My T. Thai
[email protected]
15
Intuition
 By the above construction, an optimal alignment A of S is
obtained when A satisfies certain properties (called
standard alignment)
 The value of standard alignment is bounded by a given
threshold C only where G has a vertex cover of size k
 How to obtain:
 Force d’s of the test sequences to be aligned with b’s of the edge
sequences
 Only one b of each edge sequence can be aligned to a d
 The number of such alignment determines the value of the
alignment
My T. Thai
[email protected]
16
Standard Alignemnt
My T. Thai
[email protected]
17
My T. Thai
[email protected]
18
My T. Thai
[email protected]
19
My T. Thai
[email protected]
20
My T. Thai
[email protected]
21
 Let US and US,X denote the upper bounds of
D(AS) and D(AS,X) respectively
 By Corollary 8 and Lemma 9, we have the
standard alignment has value not greater than
DSD + US + US,X
 where DSD = D(AX) + D(AT) + D(AX,T) + D(AS,T)
over a standard alignment A
 Now, let C1 > US and C2 > US + US,X, we can
prove that an optimal alignment must be a
standard one
My T. Thai
[email protected]
22
My T. Thai
[email protected]
23
My T. Thai
[email protected]
24
Show the NP-hardness of any scoring matrix in
a broad class M
Show that there is a scoring matrix M0 such
that MSA for M0 is MAX-SNP hard
My T. Thai
[email protected]
25
Interesting Observation
 Via the brute force, optimal MSA contains very
few gaps
 Suggesting the study of gap limitations:
 Have an upper bound of the number of gaps one can insert during
the alignment
 Special case:
 Gap-0: No gap allows, but we can shift the strings for an alignment
(insert gaps at the beginning or at the end of a string)
 Gap-0-1: a gap-0 alignment such that the gaps at the beginning or at
the end of each string is exactly one space
My T. Thai
[email protected]
26
Problem Definition
 Given a finite alphabet   {a1 ,, aw}
 Scoring matrix M ( w1)( w1)  ( si , j )i w, j  w
 For i, j > 0, si,j represents the penalty for aligning ai
with aj
 For i > 0, s0,i and si,0 are called indel penalites
 Gap opening penalties (in addition to the indel
penalties) for aligning ai with the first or last ∆ in
the string of ∆’s
My T. Thai
[email protected]
27
Generic Scoring Matrix
Where Σ={A,T}, x, y, x are fixed nonnegative numbers
and u > max{0, vA, vT} holds
• Let M2 be the class of all scoring matrices that contain a generic submatrix M
• Let M1 be the class of all scoring matrices that contain a sub-matrix isomorphic
to a generic matrix M with z > vT.
• Let M be the class of all scoring matrices that contain a submatrix isomorphic
to a generic matrix M with y > u and z > vT.
Theorem 1:
(a) The gap-0-1 multiple alignment problem is NP-hard for every scoring matrix M
in M2.
(b) The gap-0 multiple alignment problem is NP-hard for every M in M1
(c) The multiple alignment problem is NP-hard for every M in M
Note that M is quite broad and covers most scoring schemes used in
biological applications.
My T. Thai
[email protected]
28
Reduction
 Reduce the MAX-CUT-B:
 Given G=(V,E) where k=|V| and each vertex has a
degree at most B
 Find a partition of V into two disjoint sets such that
to maximize the number of edges crossing these
two sets
 Given a graph G=(V,E) with k vertices v0, …,
vk-1 and l edges e0, …, el-1. We will construct a
set of k2 sequences t0, …, tk -1 as follows:
2
My T. Thai
[email protected]
29
Reduction
 For each vertex vi, construct a sequence ti such
that
 for each edge em={vh, vi} incident at vi, h < i, n < k5,
set
where ti,j represents the character at the jth position in ti.
 For other j, let ti,j = T
 For i ≥ k, set ti = T T T … T with length k12l
My T. Thai
[email protected]
30
An Example
My T. Thai
[email protected]
31
Proof of Theorem 1(a)
 We will show that a gap-0-1 alignment will
partition V into two disjoint subsets V0 and V1:
 V0: all vertices vi such that ti remains in place (a space appends at
the end)
 V1: all vertices vi such that ti shifts to the right
 Thus, based on the alignment, we can find the cut.
And vice versa, based on the cut, we can find the
alignment
 The left part is: prove that if k is sufficiently large,
the optimal gap-0-1 alignment yields a partion of V
with maximum edge cut.
My T. Thai
[email protected]
32
Proof of Theorem 1(a)
 Let c denote the cut based on the alignment A
 Consider all the sequences ti after that alignment A:
 The total indel penalties is of order O(k4) (appears at the first and
last column in the SP score matrix)
 The total number of mismatches before the alignment is 3k5l(k2-1)
 To maximally reduce this number:



1 A-A match reduces 2 A-T mismatches
For each edge (vh, vi), if there are in different subsets (of the partition),
then a total of k5 A-A matches between sequences th and ti are created
No other A-T mismatches can be elimiated
 Thus the SP-score:
 k12lvTk2(k2-1)2+3k5l(u-vT)(k2-1)-ck5(2u-vA-vT)+O(k4)
My T. Thai
[email protected]
33
Theorem 2
Consider the following scoring matrix M0 for the
alphabet ∑0 = {A,T,C}.
(a) The gap-0-1 MSA problem is MAX-SNP-hard
(b) The gap-0 MSA problem in MAX-SNP-hard
(c) The MSA problem in MAX-SNP-hard
My T. Thai
[email protected]
34
MAX-SNP-hard Proof
 To prove problem A’ is MAX-SNP-hard, we
need to L-reduce problem A, which is MAXSNP-hard to A’
 L-reduce:
 There are two polynomial-time algorithms f, g and
constants a, b > 0 such that for each instance I of A:
 f produces an instance I’ = f(I) of A’ such that OPT(I’)
≤ aOPT(I)
 Given any solution of I’ with cost c’, g produces a
solution of I with cost c such that |c-OPT(I)| ≤ b|c’OPT(I’)|
My T. Thai
[email protected]
35
Proof of Theorem 2
 To prove MSA (with M0 and the scoring
matrix mentioned before) MAX-SNP-hard:
 L-reduce the MAX-CUT-B to another optimization
problem, called A’, which is L-reduce to a scaled
version of MSA
 Problem A’:
 Given a graph G=(V,E) with bounded degree B. For
every partition P={V0, V1}, let cp be the size of cut
determined by P.
 Find the partition P of V that minimizes dp = 3|E|2cp
My T. Thai
[email protected]
36
Show A’ is MAX-SNP-hard
 Let f and g be an identity function
 Set a = 3B and b = 2, we can easily prove the
two properties of the L-reduction since:
 cp ≥|E|/B and dp = 3|E| - 2 cp ≤ 3 |E|
 Any increase of cp by 1 = decrease dp by 2
My T. Thai
[email protected]
37
Show A’ L-reduce to scaled MSA
Similar to the above construction, we have:
My T. Thai
[email protected]
38
 Similar to the proof of Theorem 1, we have the
optimal SP-score:
where
 If the SP-score is scaled by a factor of k-5/2 for a
MSA of k sequences, then A’ L-reduce to
MSA.
My T. Thai
[email protected]
39
GENETIC ALGORITHMS
How do GAs work?
 Create
a population of random solutions
 Use natural selection:
 crossover and mutation to improve the
solutions
 Stop the operation if satisfying some certain criteria such
as:
 No improvement on fitness function
 The improvement is less than some certain threshold
 The number of iteration is more than some certain
threhold
Terms and Definitions
 Chromosomes
 Potential solutions
 Population
 Collection of chromosomes
 Generations
 Successive populations
Terms and Definitions
 Crossover
 Exchange of genes between two chromosomes
 Mutation
 Random change of one or more genes in a
chromosome
 Elitism
 Copy the best solutions without doing crossover
or mutation.
Terms and Definitions
 Offspring
 New chromosome created by crossover
between two parent chromosomes
 Fitness
function
 Measures how “good” a chromosome is.
 Encoding scheme
 How do we represent every
chromosome/gene?
 Binary, combination, syntax trees.
Why are GAs attractive?
 No need for a particular algorithm to solve
the given problem. Only the fitness function
is required to evaluate the quality of the
solutions.
 Implicitly a parallel technique and can be
implement efficiently on powerful parallel
computers for demanding large scale
problems.
Basic Outline of a GA
 Initial population composed of random chromosomes,
called first generation
 Evaluate the fitness of each chromosome in the
population
 Create a new population:
 Select two parent chromosomes from a population according
to their fitness
 Crossover (with some probability) to form a new offspring
 Mutation (with some probability) to mutate new offspring
 Place new offspring in a new population
 Process is repeated until a satisfactory solution evolves
Operations
Mutation Operation:
• Modify a single parent
• Try to avoid local minima
Let's see some running examples
 Minimum of a function:
 http://cs.felk.cvut.cz/~xobitko/ga/example_f.html
 Elitism:
 http://cs.felk.cvut.cz/~xobitko/ga/params.html
 The travelling salesman problem:
 http://cs.felk.cvut.cz/~xobitko/ga/tspexample.htm
l
Multiple Sequence Alignment
 Fitness function is used to compare the
different alignments
 Based on the number of matching symbols and the
number and size of gaps
 Also called the cost function
 Different weights for different types of matches
 Gap costs
 can be simple and count the total matching symbols
 can be complicated and consider the type of
matching symbols, location in the sequence,
neighboring symbols etc.
Approximation Algorithms
My T. Thai
[email protected]
51
Scoring method
 Score zero for a match or for two opposing
spaces
 Score one for a mismatch or for a character
opposite a space
Assumptions:
 Assume that two opposing spaces have a zero
value
 Assume other values satisfies triangle
inequality
 s(x,z) ≤ s(x,y) + s(y,z)
 s(x,z) – cost of transforming character x into
character z
Objective Functions
 Two objective functions
 SP
 The sum of the values of pairwise alignments induced
by an alignment A
 TA
 Using the topology of the tree, map the strings to the
nodes of the tree
 The sum of the selected pairwise alignments is called
tree alignment
Center Star Method
 For a set of k strings X
 Choose a center string Xc of X which minimizes
Σj≠cD(Xc,Xj)
 Let M = min Σj≠cD(Xc,Xj)
 Center star is a star tree of k nodes with the center
node labeled Xc and each of the k-1 remaining
nodes labeled by a distinct string in X \ {Xc}
 If Xi and Xj are strings labeling adjacent nodes of
tree T, then alignment of Xi and Xj induced by A(T)
has value D(Xi,Xj)
Center Star Method – Alg Ac
 Do an optimal alignment for each pair (Xc, Xj)
for all j ≠ c
 s0 = max number of spaces placed before the
first char of Xc
 sf = max number of spaces placed after the last
char of Xc
 si = max number of spaces placed between Xc(i)
and Xc(i+1)
Center Star Method – Alg Ac
 For Xc, insert s0, si, and sf spaces at the
beginning, between, and the end of Xc
respectively. Call X’c
 Then for each Xj, do the optimal alignment
without modifying X’c
My T. Thai
[email protected]
57
Analysis
 d(Xi,Xj) ≥ D(Xi,Xj)
 V(Ac) = Σi<jd(Xi,Xj)
 V(Ac) is at most twice the value of the optimal
multiple alignment of X
My T. Thai
[email protected]
58
Analysis
 Lemma 3.1: For any 2 strings Xi,Xj, we
have:
d(Xi,Xj) ≤ d(Xi,Xc) + d(Xc,Xj)
= D(Xi,Xc) + D(Xc,Xj)
 triangle inequality
Analysis
 A* be the optimal multiple alignment of k
strings X
 Define: V(A*) = Σi<jd*(Xi,Xj)
Analysis
 Theorem 3.1
V(Ac) / V(A*) ≤ 2(k-1)/ k < 2
 Proof:
Disadvantages
 Requires all pairwise alignments
 Computationally expensive
 Faster, Randomized alignments




Randomly select string Xi
Build multiple alignment with star centered at Xi
Select best multiple alignment A from p such stars
At most (k-1)p pairwise alignments need to be
computed
Randomized Alignments
 Theorem 3.2
For any r >1, let e(r) be the expected number of
stars needed to be chosen at random before the
value of best resulting alignment is within a
factor of 2+1/(r-1) of the optimal alignment.
Then e(r) ≤ r.
 e(r) is independent of k and the length of the
strings.
Proof of Theorem 3.2
 For r = 2, for each string Xi
define M(i) = ΣjD(Xi,Xj) then M(c) = M
From Theorem 3.1,
Σ(i,j)D(Xi,Xj) = ΣjM(i) ≤ 2(k-1)M so the Avg
value of M(i) < 2 M
 Since min M(i) = M, then Median M(i) < 3M
Number of centers selected before a selected
M(i) is less than the median = 2
Proof
 Suppose median is ∂M for 1 ≤ ∂ ≤ 3
Then Σ(i,j)D(Xi,Xj)≥ kM/2 + k ∂ M/2
 Value of the alignment obtained from any
below median star ≤ 2(k-1) ∂ M
Therefore, error ratio for this star ≤ = 2 ∂ / (1/2
+ ∂ /2)
 When ∂ = 3, error ratio = 3.
 So we have e(2) ≤ 2
Proof
 Now generalize this proof for r > 2
 At least k/r stars have M(i) less than or equal to
(2r-1)M/(r-1)
 Minimum M(i) is M
 Mean < 2M
 expected number of stars to pick with M(i) < ∂
M is r for 1 ≤ ∂ ≤ (2r-1)/(r-1)
 error ratio = 2 ∂ /[1/r + (r-1) ∂ /r]
 (2r-1)/(r-1)=2 + 1/(r-1)
Theorem 3.3
 Picking p stars at random, the best resulting
alignment will have value within a factor of 2 +
1/(r-1) of the optimal with probability at least
1 – [(r-1)/r]p
Center Star Method
 Proof
 From theorem 3.2, if Median value was actually
3M
 For half the stars M(i) = M and M(i) = 3M for the
other half
 Σ(i,j)D(Xi,Xj)=2kM
 optimal SP alignment can be obtained from any
center string Xiwith M(i) = M
 Probability of selecting such a string is one-half
Tree Alignment Method
 Typical approach:
 first find multiple alignment and then build a tree
showing the evolutionary derivations
 Another approach (called tree alignment):
 first choose the typology of the tree and then map
the strings to the nodes of the tree
 Alignment is the pairwise alignments of the strings
at the ends of the edges of the tree
Formal Definitions
 Let K be an input set of k strings
 K’  K be a set of strings containing K
 Evolutionary tree TK’ for K is a tree:
 with at least k nodes
 each string in K’ labels exactly one node & each
node gets exactly one label in K’
 The value of TK’ : V(TK’) = ΣD(X,Y)
 the problem is to find a set of strings K’ and T(K’)
for K which minimizes V (TK’)
 The alignment value D(X,Y ) is interpreted as
the minimum “cost" to transform string X to
string Y
 The sum of the alignment values of the edges
gives the evolutionary cost implied by the tree.
Method
 Let G be a graph with k nodes labeled with a
distinct string in K
 Each edge (X,Y) has a weight D(X,Y)
 Find the MST of G. This MST is an
evolutionary tree for K
Analysis
 T* denote the optimal evolutionary tree for K.
 Prove: V(MST)/V(T*) < 2OPT
 Let C be a traversal of edges of T* which
traverses everyy edge exactly once in each
direction
 Let C1, …, Ck be the order that C encounters
 Let V(C) = D(Ck,C1) + Σi<kD(Ci,Ci+1)
Analysis
My T. Thai
[email protected]
74
Analysis
 Corollary 4.1: V(C) ≤ 2V(T*),
 Let D(Ci*,Ci*+1) be the largest distance of any
adjacent strings in C traversal
 Lemma(4.2)
V(MST) ≤ V(C) – D(Ci*,Ci*+1) ≤ V(C) – V(C)/K
Analysis
 Theorem 4.1
For any set K of k strings, we have:
V(MST)/ V(T*k) ≤ 2(k-1)/k < 2
 Theorem 4.2
V(MST) / V(T*k) ≤ (k-1)/k V(C)/V(T*k) ≤ 2 (k-1)/k
 Corollary 4.2
V(T*k) > kV(MST)/2(k-1)
Constrained MSA
Motivation
General SP MSA problem:
 NP-completeness has already been established
 Appromixation algorithms have been developed
 Heuristics are also avaliable
Constrained MSA:
 Biologists often have additional knowledge of data (e.g. active site
residues)
 Additional knowledge can specify matches at certain locations
 Models allow users to provide additional constraints
Definition of CMSA Problem
 Suppose that P = p1p2 . . . pα is a common
subsequence of S1, S2, . . . , SK
 The constrained multiple sequence alignment of S
with respect to P is:
 an MSA A with the constraints that there are α columns
in A, c1, c2, . . . , cα with c1 < c2 < …< cα, such that the
characters of column ci, 1 ≤ i ≤ α, are all equal to pi.
Optimal CPSA
Dynamic Algorithm
My T. Thai
[email protected]
81
Time and Space Complexities
My T. Thai
[email protected]
82
CMSA
The improvement of CPSA in turn improves the time & space complexity of
Progressive CMSA from O(αkn4) and O(αn4) to O(αk2n2) and O(αn2).
Optimal CMSA
This Optimal CMSA algorithm involves the creation of a matrix with k+1 dimensions.
(Assume δ(x,y) is the distance function and satisfies the triangle inequality.)

Let D(i1, . . . , ik; γ) be the optimal CMSA score matrix for
{S1[1..i1], . . . , Sk[1..ik]} where P[1..γ] is aligned in γ columns.

Then optimal alignment score is D(n1, . . . , nk; α), where ni =|Si|.
Computing D:

D({0}k; 0) = 0

Let εj = 0 or 1 with εjSj[ij] where j = 0 represents a space, and
δ(x1, . . . , xk) = Σ1≤i<j≤kδ(xi, xj).
D(i1, i2, . . . , ik; γ) is the minimum of:

if S1[i1] = . . . = Sk[ik] = P[γ],


D(i1 − 1, . . . , ik − 1; γ − 1) + δ(S1[i1], . . . , Sk[ik])
minε∈{0,1}k (D(i1 − ε1, . . . , ik − εk; γ) + δ(ε1S1[i1], . . . , εkSk[ik])).
These values can be computed using dynamic programming.
CMSA (Center Star)
The Center-Star method proposed for the general
MSA problem can be modified to apply to the CMSA
problem.
 Consider each sequence as the center, Sc. Consider each
list position that Sc is aligned with P.
 Find the minimum star-sum score Sc.
 Create a constrained alignment matrix by merging the
constrained pairwise sequence alignments between Sc
& S j.
CMSA (Center Star)
The recurrence of Thm. 3.1 is
only slightly modified:
Example
My T. Thai
[email protected]
86

Multiple Sequence Alignment

Transcript Multiple Sequence Alignment

Directory