Transcript MSA Theory

Multiple Sequence Alignment
(MSA)
1
Multiple Alignment
• Number of sequences >2
• Global alignment
• Seek an alignment that maximizes score
2
Multiple Alignment
Multiple sequence alignment can be viewed as an extension of
pairwise sequence alignment, but the complexity of the
computation grows exponentially with the number of
sequences being considered and their lengths and, therefore, it is
not feasible to search exhaustively for the optimal alignment
even for even a modest number of short sequences.
For example, there are approximately 1038
possible alignments that can be produced from
five 10-nucleotide-long sequences
3
Possible Alignments
TCGGCTCGAC
GAGCGTGATGGCGAGC--CGTCGTAAC-
TCG-GC-TCGAC
GAGCGT-GAT-G-GCG-AG---C
CG-TCGTA--AC … 1038
4
Multiple Alignment
• In addition to the scores employed by
pairwise alignment, we have additional
scores.
• However, we still seek an alignment that
maximizes score
5
Similarity Scoring Scheme
{4,0,0,0,0} = 1
{3,1,0,0,0} = 0.75
{2,2,0,0,0} = 0.5
{2,1,1,0,0} = 0.5
{1,1,1,1,0} = 0
5 character states: A, C, T, G, –
6
Two Possible Alignments
GTCGTAGTCGGCTCGAC
GTCTAGCGAGCGTGATGCGAAGAGGCGAGC--GCCGTCGCGTCGTAAC-
1*1 + 3*0.75 + 12*0.5 = 9.25
4*1 + 13*0.75 + 2*0.5 = 11.75
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
7
Alignments can be easy or
difficult
GCGGCCCA
GCGGCCCA
GCGTTCCA
GCGTCCCA
GCGGCGCA
***...**
TCAGGTAGTT
TCAGGTAGTT
TCAGCTGGTT
TCAGCTAGTT
TTAGCTAGTT
*.**.*.***
GGTGG
GGTGG
GGTGG
GGTGG
GGTGA
****.
TTGACATG
T-GACATG
TTGGCATG
TTGACATG
TTGACATC
* *.***.
CCGGGG---A AACCG
CCGGTG--GT AAGCC
-CTAGG---A ACGCG Difficult
-CTAGGGAAC ACGCG
-CTCTG---A ACGCG
*...*
. *..*.
Easy
8
Multiple Alignment
• Dynamic programming (exhaustive, exact)
– Consider 2 protein sequences of 100 amino acids
in length.
– If it takes 1002 (103) seconds to exhaustively align
these sequences, then it will take 104 seconds to
align 3 sequences, 105 to align 4 sequences, etc.
– It will take ~1021 seconds to align 20 sequences.
One year is ~3 ✕ 107 seconds. The age of the
visible universe is ~1018 seconds.
• Progressive alignment (heuristic, approximate)
9
Progressive Alignment
• Devised by Feng and Doolittle in 1987.
• Essentially a heuristic method and, as such, not
guaranteed to find the “optimal” or “best” alignment.
• Requires n(n 1) pairwise alignments as a starting
2
point
• One of the first successful implementation was Clustal
(by Des Higgins et al.)

10
• Sum of pairwise alignment scores
– For n sequences, there are n(n-1)/2 pairs
• Based on pairwise alignment scores
– Build n by n table of pairwise scores
• Align similar sequences first
– After alignment, consider as single sequence
– Continue aligning with further sequences
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
12
First step:
A
B
C
D
Compute the pairwise
alignments for all against all
(6 pairwise alignments)
the similarities are stored in a
table
A
B
C
D
A
B
11
C
3
1
D
2
2
10
13
Second step:
A B C D
A
B 11
Cluster the sequences to create a tree
(guide tree):
• Represents the order in which pairs of
sequences are to be aligned
• Similar sequences are neighbors in the
tree
• Distant sequences are distant from each
other in the tree
C
3
1
D
2
2 10
A
B
Guide tree
C
D
14
Guide Tree
A guide tree is not a phylogenetic tree!
15
Third step:
Align most similar pairs
A
B
C
D
Align the alignments as if each
of them was a single sequence
(replace with a single
consensus sequence or use a
profile)
16
Alignment of alignments
M Q T F
L H T W
L Q S W
X
Y
M
L
L
L
M
Q
H
Q
-
T
T
S
T
T
I
I
F
W
W
F
W
L T I F
M T I W
17
Input File
> Usually FASTA format
19
Input file
> Sulfolobus acidocaldarius gi|152927|gp|J03218|SSOATPMA_1
MVSEGRVVRVNGPLVIADGMREAQMFEVVYVSDLKLVGEITRIE
> Thermococcus sp. gi|2605627 ATPase alpha subunit
MGRIIRVTGPLVVADGMKGAKMYEVVRVGEMGLIGEIIRLEGDKAVIQVYEETAGIRPGE
PVEGTGSSLS
> Acetabularia acetabulum gi|1303673|gnl|PID|d1009732 adenosine triphosphatase A subunit
MSKAKEGDYGSIKKVSGPVVVADNMGGSAMYELVRVGTGELIGEIIRLEGDTATIQVYEE
TSGLTVGDGV
…
Output file
20
MSA: Problems
Sequences that are similar only in some smaller
regions may be misaligned because MSA tries to
find global, rather than local alignments.
Sequence that contains a large insertion compared to
the rest may be misaligned because MSA tries to
find global, rather than local alignments.
21
MSA: Problems
Sequence that contains a repetitive element
(such as a domain), while another sequence
only contains one copy.
VS
22
MSA: Problems
• Pairwise alignment is an optimal algorithm.
• Multiple alignment is not an optimal algorithm.
Better alignments might exist!
• The algorithm yields a possible alignment, but not
necessarily the best one.
23
MSA: Problems
• Because of the progressive
methodology, errors at the
beginning of the alignment are
more important than errors at
the end of the alignment.
24
Clustal: Advice
• Sequence weighting (Should we treat all the sequences as equals?)
• Position weighting (Should we treat all positions in the sequences
as equals?)
• Varying substitution matrices
• Residue-specific gap penalties and reduced penalties in hydrophilic
regions (external regions of protein sequences), encourage gaps in
loops rather than in core regions.
• Positions in early alignments where gaps have been opened receive
locally reduced gap penalties to encourage openings in subsequent
alignments
• Vary “gap opening penalty” and “gap extension penalty”
• Discourage too many gaps, too close to one another
• Take into account hydrophilic and hydrophobic structures
• Avoid divergent sequences, as they are the most difficult to align
25
Alignment of protein-coding
DNA sequences
• It is not very sensible to align the DNA sequences
of protein-coding genes.
ATGCTGTTAGGG
ATGCTCGTAGGG
ATGCT-GTTAGGG
ATGCTCGTA-GGG
The result might be implausible and might not reflect what is
known about biological processes.
It is much more sensible to translate the sequences into their
corresponding amino acid sequences, align these protein
sequences and then put the gaps in the DNA sequences according
to the amino acid alignment.
26
Effect of gap penalties on amino-acid alignment
Human pancreatic hormone precursor versus chicken
pancreatic hormone
(a) Penalty for gaps is 0
(b) Penalty for a gap of size k nucleotides is wk = 1 + 0.1k
(c) The same alignment as in (b), only the similarity between
the two sequences is further enhanced by showing pairs of
27
biochemically similar amino acids
Anchored alignment:
The anchored-alignment procedure uses expert knowledge
to improve multiple alignments. The user can specify a list
of anchor points, each of which consists of a pair of equallength segments that are to be aligned by the program.
If residue x from one of the input sequences is paired to residue
y from another sequence, then y is the only residue that can be
aligned to x, and vice versa.
All residues to the left and the right of x are aligned,
respectively, to the residues to the left and the right of y.
28
Consider, for example, the following example:
>seq1 WKKNADAPKRAMTSFMKAAY
>seq2 WNLDTNSPEEKQAYIQLAKDDRIRYD
>seq3 WRMDSNQKNPDSNNPKAAYNKGDANAPK
The non-anchored default version of DIALIGN would calculate
the following alignment for this input sequence set:
seq1 1 WKKNAD-----APKRamtsfmKAAY-----------seq2 1 WNLDTN-----SPEE------KQAYiqlaKDDriryd
seq3 1 WRMDSNqknpdSNNP------KAAYn---KGDsnapk
29
Now let's assume, the user has some expert knowledge about a
certain domain that is present in all of the input sequences; the
domains in the three sequences are thought to be homologous to
each other:
>seq1 WKKNADAPKRAMTSFMKAAY
>seq2 WNLDTNSPEEKQAYIQLAKDDRIRYD
>seq3 WRMDSNQKNPDSNNPKAAYNKGDANAPK
The user wants to define this motif as anchor and align the rest of
the sequences automatically, given the pre-defined constraints
imposed by this anchor.
Since anchor points are defined as pairs of equal-length segments,
we need two anchor points to enforce alignment of the above
motif.
30
For example, one could choose
Anchor point 1:
>seq1 WKKNADAPKRAMTSFMKAAY
>seq2 WNLDTNSPEEKQAYIQLAKDDRIRYD
>seq3 WRMDSNQKNPDSNNPKAAYNKGDANAPK
Anchor point 2:
>seq1 WKKNADAPKRAMTSFMKAAY
>seq2 WNLDTNSPEEKQAYIQLAKDDRIRYD
>seq3 WRMDSNQKNPDSNNPKAAYNKGDANAPK
If the above motif is to be aligned by our program, these two
anchor points need to be specified.
31
Format for user-defined anchor points:
To specify a set of anchor points, a file with the coordinates of
these anchor points is needed.
Since each anchor point corresponds to a equal-length
segment pair involving two of the input sequences,
coordinates for anchor points are defined as follows:
(1) first sequence involved
(2) second sequence involved
(3) start of anchor in first sequence
(4) start of anchor in second sequence
(5) length of anchor.
(6) specify a score of an anchor point.
This score is necessary to prioritize anchor point in case they
are inconsistent with each other, i.e., if not all of them can be
used simultaneously for the same alignment. the above
coordinates (in the above given order).
32
1
2
4
6
6
Thus, the above two anchor points are specified as follows:
1 2 4 6 6 4.5
2 3 6 11 8 1.3
4.5 and 1.3 are (arbitrary) scores
for anchor points 1 and 2.
33
MSA Approaches
• Progressive approach:
CLUSTALW (CLUSTALX)
T-COFFEE
PILEUP
• Iterative approach (Repeatedly realigned subsets
of sequences):
MultAlin, DiAlign
• Statistical Methods (e.g., Hidden Markov Models)
SAM2K
• Genetic algorithms
SAGA
34
T-coffee
• An MSA program
• Uses principles similar to Clustal
• Combining sequence alignment method
with structural alignment techniques
• More accurate but longer running time
• Limits to the number of sequences it can
align (~100)
35
MSA: Problems
Giddy Landan and Dan Graur. 2007. Heads or Tails: A Simple
Reliability Check for Multiple Sequence Alignments. Molecular Biology
and Evolution 24(6):1380-1383.
36
Genomic alignment (with MAUVE)