Transcript Slide 1

Chapter 5
Multiple Sequence Alignment
•Multiple alignment is an extension of pairwise alignment where
multiple sequences are aligned
•This alignment provides insights not possible in pairwise alignments,
such as
•Conserved sequence patterns
•Conserved and functionally critical amino acid residues
•Prerequisite for phylogenetic analyses
•Prediction of protein secondary and tertiary structures
•Design of degenerate PCR primers
Scoring Function
•The purpose of multiple alignment is to line up sequences so that
a maximum number of residues from each sequence are matched
according to a scoring function
•The scoring function is generally based on “sum of pairs” (SP)
•The SP score of a column is the sum of the scores of all residue pairs in
that column; the alignment score is the sum over all columns
BLOSUM62 substitution matrix

    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   9  -1  -1  -3   0  -3  -3  -3  -4  -3  -3  -3  -3  -1  -1  -1  -1  -2  -2  -2
S  -1   4   1  -1   1   0   1   0   0   0  -1  -1   0  -1  -2  -2  -2  -2  -2  -3
T  -1   1   4   1  -1   1   0   1   0   0   0  -1   0  -1  -2  -2  -2  -2  -2  -3
P  -3  -1   1   7  -1  -2  -2  -1  -1  -1  -2  -2  -1  -2  -3  -3  -2  -4  -3  -4
A   0   1  -1  -1   4   0  -2  -2  -1  -1  -2  -1  -1  -1  -1  -1   0  -2  -2  -3
G  -3   0   1  -2   0   6   0  -1  -2  -2  -2  -2  -2  -3  -4  -4  -3  -3  -3  -2
N  -3   1   0  -1  -1  -2   6   1   0   0   1   0   0  -2  -3  -3  -3  -3  -2  -4
D  -3   0   1  -1  -2  -1   1   6   2   0   1  -2  -1  -3  -3  -4  -3  -3  -3  -4
E  -4   0   0  -1  -1  -2   0   2   5   2   0   0   1  -2  -3  -3  -2  -3  -2  -3
Q  -3   0   0  -1  -1  -2   0   0   2   5   0   1   1   0  -3  -2  -2  -3  -1  -2
H  -3  -1   0  -2  -2  -2  -1  -1   0   0   8   0  -1  -2  -3  -3  -3  -1   2  -2
R  -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5   2  -1  -3  -2  -3  -3  -2  -3
K  -3   0   0  -1  -1  -2   0  -1   1   1  -1   2   5  -1  -3  -2  -2  -3  -2  -3
M  -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5   1   2   1   0  -1  -1
I  -1  -2  -2  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4   2   3   0  -1  -3
L  -1  -2  -2  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4   1   0  -1  -2
V  -1  -2  -2  -2  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   4  -1  -1  -3
F  -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6   3   1
Y  -2  -2  -2  -3  -2  -3  -2  -3  -2  -1   2  -2  -2  -1  -1  -1  -1   3   7   2
W  -2  -3  -3  -4  -3  -2  -4  -4  -3  -2  -2  -3  -3  -1  -3  -2  -3   1   2  11
Sequence 1:  G K N
Sequence 2:  T R N
Sequence 3:  S H E

Column 1:  G:T = 1,  T:S = 1,  G:S = 0    Total: 2
Column 2:  K:R = 2,  R:H = 0,  K:H = -1   Total: 1
Column 3:  N:N = 6,  N:E = 0,  N:E = 0    Total: 6

SP score = 2 + 1 + 6 = 9
Thus 2^9 = 512 times more likely
than by random chance
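
The column-by-column arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the pair scores are copied from the worked example on the slide rather than loaded from a matrix file, and the names `PAIR_SCORES`, `pair_score`, and `column_scores` are illustrative only.

```python
from itertools import combinations

# Pair scores copied from the slide's worked example (order-independent keys).
PAIR_SCORES = {
    frozenset("GT"): 1, frozenset("TS"): 1, frozenset("GS"): 0,
    frozenset("KR"): 2, frozenset("RH"): 0, frozenset("KH"): -1,
    frozenset("N"): 6, frozenset("NE"): 0,   # frozenset("N") is the N:N pair
}

def pair_score(a, b):
    """Look up the substitution score for one residue pair."""
    return PAIR_SCORES[frozenset((a, b))]

def column_scores(alignment):
    """Sum of pairwise scores for each column of a gap-free alignment."""
    return [
        sum(pair_score(a, b) for a, b in combinations(column, 2))
        for column in zip(*alignment)
    ]

alignment = ["GKN", "TRN", "SHE"]
per_column = column_scores(alignment)
sp_total = sum(per_column)
print(per_column, sp_total)
```

A real implementation would also need a rule for scoring gaps, which this three-column example does not contain.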
Exhaustive Algorithms
Brute Force Algorithm
Similar to the dynamic programming used for pairwise alignment: searches
for the best solution by examining every possible solution
Pairwise alignment uses a 2D matrix
For N sequences, use an N-dimensional matrix
Number of calculations increases exponentially with the number of sequences
(on the order of L×L×L×… = L^N for N sequences of length L)
Generally only useful for <=10 short sequences
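
To get a feel for why the exhaustive approach stops being practical, the size of the N-dimensional matrix can be tabulated. This is a rough sketch that assumes all sequences have the same length; `dp_cells` and `predecessors_per_cell` are hypothetical helper names.

```python
def dp_cells(num_sequences, length):
    """Cells in a full N-dimensional DP matrix (one extra row per axis for the gap origin)."""
    return (length + 1) ** num_sequences

def predecessors_per_cell(num_sequences):
    """Each cell is reached from 2^N - 1 neighbours: every non-empty subset of
    sequences can contribute a residue while the rest contribute a gap."""
    return 2 ** num_sequences - 1

for n in (2, 3, 5, 10):
    print(n, dp_cells(n, 100), predecessors_per_cell(n))
```

For sequences of length 100, going from 2 to 3 sequences already raises the cell count from about ten thousand to about a million.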
Divide and Conquer Alignment (DCA)
Identify regional similarities in multiple sequences
Do a brute force alignment of the similar regions
Join the independently aligned regions
http://bibiserv.techfak.uni-bielefeld.de/dca/
Heuristic Algorithm
Progressive Alignment Method
•Pairwise alignment by Needleman-Wunsch of all pairs
•Records similarity scores of aligned pairs
•Scores entered into matrix
•Guide tree constructed that reflects similarity between aligned pairs
•Most closely related sequences re-aligned with Needleman-Wunsch
•Different substitution matrices are selected depending on
evolutionary distance between sequences to be aligned
•Aligned pair converted to “consensus sequence” with fixed gaps
•Consensus sequences treated as ordinary sequence for next step
which is pairwise alignment with most related sequence in guide tree
•Next “consensus sequence” is calculated and process repeated until
all sequences are aligned
•Most famous: ClustalW (command line) and ClustalX (GUI)
•http://www.ebi.ac.uk/Tools/clustalw2/index.html
Download and install ClustalW from
ftp://ftp.ebi.ac.uk/pub/software/clustalw2/2.0.9/
Spend a few minutes entering sequences and doing alignments
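
The first stages of the progressive method (all-against-all pairwise scores, then a guide tree) can be sketched as follows. This is a simplified illustration, not ClustalW's actual procedure: it assumes a toy match/mismatch/gap scheme instead of a substitution matrix, and it uses greedy average-linkage clustering as a stand-in for the real tree-building step. The function names are hypothetical.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score by Needleman-Wunsch (score only, linear gap cost)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def guide_order(seqs):
    """Return the order in which clusters of sequence indices would be merged."""
    score = {frozenset((i, j)): nw_score(seqs[i], seqs[j])
             for i, j in combinations(range(len(seqs)), 2)}

    def average(c1, c2):
        # Mean pairwise score between two clusters.
        total = sum(score[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    clusters = [(i,) for i in range(len(seqs))]
    merges = []
    while len(clusters) > 1:
        # Merge the most similar pair of clusters first.
        c1, c2 = max(combinations(clusters, 2), key=lambda p: average(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges

seqs = ["GKNTRS", "GKNTRA", "SHEWWW"]
merges = guide_order(seqs)
print(merges)
```

The merge order is what drives the later steps: the closest pair is aligned first, and the remaining sequences are added in order of decreasing similarity.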
•ClustalW uses gap penalties that are context-sensitive:
•Gaps are penalized more heavily close to runs of hydrophobic amino acids
(more likely to be in internal conserved regions of a protein)
than next to hydrophilic regions or G, which are likely to be on the
outside in loops
•Weighting scheme: closely related sequences are given a lower
weighting score
•The weighting score depends on the branch lengths in the guide tree,
with each branch length divided by the number of sequences sharing that branch
•This has the effect of minimizing a possible dominating effect of
common sequences
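
A small sketch of the weighting idea, under the assumption that a sequence's weight is the sum of the branch lengths on its path to the root, with each branch length divided by the number of sequences that share it. The tree, its branch lengths, and the names `TREE`, `LEAVES`, and `sequence_weights` are made up for illustration.

```python
# Hypothetical rooted guide tree: node -> (parent, branch length to parent).
TREE = {
    "A": ("AB", 0.1),
    "B": ("AB", 0.1),
    "AB": ("root", 0.3),
    "C": ("root", 0.5),
}
LEAVES = ["A", "B", "C"]

def path_to_root(node):
    """Yield (branch, length) for each branch between a node and the root."""
    while node in TREE:
        parent, length = TREE[node]
        yield node, length
        node = parent

def sequence_weights():
    # Count how many leaves lie below each branch.
    shared = {}
    for leaf in LEAVES:
        for branch, _ in path_to_root(leaf):
            shared[branch] = shared.get(branch, 0) + 1
    # Weight = sum of (branch length / number of leaves sharing the branch).
    return {
        leaf: sum(length / shared[branch] for branch, length in path_to_root(leaf))
        for leaf in LEAVES
    }

weights = sequence_weights()
print(weights)
```

In this toy tree the two close relatives A and B each end up with half the weight of the more distant sequence C, so together they count about as much as C alone.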
Drawbacks and Solutions
•Based on global alignment – thus only sequences of similar length
can be aligned
•Long gaps, required to align sequences of dissimilar length, are
heavily penalized
•“Greedy” algorithm – once gaps are introduced, they stay in
subsequent consensus sequences
T-Coffee
•Tree-based Consistency Objective Function for alignment Evaluation
•http://www.ebi.ac.uk/Tools/t-coffee/
•http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi
•Performs global alignment with clustal
•Local pairwise alignment with Lalign
•Global and ten best local alignments are pooled to form a library
•All pairwise alignments are then aligned with a third possible sequence
•Distance matrix calculated to build a guide tree
•Guide tree used for final multiple alignment
•Does not get “stuck” in sub-optimal initial alignments
•Slower than clustal
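
The step where pairwise alignments are checked against a third sequence can be illustrated with a toy library. This sketch assumes the commonly described consistency rule: the weight of a residue pair is its direct weight plus, for every residue in a third sequence aligned to both, the smaller of the two connecting weights. The library contents and the names `primary` and `extended_weight` are invented for the example.

```python
# Hypothetical primary library: weight for aligning two residues,
# each residue identified as (sequence name, position).
primary = {
    (("A", 0), ("B", 0)): 80,
    (("A", 0), ("C", 1)): 70,
    (("C", 1), ("B", 0)): 90,
    (("A", 1), ("B", 2)): 60,
}

def symmetric(lib):
    """Make the library lookup order-independent."""
    out = {}
    for (x, y), w in lib.items():
        out[(x, y)] = w
        out[(y, x)] = w
    return out

def extended_weight(lib, x, y):
    """Direct weight of (x, y) plus support from residues of other sequences."""
    sym = symmetric(lib)
    total = sym.get((x, y), 0)
    # Residues z in a third sequence that are aligned to x somewhere in the library.
    thirds = {z for (a, z) in sym if a == x and z[0] not in (x[0], y[0])}
    for z in thirds:
        if (z, y) in sym:
            total += min(sym[(x, z)], sym[(z, y)])
    return total

print(extended_weight(primary, ("A", 0), ("B", 0)))
print(extended_weight(primary, ("A", 1), ("B", 2)))
```

Pairs that are supported by a third sequence gain weight, while unsupported pairs keep only their direct weight, which is what lets the final progressive alignment avoid early mistakes.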
dbClustal
•First performs BLASTP search for a query sequence
•Aligned pairs are analyzed to obtain anchor points (local conserved
regions) using a program called Ballast
•Global alignment generated by Clustal, weighted toward the anchor points
•Initial local alignment minimizes errors in divergent sequences
•Multiple alignment subsequently evaluated by NorMD which removes
poorly aligned sequences
•http://bips.u-strasbg.fr/PipeAlign/jump_to.cgi?DbClustal+noid
Partial Order Alignment (POA)
•http://bioinformatics.ucla.edu/poa/
•Multiple alignments performed on more and more sequences
from a list
•Identical residues condensed to nodes
•Each new sequence aligned with each sequence of the graph
model
•Eliminates the problem of error fixation
•Faster and more accurate than clustal
PRALINE
•http://zeus.cs.vu.nl/programs/pralinewww/
•Builds profiles of sequences to be aligned
•Profiles generated by PSI-BLAST
•Because profiles contain information on close relatives, divergent
sequences are more accurately aligned
•Program can incorporate secondary protein structure
•Very sophisticated but very slow
Iterative Alignment
PRRN
•Find optimal solution by iteratively modifying sub-optimal solutions
•http://prrn.ims.u-tokyo.ac.jp/
•Multiple alignment is performed on whole group of sequences
•Sequences randomly distributed into two groups
•Dynamic programming applied to consensus sequences derived
from each group
•The random split is repeated and another round of dynamic
programming alignment performed
•This is repeated until the alignment score no longer increases
•A multiple alignment of the sequences is then performed again
•Process repeated until multiple alignment score no longer improves
Iterative Alignment
DIALIGN2
•http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=dialign
•Breaks all sequences down into segments, and performs alignment
between segments
•High-scoring segments are progressively assembled into larger and
larger sequences
•The score of an alignment is calculated from the block and not from
individual residues
•Sequence regions between blocks are left unaligned
•Well suited to alignment of divergent sequences
Practical Issues
•DNA alignments are only based on 4 nucleotides, and are less reliable
than protein sequence alignments
•Alignments of DNA sequences do not consider functional issues,
such as gene boundaries
•Insertion of gaps may “break” codons or cause frameshifts that would not
be tolerated in the protein and are functional nonsense
•Thus, it is always better to align protein sequences
•Possible to convert DNA to amino acid sequence, then align, and then
decode back to DNA
•RevTrans (http://www.cbs.dtu.dk/services/RevTrans/)
•PROTA2DNA (missing link…)
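
The "decode back to DNA" step can be sketched as threading the original codons onto the aligned protein, so that every protein gap becomes a three-nucleotide gap and no codon is broken. This is a minimal illustration, not the method used by the tools listed above; `thread_codons` is a hypothetical name, and the sketch assumes the DNA is in frame with no stop codon or introns.

```python
def thread_codons(aligned_protein, dna):
    """Map an aligned protein (with '-' gaps) back onto its coding DNA, codon by codon."""
    n_residues = sum(1 for c in aligned_protein if c != "-")
    if len(dna) != 3 * n_residues:
        raise ValueError("DNA length must be three times the number of residues")
    # Hand out codons in order; a protein gap becomes a three-nucleotide gap.
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join("---" if c == "-" else next(codons) for c in aligned_protein)

aligned_dna = thread_codons("MK-L", "ATGAAACTG")
print(aligned_dna)
```

Because gaps are only ever inserted in multiples of three at codon boundaries, the resulting DNA alignment cannot contain a frameshift.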
Editing and Format
•Most alignment programs require final editing by a human to ensure
that there are no problems in functionality
•Finding badly aligned regions
•Removing nonsensical gaps etc.
•http://www.mbio.ncsu.edu/bioEdit/bioedit.html
•Need to convert one sequence format to another:
http://iubio.bio.indiana.edu/cgi-bin/readseq.cgi/
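
As an illustration of what a format conversion involves, the sketch below turns a FASTA alignment into strict sequential PHYLIP (a count line, then one line per sequence with the name padded to 10 characters). It is a simplified, assumption-laden example rather than a replacement for a dedicated converter; `parse_fasta` and `to_phylip` are hypothetical names, and names longer than 10 characters are simply truncated.

```python
def parse_fasta(text):
    """Read FASTA text into a dict of name -> sequence (insertion order kept)."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif line and name is not None:
            records[name] += line
    return records

def to_phylip(records):
    """Write aligned sequences as strict sequential PHYLIP text."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    lines = [f" {len(records)} {lengths.pop()}"]
    lines += [f"{name[:10]:<10}{seq}" for name, seq in records.items()]
    return "\n".join(lines)

fasta = ">s1\nGK-N\n>s2\nTRNN\n"
phylip = to_phylip(parse_fasta(fasta))
print(phylip)
```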