Transcript Document

Multiple Sequence Alignment
Multiple Sequence Alignment
VTISCTGSSSNIGAGNHVKWYQQLPG
VTISCTGTSSNIGSITVNWYQQLPG
LRLSCSSSGFIFSSYAMYWVRQAPG
LSLTCTVSGTSFDDYYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG
ATLVCLISDFYPGAVTVAWKADS
ATLVCLISDFYPGAVTVAWKADS
AALGCLVKDYFPEPVTVSWNSG-
VSLTCLVKGFYPSDIAVEWESNG-
• Goal: Bring the greatest number of similar characters into the same column of the
alignment. (Similar to alignment of two sequences)
• Definition: In an MSA, homologous residues among a set of sequences are
aligned together in columns. Homologous is meant in both the structural and
evolutionary sense. Ideally, a column of aligned residues occupy similar threedimensional structural positions and all diverge from a common ancestral residue.
• Except for trivial cases of highly identical sequences, it is not possible to
unambiguously identify structurally or evolutionarily homologous positions and
create a single correct multiple alignment.
Multiple Sequence Alignment: Motivation
•
•
•
•
•
Correspondence. Find out which parts “do the same thing”
– Similar genes are conserved across widely divergent species, often
performing similar functions
Structure prediction
– Use knowledge of structure of one or more members of a protein MSA
to predict structure of other members
– Structure is more conserved than sequence
Create “profiles” for protein families
– Allow us to search for other members of the family
Genome assembly: Automated reconstruction of “contig” maps of genomic
fragments such as ESTs
MSA is the starting point for phylogenetic analysis
Multiple Sequence Alignment: Approaches
•
Optimal Global Alignments -Dynamic programming
– Generalization of Needleman-Wunsch
– Find alignment that maximizes a score function
– Computationally expensive: Time grows as product of sequence lengths
•
Global Progressive Alignments - Match closely-related sequences first using a
guide tree
•
Global Iterative Alignments - Multiple re-building attempts to find best alignment
•
Local alignments
– Profiles, Blocks, Patterns
Sum-of-Pairs/SP Score Function
Score of multiple alignment
= ∑i <j score(Si,Sj)
where
score(Si,Sj) = score of induced pair-wise alignment
QuickTime™ and a
decompressor
are needed to see this picture.
Materials & Methods
•
•
•
Group of related proteins & how to find them: Protein sequences can be related
by homology (common ancestor) or convergence (proteins independently evolve to
have common sequence features that typically perform a common function or have a
common structure).
How many sequences are needed?: More is better.
Redundancy: Redundant sequences will bias the alignment toward their own
features. A good rule of thumb in cases of varied similarities is to have only a single
sequence from each set with more than 70-80% intra-sequence similarity.
MSA: Different Methods
•
•
•
•
•
Exact methods
Progressive (ClustalW)
Iterative (MUSCLE)
Consistency (ProbCons)
Structure-based (Expresso)
Progressive Multiple Sequence
Alignment
• Compare all sequences pairwise.
• Perform cluster analysis on the pairwise data to generate a hierarchy
for alignment. This may be in the form of a binary tree or a simple
ordering
• Build the multiple alignment by first aligning the most similar pair of
sequences, then the next most similar pair and so on. Once an
alignment of two sequences has been made, then this is fixed. Thus
for a set of sequences A, B, C, D having aligned A with C and B with D
the alignment of A, B, C, D is obtained by comparing the alignments of
A and C with that of B and D using averaged scores at each aligned
position.
Steps in Multiple Alignment
CLUSTALW MSA
MSA of four oxidoreductase NAD binding domain protein sequences.
Red: AVFPMILW. Blue: DE. Magenta: RHK. Green: STYHCNGQ. Grey:
all others. Residue ranges are shown after sequence names.
Multiple sequence alignment methods
Iterative methods: compute a sub-optimal solution and
keep modifying that intelligently using dynamic
programming or other methods until the solution
converges.
Examples: MUSCLE, IterAlign, Praline, MAFFT
MUSCLE: next-generation progressive MSA
[1] Build a draft progressive alignment
•
Determine pairwise similarity through k-mer
counting (not by alignment)
•
Compute distance (triangular distance) matrix
•
Construct tree using UPGMA
•
Construct draft progressive alignment following
tree
Page 191
MUSCLE: next-generation progressive MSA
[2] Improve the progressive alignment
•
Compute pairwise identity through current MSA
•
Construct new tree with Kimura distance measures
•
Compare new and old trees: if improved, repeat this
step, if not improved, then we’re done
Page 191
MUSCLE: next-generation progressive MSA
[3] Refinement of the MSA
•
Split tree in half by deleting one edge
•
Make profiles of each half of the tree
•
Re-align the profiles
•
Accept/reject the new alignment
MUSCLE: next-generation progressive MSA
[3] Refinement of the MSA
•
Split tree in half by deleting one edge
•
Make profiles of each half of the tree
•
Re-align the profiles
•
Accept/reject the new alignment
Iterative approaches: MAFFT
•Uses Fast Fourier Transform to speed up profile
alignment
•Uses fast two-stage method for building alignments
using k-mer frequencies
•Offers many different scoring and aligning techniques
•One of the more accurate programs available
•Available as standalone or web interface
•Many output formats, including interactive phylogenetic
trees
Page 190
MSA: Consistency Approach
•
Consistency-based algorithms: generally use a
database of both local high-scoring alignments and
long-range global alignments to create a final
alignment
•
These are very powerful, very fast, and very
accurate methods
•
Examples: T-COFFEE, Prrp, DiAlign, ProbCons
Page 192
ProbCons—consistency-based approach
•Combines iterative and progressive approaches with
a unique probabilistic model.
•Uses Hidden Markov Models to calculate probability
matrices for matching residues, uses this to construct
a guide tree
•Progressive alignment hierarchically along guide tree
•Post-processing and iterative refinement (a little like
MUSCLE)
Structure Based MSA
•

Structure based alignment is more informative and give us more efficient information
about Proteins.
o 3D structure is more conserved then simple amino acid sequence.
o Tertiary protein structure evolves more slowly than primary structure. So it is
possible to improve the accuracy of alignment by including information about the
three dimensional structure of protein.
The structure of Protein is very complex. So Protein structural alignment is more time
consuming and complicated.
Structure Based MSA: 3D-Coffee and Expresso
•
•
3D-Coffee is a special mode of T-Coffee that uses structural information. In 3DCoffee the original sequences are replaced by homologous structures (templates).
The templates are aligned using strutural aligners like sap or TMalign and these
structure based alignments are used as a template for realigning the original
sequences. Provided the right templates have been found, the result is a structure
based multiple sequence alignment. If some of the sequences do not have any
template, they are treated like regular sequences and aligned with the appropriate
methods.
Expresso is an extension of 3D-Coffee where the structural templates are
automatically identified with BLAST.
QuickTime™ and a
decompressor
are needed to see this picture.
Structure Based MSA: 3D-Coffee and Expresso
Protein Data Bank accession numbers
Two Extremes of MSA
• Recently diverged sequences from each other did not change much.
Hence these will be very similar across the length. So, we cannot tell
whether distant regions that are conserved due to their importance
from regions that did not have time to diverge.
• Sequences that are very diverged from each other (or perhaps are
not related to begin with) and have no similar sequence regions. It
still is possible that the proteins are related and have corresponding
regions, but we cannot identify these by their sequence. This can
change if we have more data (more sequences and/or experimental
data) or use better alignment methods.