Multiple Alignment

Stuart M. Brown
NYU School of Medicine
Pairwise Alignment


The alignment of two sequences
(DNA or protein) is a relatively
straightforward computational
problem.
The standard solution is an
approach called Dynamic
Programming.
Dynamic Programming


Dynamic Programming is a very general
programming technique.
It is applicable when a large search space
can be structured into a succession of
stages, such that:
 the initial stage contains trivial solutions to
sub-problems
 each partial solution in a later stage can
be calculated by recurring on a fixed number
of partial solutions in an earlier stage
 the final stage contains the overall
solution
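As a minimal sketch of how these three properties map onto sequence alignment, the Python function below fills a global alignment score table. The function name and the scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions, not the parameters of any GCG program.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initial stage: trivial sub-problems (a prefix aligned against nothing).
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Later stages: each cell depends on exactly three earlier cells.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Final stage: the bottom-right cell holds the overall solution.
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))
```

The table has (len(a) + 1) x (len(b) + 1) cells, so the work grows with the product of the two sequence lengths.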
Global vs. Local
Alignments


Global alignment algorithms start
at the beginning of two sequences
and add gaps to each until the end
of one is reached.
Local alignment algorithms find
the region (or regions) of highest
similarity between two sequences
and build the alignment outward
from there.
GAP




The GCG program GAP implements the Needleman
and Wunsch Global alignment algorithm.
Global algorithms are often not effective for highly
diverged sequences and do not reflect the biological
reality that two sequences may only share limited
regions of conserved sequence.
Sometimes two sequences may be derived from
ancient recombination events where only a single
functional domain is shared.
GAP is useful when you want to force two sequences
to align over their entire length
BESTFIT



The GCG program BESTFIT implements
the Smith-Waterman local alignment
algorithm.
FASTA and BLAST are also local alignment
programs, but they use faster heuristic methods
NCBI has a “BLAST 2 Sequences”
feature on its website:
http://www.ncbi.nlm.nih.gov/gorf/bl2.html
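For comparison with global alignment, here is a minimal sketch of the Smith-Waterman idea in Python. It is not the BESTFIT implementation; the function name and the scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for illustration.

```python
def local_alignment(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # Flooring at zero lets an alignment start anywhere.
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + pair,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(local_alignment("TTTTGATTACATTTT", "CCCGATTACACCC"))
```

Only the shared region is reported; the unrelated flanking residues never enter the alignment, which is the behavior that distinguishes a local method from a global one.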
Pairwise Alignment
on the Web

The ALIGN global alignment program is
available at several servers:
http://molbiol.soton.ac.uk/compute/align.html
http://www2.igh.cnrs.fr/bin/align-guess.cgi

LALIGN local alignment program is
available at several servers:
http://www2.igh.cnrs.fr/bin/lalign-guess.cgi
http://www.ch.embnet.org/software/LALIGN_form.html

LFASTA uses FASTA for local alignment of
2 sequences:
http://pbil.univ-lyon1.fr/lfasta.html

BLAST 2 Sequences (NCBI)
http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html
Multiple Alignments


In theory, making an optimal alignment
between two sequences is
computationally straightforward (Smith-Waterman
algorithm), but aligning a
large number of sequences using the
same method is almost impossible.
The problem increases exponentially
with the number of sequences involved
(the search space is the product of the sequence lengths)
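A rough illustration of that growth, assuming the exact method needs one table cell for every combination of positions across all the sequences (the function name and the length of 300 residues are hypothetical):

```python
from math import prod


def dp_cells(lengths):
    """Number of cells in a full dynamic-programming table for these sequences."""
    return prod(n + 1 for n in lengths)


# Table size for increasing numbers of 300-residue sequences.
for n_seqs in (2, 3, 5, 10):
    print(n_seqs, "sequences:", dp_cells([300] * n_seqs), "cells")
```

Each additional sequence multiplies the table size by its length plus one, which is why exact multiple alignment quickly becomes impractical.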
Optimal Alignment


For a given group of sequences,
there is no single "correct"
alignment, only an alignment that
is "optimal" according to some set
of calculations.
Determining what alignment is best
for a given set of sequences is
really up to the judgement of the
investigator.
Progressive Pairwise
Methods


Most of the available multiple
alignment programs use some sort
of incremental or progressive
method that makes pairwise
alignments, then adds new
sequences one at a time to these
aligned groups.
This is an approximate method!
PILEUP


PILEUP is the multiple alignment
program in the GCG package
CLUSTAL is another popular
program (also available on the RCR server)
that uses a similar algorithm.
The PILEUP Algorithm




First, PILEUP calculates approximate pairwise
similarity scores between all sequences to be
aligned, and they are clustered into a
dendrogram (tree structure).
Then the most similar pairs of sequences are
aligned.
Averages (similar to consensus sequences)
are calculated for the aligned pairs.
New sequences and clusters of sequences are
added one by one, according to the branching
order in the dendrogram.
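The clustering step can be sketched in Python as follows. This is not PILEUP's code: the similarity measure (fraction of identical positions), the average-linkage rule, and all function names are simplifying assumptions used only to show how a joining order falls out of pairwise scores.

```python
from itertools import combinations


def similarity(a, b):
    """Crude stand-in score: fraction of identical positions."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


def cluster_similarity(c1, c2, pair):
    """Average pairwise similarity between the members of two clusters."""
    total = sum(pair[frozenset((x, y))] for x in c1 for y in c2)
    return total / (len(c1) * len(c2))


def join_order(seqs):
    """Return the merges in the order a progressive aligner would perform them.

    seqs maps a sequence name to its residues; clusters are tuples of names.
    """
    pair = {frozenset((x, y)): similarity(seqs[x], seqs[y])
            for x, y in combinations(seqs, 2)}
    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        # Pick the two most similar clusters and join them.
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda cc: cluster_similarity(cc[0], cc[1], pair))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


fragments = {
    "Hsirf2": "SERPSKKGKK",
    "Muirf2": "SERPSKKGKK",
    "Muirf1": "LTRNQRKERK",
    "Ratirf1": "LTKNQRKERK",
}
for left, right in join_order(fragments):
    print(left, "+", right)
```

The list of merges plays the role of the dendrogram: the closest pairs are joined first, and the resulting clusters are joined to each other afterwards.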
PILEUP Considerations




Since the alignment is calculated on a
progressive basis, the order of the initial
sequences can affect the final alignment.
PILEUP parameters: 2 gap penalties (gap
insert and gap extend) and an amino acid
comparison matrix.
PILEUP will refuse to align sequences that
require too many gaps or mismatches.
PILEUP will take quite a while to align more
than about 10 sequences.
Instructions for
running PILEUP



PILEUP uses a list of sequence files
as input
You can use output from a FASTA or
LOOKUP search as a list or make your
own list in a text editor
A list file can include files from your
own directory and/or GCG database
files.
LIST file format

List files always begin with two dots ..
..
gp:S31321
gp:Yno3_Yeast
S51900.pep
Yan2_Schpo
Ypd1_Caeel
A36205
Mpp1_Rat begin:100 end:345
B46665.pep
Ymxg_Bacsu begin:150 end:464
A48043.pep

List files can also include Begin and End
positions within a sequence
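If you build list files with a script, a small reader makes it easy to check them before running PILEUP. The sketch below assumes only the layout shown above (a line containing just two dots, then one entry per line with optional begin: and end: attributes); the function name is hypothetical and other GCG list-file features are not handled.

```python
def parse_list_file(text):
    """Parse list-file entries into dictionaries with name, begin and end."""
    entries = []
    seen_divider = False
    for line in text.splitlines():
        line = line.strip()
        if not seen_divider:
            # Skip everything up to and including the ".." line.
            seen_divider = (line == "..")
            continue
        if not line:
            continue
        name, *attrs = line.split()
        entry = {"name": name, "begin": None, "end": None}
        for attr in attrs:
            key, _, value = attr.partition(":")
            if key.lower() in ("begin", "end") and value.isdigit():
                entry[key.lower()] = int(value)
        entries.append(entry)
    return entries


example = """..
gp:S31321
Mpp1_Rat begin:100 end:345
A48043.pep
"""
for entry in parse_list_file(example):
    print(entry)
```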
PILEUP @myseqs.list



Now at the > prompt, type PILEUP
and the name of the file that is your
list of sequence names.
However, GCG requires that you
precede the name of your list
file with the @ character.
So the command looks like this:
> PILEUP @myseqs.list
PILEUP Output
> more myseqs.msf
          1501                                              1550
Hsirf2    SERPSKKGKK PKTEKEDKVK HIKQEPVESS LGLSNGVSDL SPEYAVLTST
Muirf2    SERPSKKGKK PKTEKEERVK HIKQEPVESS LGLSNGVSGF SPEYAVLTSA
Chirf2    SERPSKKGKK TKSEKDDKFK QIKQEPVESS FGI.NGLNDV TSDY.FLSSS
Muirf1    LTRNQRKERK SKSSRDTKSK TKRKLCGDVS PDTFS..DGL SSSTLPDDHS
Ratirf1   LTKNQRKERK SKSSRDTKSK TKRKLCGDSS PDTLS..DGL SSSTLPDDHS
Hsirf1    LTKNQRKERK SKSSRDAKSK AKRKSCGDSS PDTFS..DGL SSSTLPDDHS
Chkirf1a  LTKDQKKERK SKSSREARNK SKRKLYEDMR MEESA..ERL TSTPLPDDHS
Hsirf3a   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Mmuirf3   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Hsirf5    GPAPTDSQPP EDYSFGAGEE EEEEEELQRM LPSLSLTDAV QSGPHMTPYS
Mmuirf6   IPQPQGS.VI NPGSTGSAPW DEKDNDVDED EEEDELEQSQ HHVPIQDTFP
Hump48    ...PPGIVSG QPGTQKVPSK RQHSSVSSER KEEEDAMQNC TLSPSVLQDS
Mup48     ...PAGTLPN QPRNQKSPCK RSISCVSPER EEN...MENG RTNGVVNHSD
Hsirf4    ...PEGAKKG AKQLTLEDPQ MSMSHPYTMT TPYPSLPA.Q VHNYMMPPLD
Mupip     ...PEGAKKG AKQLTLDDTQ MAMGHPYPMT APYGSLPAQQ VHNYMMPPHD
Huicsbp   ...PEEDQK. .......... .......... CKLGVATAGC VNEVTEMECG
Muicsbp   ...PEEEQK. .......... .......... CKLGVAPAGC MSEVPEMECG
Chkicsbp  ...PEEEQK. .......... .......... CKIGVGNGSS LTDVGDMDCS

          1551                                              1600
Hsirf2    IKNEVDSTVN IIVVGQSHLD SNIENQEIVT NPPDICQVVE VTTESDEQPV
Muirf2    IKNEVDSTVN IIVVGQSHLD SNIEDQEIVT NPPDICQVVE VTTESDDQPV
Chirf2    IKNEVDSTVN IVVVGQPHLD GSSEEQVIVA NPPDVCQVVE VTTESDEQPL
Muirf1    SYTTQGYLGQ DLDMER.DIT PALSPCVVSS SLSEWHMQMD I.IPDSTTDL
Ratirf1   SYTAQGYLGQ DLDMDR.DIT PALSPCVVSS SLSEWHMQMD I.MPDSTTDL
Hsirf1    SYTVPGYM.Q DLEVEQ.ALT PALSPCAVSS TLPDWHIPVE V.VPDSTSDL
Chkirf1a  SYTAHDYTGQ EVEVENTSIT LDLSSCEVSG SLTDWRMPME IAMADSTNDI
Hsirf3a   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Mmuirf3   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Hsirf5    LLKEDVKWPP TLQPPTLQPP VVLGPPAPDP SPLAPPPGNP AGFRELLSEV
Mmuirf6   FL........ NINGSPMAPA SVGNCSVGNC SPESVWP... ......KTEP
Hump48    LNNEEEGASG GAVHSDIGSS SSSSSPEPQE VTDTTEAPFQ ........GD
Mup48     SGSNIGGGGN GSNRSD...S NSNCNSELEE GAGTTEATIR ........ED
Hsirf4    RSWRDYVPDQ PHPEIPYQCP MTFGPRGHHW QGPACENGCQ VTGTFYACAP
Mupip     RSWRDYAPDQ SHPEIPYQCP VTFGPRGHHW QGPSCENGCQ VTGTFYACAP
Huicsbp   RSEIDELIKE .PSVDDYMGM IKRSPSP... P.DACRS..Q LLPDWWAHEP
Muicsbp   RSEIEELIKE .PSVDEYMGM TKRSPSP... P.EACRS..Q ILPDWWVQQP
Chkicsbp  PSAIDDLMKE PPCVDEYLGI IKRSPSP... PQETCRN..P PIPDWWMQQP
PILEUP options


For a first try, take the default options,
but give the output file a meaningful
name.
If you don’t get a good alignment, try a
less stringent matrix and/or gap
penalties.
> PILEUP -matr=oldpep.cmp

It is a good idea to run PILEUP in batch
mode if you have more than 10
sequences to align:
> PILEUP -bat
CLUSTAL

CLUSTAL is a stand-alone (i.e. not
integrated into GCG) multiple alignment
program that is superior in some respects to
PILEUP
Gap penalties can be adjusted based on
specific amino acid residues, regions of
hydrophobicity, proximity to other gaps, or
secondary structure.
 It can re-align just selected sequences or
selected regions in an existing alignment.
 It can compute phylogenetic trees from a set
of aligned sequences.


There are also Mac and PC versions with a
nice graphical interface (CLUSTALX).
Using CLUSTAL


On mcrcr0 type: clustal
CLUSTAL can only work with
sequences in multi-sequence
FASTA format.
 The GCG program TOFASTA can
convert lists of file names into FASTA
multi-sequence format.
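If your sequences are already in memory rather than in GCG files, a multi-sequence FASTA file is easy to produce by hand. The Python sketch below is not a replacement for TOFASTA and does not read GCG formats; the function name and the 60-residue line width are assumptions.

```python
def to_multi_fasta(records, width=60):
    """Format (name, sequence) pairs as one multi-sequence FASTA string."""
    lines = []
    for name, seq in records:
        lines.append(">" + name)  # header line for this sequence
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])  # wrap residues at a fixed width
    return "\n".join(lines) + "\n"


print(to_multi_fasta([("Hsirf2", "SERPSKKGKK"), ("Muirf1", "LTRNQRKERK")]))
```

Each sequence gets its own header line starting with ">", and all sequences go into a single file.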
Multiple Alignment
tools on the Web



There are a variety of multiple
alignment tools available for free on
the web.
CLUSTAL is available from a number
of sites (with a variety of restrictions)
Other algorithms are available too.
 Watch out for “experimental” algorithms;
there may be a good reason why you have
never heard of some oddball program.
Some URLs

EMBL-EBI
http://www.ebi.ac.uk/clustalw/

BCM Search Launcher: Multiple
Alignment
http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html

Multiple Sequence Alignment for
Proteins (Wash. U. St. Louis)
http://www.ibc.wustl.edu/service/msa/
Editing Multiple
Alignments



There are a variety of tools that can be
used to modify a multiple alignment.
These programs can be very useful in
formatting and annotating an alignment
for publication.
An editor can also be used to make
modifications by hand to improve
biologically significant regions in a
multiple alignment created by one of
the automated alignment programs.
GCG alignment
editors



Alignments produced with PILEUP
(or CLUSTAL) can be adjusted
with LINEUP.
Nicely shaded printouts can be
produced with PRETTYBOX
GCG's SeqLab X-Windows
interface has a superb multiple
sequence editor - the best editor
of any kind.
Other editors


The MACAW and SeqVu programs for
Macintosh and the GeneDoc and DCSE
programs for PCs are free and provide
excellent editor functionality.
Many “comprehensive” molecular
biology programs include multiple
alignment functions:

MacVector, OMIGA, Vector NTI, and
GeneTool/PepTool all include a built-in
version of CLUSTAL
SeqVu
Editors on the Web

Check out CINEMA (Colour
INteractive Editor for Multiple
Alignments)
 It is an editor created completely in
Java (old browsers beware)
 It includes a fully functional version
of CLUSTAL, BLAST, and a
DotPlot module
http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/
