Multiple Alignment


Previous Lecture:
Hypothesis Testing 2: Comparing Samples
Stuart M. Brown
NYU School of Medicine
Learning Objectives






Understand the need for multiple alignment
methods in biology
Optimal methods (dynamic programming) are
not practical to align many sequences
Progressive pairwise approach
Profile alignments
Editing alignments
Sequence Logos
Reasons for aligning
sets of sequences








Organize data to reflect sequence homology
Estimate evolutionary distance
Infer phylogenetic trees from homologous sites
Highlight conserved sites/regions (motifs)
Highlight variable sites/regions
Uncover changes in gene structure
Look for evidence of selection
Summarize information
Pairwise Alignment


The alignment of two sequences (DNA or
protein) is a relatively straightforward
computational problem.
The standard way to find an optimal alignment is an approach
called Dynamic Programming.
Dynamic Programming


Dynamic Programming is a general
programming technique.
It is applicable when a large search space
can be structured into a succession of
stages, such that:
– the initial stage contains trivial solutions to
sub-problems
– each partial solution in a later stage can
be calculated by recurring on a fixed number
of partial solutions in an earlier stage
– the final stage contains the overall
solution
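A minimal sketch of how these three stages map onto scoring a global pairwise alignment (the Needleman-Wunsch scheme). The function name and the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not values from this lecture.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of a and b (illustrative scoring values)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initial stage: trivial sub-problems (a prefix aligned against nothing).
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Later stages: each cell is computed from exactly three earlier cells.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Final stage: the bottom-right cell holds the overall solution.
    return score[-1][-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("AGT", "AT"))
```

The sketch returns only the score; recovering the alignment itself would need a traceback through the same matrix.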
Multiple Alignments


Making an optimal alignment between two
sequences is computationally straightforward, but
aligning a large number of sequences using the
same method is almost impossible.
The computing time increases exponentially with the
number of sequences involved, so it becomes
impractically expensive for
large numbers of sequences.
Longer Sequences

      A    G    T              A    G    T    A
G    -1   -1   -2         G   -1   -1   -2    ?
T    -2   -2   -1         T   -2   -2   -1    ?
A    -2   -3   -3         A   -2   -3   -3    ?
                          C    ?    ?    ?    ?

What happens to the number of cells in the matrix when we add another
base to one sequence?
How about to both?

# cells = L1 x L2, or L^2 if we use 2 sequences of the same length.

So the amount of computing grows with the square of the sequence length: bad but
not terrible, because the compute time for each cell remains constant.
Align Three Sequences by
Dynamic programming
Georg Fullen, VSNS Biocomputing,
Univ. Munster
So how many cells (that contain values that must be computed) do we add for each additional
sequence? It's a power function! For N sequences of length L: # of cell computations = 2^N x L^N
(L^N cells, each computed from up to 2^N - 1 neighbouring cells).
This is very bad for computing alignments of a lot of sequences!
If the calculation takes 1 nanosecond per cell, then for 6 sequences of length 100, we'll have a
running time of 2^6 x 100^6 x 10^-9 seconds (64,000 seconds, about 18 hours). Just add 2 more sequences, and
the running time is 2^8 x 100^8 x 10^-9 = 2.6 x 10^9 seconds (~81 years).
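The arithmetic can be checked with a few lines of Python. This is only a sketch of the slide's 2^N x L^N cost model with 1 nanosecond per step; the function name is hypothetical.

```python
def msa_dp_seconds(n_sequences, length, seconds_per_cell=1e-9):
    """Estimated running time under the 2^N x L^N cost model."""
    return (2 ** n_sequences) * (length ** n_sequences) * seconds_per_cell


SECONDS_PER_DAY = 86400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY

for n in (2, 3, 6, 8):
    seconds = msa_dp_seconds(n, 100)
    print(
        f"N={n}: {seconds:.3g} s"
        f" = {seconds / SECONDS_PER_DAY:.3g} days"
        f" = {seconds / SECONDS_PER_YEAR:.3g} years"
    )
```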
Global vs. Local Multiple
Alignments


Global alignment algorithms start at the beginning of two
sequences and add gaps to each until the end of one is
reached.
Local alignment algorithms find the region (or regions)
of highest similarity between two sequences and build
the alignment outward from there. This creates inconsistent
gap regions between aligned blocks.
Optimal Alignment


For a given group of sequences, there is
no single "correct" alignment, only an
alignment that is "optimal" according to
some set of calculations.
Determining what alignment is best for a
given set of sequences is really up to the
judgment of the investigator.
Progressive Pairwise Methods


Most of the available multiple alignment programs use
some sort of incremental or progressive method that
makes pairwise (global) alignments, averages them into
a consensus (actually a profile), then adds new
sequences one at a time to the aligned set.
This is an approximate method!
Heuristic
– Perform quick pairwise alignments, score all similarities, build
a distance tree
– Align first pair of sequences (most similar pair)
– Build a profile of aligned sequences
– Align each new sequence to profile, rebuild profile
– Do the progressive alignments in a sensible order

Profile Alignment




Can represent two (or more) aligned sequences
as the frequency of each letter at each position.
Can slide a new sequence along this profile and
calculate a similarity score at each position
using a score function that gives value for a
match equal to the weighted frequency of that
letter in the profile.
Very similar to using a lookup table (PAM or
BLOSUM) for amino acid similarities
Can use the same method to align two profiles
with each other
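A minimal sketch of the idea: build a frequency profile from aligned sequences, then slide a new sequence along it and score each position. The function names are hypothetical, the score is the plain sum of matching frequencies (no PAM/BLOSUM weighting), and gaps in the new sequence are not allowed. The three aligned rows are taken from the SOMA alignment shown later in this lecture.

```python
from collections import Counter


def build_profile(aligned):
    """Per-column letter frequencies for equal-length aligned sequences."""
    n = len(aligned)
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        profile.append({letter: count / n for letter, count in counts.items()})
    return profile


def score_at(profile, seq, offset):
    """Sum of the profile frequencies matched by seq placed at this offset."""
    return sum(profile[offset + i].get(letter, 0.0) for i, letter in enumerate(seq))


def best_offset(profile, seq):
    """Offset along the profile where seq scores highest (ungapped)."""
    offsets = range(len(profile) - len(seq) + 1)
    return max(offsets, key=lambda off: score_at(profile, seq, off))


aligned = ["LLAFALLCLP", "LLAFTLLCLP", "LLTFSLLCLL"]
profile = build_profile(aligned)
offset = best_offset(profile, "LLCL")
print(offset, score_at(profile, "LLCL", offset))
```

Aligning two profiles works the same way, except that the score at each position compares two frequency columns rather than one letter against a column.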
CLUSTAL

CLUSTAL is the most popular multiple alignment program
Gap penalties can be adjusted based on specific amino acid
residues, regions of hydrophobicity, proximity to other gaps, or
secondary structure.
It can re-align just selected sequences or selected regions in an
existing alignment
It can compute phylogenetic trees from a set of aligned
sequences.
Unix command line program
Website: http://www.ebi.ac.uk/Tools/clustalw2/index.html




There are also Mac and PC versions with a nice graphical
interface (CLUSTALX).
Clustal Algorithm



Perform pairwise alignments and calculate
distances for all pairs of sequences
Construct guide tree (dendrogram) joining the
most similar sequences using Neighbour Joining
Align sequences, starting at the leaves of the
guide tree. This involves the pair-wise
comparisons as well as comparison of single
sequence with a group of seqs (profile)
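A rough sketch of the first step: score every pair and find the most similar one, which is where the progressive alignment would start. Python's difflib.SequenceMatcher is used here only as a stand-in for a quick pairwise alignment; it is an assumption for illustration and not what Clustal does. The guide tree itself (Neighbour Joining) is not built. The sequences are ungapped fragments from the SOMA example shown later in this lecture.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_distances(seqs):
    """Map each pair of names to a rough distance in [0, 1] (0 = identical)."""
    dist = {}
    for a, b in combinations(sorted(seqs), 2):
        # ratio() is a similarity in [0, 1]; a stand-in for an alignment score.
        similarity = SequenceMatcher(None, seqs[a], seqs[b], autojunk=False).ratio()
        dist[(a, b)] = 1.0 - similarity
    return dist


def most_similar_pair(seqs):
    """The pair with the smallest distance: the first pair to align."""
    dist = pairwise_distances(seqs)
    return min(dist, key=dist.get)


seqs = {
    "bovin": "MMAAGPRTSL",
    "sheep": "MMAAGPRTSL",
    "rat": "MAADSQTPW",
    "human": "MATGSRTSL",
}
print(most_similar_pair(seqs))
for pair, d in sorted(pairwise_distances(seqs).items(), key=lambda item: item[1]):
    print(pair, round(d, 3))
```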
CLUSTALW2 at the EBI website
http://www.ebi.ac.uk/Tools/clustalw2/index.html
(now replaced by Clustal Omega)
Clustal Parameters






Scoring Matrix
Gap opening penalty
Gap extension penalty
Protein gap parameters
Additional algorithm parameters
Secondary structure penalties
Score Matrices



Pairwise matrices and multiple alignment matrix
series
For Proteins: PAM (Dayhoff), BLOSUM
(Henikoff), GONNET (default), user defined
For DNA: transition (A<->G, C<->T) weight (zero in Clustal means
transitions are scored as mismatches; one means
transitions are scored as matches); should be low for
distantly related sequences
Gap Penalties

Linear gap penalties vs. affine gap penalties
Affine: p = o + l x e (o = gap opening penalty, l = gap length, e = gap extension penalty)

Gap opening / Gap extension
Penalizes multiple nearby gaps
Protein specific penalties (on by default)
– Increase the probability of gaps associated with
certain residues
– Increase the chances of gaps in loop regions
(> 5 hydrophilic residues)
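A small sketch comparing the two penalty schemes, using the formula p = o + l x e from this slide. The default values and function names are illustrative assumptions, not Clustal's defaults.

```python
def linear_gap_penalty(length, per_position=1.0):
    """Linear scheme: every gap position costs the same."""
    return length * per_position


def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.2):
    """Affine scheme: p = o + l * e for one gap of the given length."""
    if length <= 0:
        return 0.0  # no gap, no penalty
    return gap_open + length * gap_extend


# One long gap is cheaper than several short ones under the affine scheme.
print(affine_gap_penalty(6))
print(3 * affine_gap_penalty(2))
print(linear_gap_penalty(6), 3 * linear_gap_penalty(2))
```

With an affine penalty, one gap of length 6 costs less than three gaps of length 2, which favours fewer, longer gaps; the linear scheme cannot make that distinction.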
Algorithm parameters




Slow-accurate pair-wise alignment
Do alignment from guide tree
Reset gaps before aligning
(iteration)
Delay divergent sequences (%)
Additional displays



Column Scores
Low quality regions
Exceptional residues


ClustalX is not optimal
There are known areas in which ClustalX
performs badly, e.g.
– errors introduced early cannot be corrected by
subsequent information
– alignments of sequences of differing lengths
cause strange guide trees and unpredictable
effects
– edges: ClustalX does not penalise gaps at
edges

There are alternatives to ClustalX available
Other Multiple Alignment Tools

MUSCLE
http://www.ebi.ac.uk/Tools/muscle/index.html
(builds progressive alignment, then improves by
additional re-alignment of problem pairs)

T-COFFEE
http://www.ebi.ac.uk/Tools/t-coffee/
(Uses both local and global pairwise alignments –
SLOWER!)

MSA
Multiple Alignment Tips







Align pairs of sequences using an optimal method
Use progressive alignment programs such as ClustalX
for multiple alignment
Choose representative sequences to align
carefully
Choose sequences of comparable lengths
Progressive alignment programs may be
combined
Review alignment by eye and edit
If you have a choice, align amino acid sequences
rather than nucleotides
Alignment of coding
regions


Nucleotide sequences are much harder to align
accurately than proteins
Protein coding sequences can be aligned using the
protein sequences
– e.g. BioEdit: toggle translation to amino acid, call clustalw
to align, edit alignment by hand, toggle back to
nucleotide

In-frame nucleotide alignments can be used, e.g. to
determine non-synonymous and synonymous
distances separately
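The "toggle back to nucleotide" step can be sketched as follows: each residue in the aligned protein row is replaced by its codon, and each gap by three gap characters, so the nucleotide alignment stays in frame. The function name is hypothetical, and the sketch assumes the coding sequence starts in frame and does not check that it actually translates to the protein.

```python
def protein_to_codon_alignment(aligned_protein, cds, gap="-"):
    """Thread an ungapped coding sequence onto its gapped protein alignment row."""
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is shorter than 3 x number of residues")

    codons = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            codons.append(gap * 3)          # one protein gap = three nucleotide gaps
        else:
            codons.append(cds[pos:pos + 3])  # next codon for this residue
            pos += 3
    return "".join(codons)


print(protein_to_codon_alignment("MV-H", "ATGGTGCAC"))
```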
Editing Multiple
Alignments




There are a variety of tools that can be used to modify
and display a multiple alignment.
These programs can be very useful in formatting and
annotating an alignment for publication.
An editor can also be used to make modifications by
hand to improve biologically significant regions in a
multiple alignment created by an alignment program.
Many different file formats exist for alignments:
 Clustal, Phylip, MSF, MEGA
Consensus Sequences

Simplest Form:
A single sequence which represents the most
common amino acid/base in that position

[Figure: example protein alignment with its consensus row; positions with no single most common residue are shown as alternatives, e.g. A/I and V/L]
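A minimal sketch of the simplest consensus rule: take the most common residue in each column and report ties as alternatives. The function name and the toy alignments are illustrative assumptions, not the data from the slide.

```python
from collections import Counter


def consensus(aligned, gap="-"):
    """Most common residue per column; ties are joined with '/' (e.g. 'A/I')."""
    result = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res != gap)
        if not counts:
            result.append(gap)  # column contains only gaps
            continue
        top = max(counts.values())
        winners = sorted(res for res, n in counts.items() if n == top)
        result.append("/".join(winners))
    return result


print(consensus(["YDGAVEL", "YDGIVEL", "FEGILEL", "FDGALQV", "YEGAVQL"]))
print(consensus(["AV", "IL", "IV", "AL"]))
```

Other rules (majority only, plurality above a set percentage) would change only the step that picks the winners.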
Clustal Format
CLUSTAL X (1.81) multiple sequence alignment

CAS1_BOVIN      MKLLILTCLVAVALARPKHPIKHQGLPQ--------EVLNEN-
CAS1_SHEEP      MKLLILTCLVAVALARPKHPIKHQGLSP--------EVLNEN-
CAS1_PIG        MKLLIFICLAAVALARPKPPLRHQEHLQNEPDSRE--------
CAS1_HUMAN      MRLLILTCLVAVALARPKLPLRYPERLQNPSESSE--------
CAS1_RABBIT     MKLLILTCLVATALARHKFHLGHLKLTQEQPESSEQEILKERK
CAS1_MOUSE      MKLLILTCLVAAAFAMPRLHSRNAVSSQTQ------QQHSSSE
CAS1_RAT        MKLLILTCLVAAALALPRAHRRNAVSSQTQ-------------
                *:***: **.*.*:*  :
Phylip Format
(Interleaved)
7 100
SOMA_BOVIN MMAAGPRTSL LLAFALLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
SOMA_SHEEP MMAAGPRTSL LLAFTLLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
SOMA_RAT_P -MAADSQTPW LLTFSLLCLL WPQEAGAFPA MPLSSLFANA VLRAQHLHQL
SOMA_MOUSE -MATDSRTSW LLTVSLLCLL WPQEASAFPA MPLSSLFSNA VLRAQHLHQL
SOMA_RABIT -MAAGSWTAG LLAFALLCLP WPQEASAFPA MPLSSLFANA VLRAQHLHQL
SOMA_PIG_P -MAAGPRTSA LLAFALLCLP WTREVGAFPA MPLSSLFANA VLRAQHLHQL
SOMA_HUMAN -MATGSRTSL LLAFGLLCLP WLQEGSAFPT IPLSRLFDNA MLRAHRLHQL

           AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
           AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDV
           AFDTYQEFEE AYIPKEQKYS FLQNPQTSLC FSESIPTPSN REETQQKSNL
Phylip Format
(Sequential)
3 100
Rat
ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTTAATGGCCG
TGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTAA
Mouse
ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTTGCCT
TGGGGAAAGGTGAACTCCGATGAAGTTGGTGGTGAGGCCCTGGG
Rabbit
ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGCGGTCACTGC
TGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGG
Mega Format
#mega
TITLE: No title

#Rat       ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Mouse     ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Rabbit    ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
#Human     ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
#Oppossum  ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
#Chicken   ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
#Frog      ---ATGGGTTTGACAGCACATGATCGT---CAGCT
PILEUP Output
> more myseqs.msf

          1501                                              1550
Hsirf2    SERPSKKGKK PKTEKEDKVK HIKQEPVESS LGLSNGVSDL SPEYAVLTST
Muirf2    SERPSKKGKK PKTEKEERVK HIKQEPVESS LGLSNGVSGF SPEYAVLTSA
Chirf2    SERPSKKGKK TKSEKDDKFK QIKQEPVESS FGI.NGLNDV TSDY.FLSSS
Muirf1    LTRNQRKERK SKSSRDTKSK TKRKLCGDVS PDTFS..DGL SSSTLPDDHS
Ratirf1   LTKNQRKERK SKSSRDTKSK TKRKLCGDSS PDTLS..DGL SSSTLPDDHS
Hsirf1    LTKNQRKERK SKSSRDAKSK AKRKSCGDSS PDTFS..DGL SSSTLPDDHS
Chkirf1a  LTKDQKKERK SKSSREARNK SKRKLYEDMR MEESA..ERL TSTPLPDDHS
Hsirf3a   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Mmuirf3   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Hsirf5    GPAPTDSQPP EDYSFGAGEE EEEEEELQRM LPSLSLTDAV QSGPHMTPYS
Mmuirf6   IPQPQGS.VI NPGSTGSAPW DEKDNDVDED EEEDELEQSQ HHVPIQDTFP
Hump48    ...PPGIVSG QPGTQKVPSK RQHSSVSSER KEEEDAMQNC TLSPSVLQDS
Mup48     ...PAGTLPN QPRNQKSPCK RSISCVSPER EEN...MENG RTNGVVNHSD
Hsirf4    ...PEGAKKG AKQLTLEDPQ MSMSHPYTMT TPYPSLPA.Q VHNYMMPPLD
Mupip     ...PEGAKKG AKQLTLDDTQ MAMGHPYPMT APYGSLPAQQ VHNYMMPPHD
Huicsbp   ...PEEDQK. .......... .......... CKLGVATAGC VNEVTEMECG
Muicsbp   ...PEEEQK. .......... .......... CKLGVAPAGC MSEVPEMECG
Chkicsbp  ...PEEEQK. .......... .......... CKIGVGNGSS LTDVGDMDCS

          1551                                              1600
Hsirf2    IKNEVDSTVN IIVVGQSHLD SNIENQEIVT NPPDICQVVE VTTESDEQPV
Muirf2    IKNEVDSTVN IIVVGQSHLD SNIEDQEIVT NPPDICQVVE VTTESDDQPV
Chirf2    IKNEVDSTVN IVVVGQPHLD GSSEEQVIVA NPPDVCQVVE VTTESDEQPL
Muirf1    SYTTQGYLGQ DLDMER.DIT PALSPCVVSS SLSEWHMQMD I.IPDSTTDL
Ratirf1   SYTAQGYLGQ DLDMDR.DIT PALSPCVVSS SLSEWHMQMD I.MPDSTTDL
Hsirf1    SYTVPGYM.Q DLEVEQ.ALT PALSPCAVSS TLPDWHIPVE V.VPDSTSDL
Chkirf1a  SYTAHDYTGQ EVEVENTSIT LDLSSCEVSG SLTDWRMPME IAMADSTNDI
Hsirf3a   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Mmuirf3   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Hsirf5    LLKEDVKWPP TLQPPTLQPP VVLGPPAPDP SPLAPPPGNP AGFRELLSEV
Mmuirf6   FL........ NINGSPMAPA SVGNCSVGNC SPESVWP... ......KTEP
Hump48    LNNEEEGASG GAVHSDIGSS SSSSSPEPQE VTDTTEAPFQ ........GD
Mup48     SGSNIGGGGN GSNRSD...S NSNCNSELEE GAGTTEATIR ........ED
Hsirf4    RSWRDYVPDQ PHPEIPYQCP MTFGPRGHHW QGPACENGCQ VTGTFYACAP
Mupip     RSWRDYAPDQ SHPEIPYQCP VTFGPRGHHW QGPSCENGCQ VTGTFYACAP
Huicsbp   RSEIDELIKE .PSVDDYMGM IKRSPSP... P.DACRS..Q LLPDWWAHEP
Muicsbp   RSEIEELIKE .PSVDEYMGM TKRSPSP... P.EACRS..Q ILPDWWVQQP
Chkicsbp  PSAIDDLMKE PPCVDEYLGI IKRSPSP... PQETCRN..P PIPDWWMQQP
Editing a multiple
sequence alignment

It is NOT “cheating” to edit a multiple sequence
alignment
– heuristic alignment is approximate
Incorporate additional knowledge if possible
Alignment editors help to keep the data
organized and help to prevent unwanted
mistakes
Alignment editors



The MACAW and SeqVu programs for
Macintosh, and GeneDoc and DCSE for PCs,
are free and provide excellent editor
functionality.
BioEdit, Seaview, Jalview (web based)
Many “comprehensive” molecular biology
programs include multiple alignment
functions:

Sequencher, MacVector, DS Gene, and Vector NTI all
include a built-in version of CLUSTAL
EMBOSS tools




emma = clustal
plotcon = PLOTSIMILARITY
showalign = PRETTY
prettyplot ≈ PRETTYBOX
SeqVu
JalView

Install on your machine or run as a Java WebStart application

Check out CINEMA (Colour INteractive Editor for Multiple Alignments)
It is an editor created completely in JAVA (old browsers beware)
It includes a fully functional version of CLUSTAL, BLAST, and a DotPlot module
http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/
Analysis of Alignments

Once you have a multiple alignment,
what can you do with it?
1) Identify regions of similarity and difference
- Conserved regions may be functionally important,
and/or sites for inclusive (cross species) primer
design
- Variable regions may be functionally important,
and/or sites for gene/allele-specific primer design
2) Create a sequence logo
3) Build a Phylogenetic Tree (next week)
Format a Multiple Alignment
• The concept of a consensus sequence is implied by any
multiple alignment. There can be various rules for building
the consensus: simple majority rules, plurality by a
specific %, etc.
• The alignment may look nicer by showing how each letter
matches the consensus – highlight the differences.
1) PLOTSIMILARITY (a graph of overall similarity
across the alignment) EMBOSS = plotcon
2) Show match to consensus = showalign
3) Shade by similarity = prettyplot/Boxshade
Plurality: 2.00  Threshold: 4  AveWeight 0.55  AveMatch 2.91  AvMisMatch -2.00

PRETTY of: @pretty.list  October 7, 1998 10:35  ..

           1                                                   50
fa10.ugly  .......... .......... .......... ..TTttGESA D.PvtTtVE.
fa12.ugly  .......... .......... .......... ..TTatGESA D.PvtTtVE.
fo1k.ugly  .......... .......... .......... ..TTsaGESA D.PvtTtVE.
e.ugly     Gvenae.kgv tEnTna.Tad fvaqpvyLPe .nqT...... kv.Affynrs
p1m.ugly   GlgqmlEsmI .dnTvreTvg AatsrdaLPn teasGPthSk eiPALTAVET
p1s.ugly   GlgqmlEsmI .dnTvreTvg AatsrdaLPn teasGPahSk eiPALTAVET
p2s.ugly   GigdmiEgav .Egitknalv pptstnsLPg hkpsGPahSk eiPALTAVET
p3s.ugly   Giedliseva .qgal..Tls lpkqqdsLPd tkasGPahSk evPALTAVET
cb3.ugly   ...gpvEdaI .......T.. Aaigr..vad tvgTGPtnSe aiPALTAaET
r14.ugly   GlgdelEevI vEkT.kqTv. Asi....... ..ssGPkhtq kvPiLTAnET
r2.ugly    ...npvEnyI dEvlnevlv. .......vPn inssnPttSn saPALdAaET
Consensus  G-----E--I -E-T---T-- A------LP- --TTGPGESA D-PALTAVET

/////////////////////////////////////////////////////////////////

           301                                               349
fa10.ugly  aElyCPRPll AIkvtsqdRy KqKI.iAPa. ..KQll.... .........
fa12.ugly  aElyCPRPll AIevssqdRh KqKI.iAPg. ..KQll.... .........
fo1k.ugly  aEtyCPRPll AIhpt.eaRh KqKI.vAPv. ..KQTl.... .........
e.ugly     krvfCPRPtv ffPwpTsG.D Kidmtpragv lmlespnald isrty....
p1m.ugly   irvWCPRPPR AlaYygpGvD ykdgtltPls tkdlTTy... .........
p1s.ugly   irvWCPRPPR AvaYygpGvD ykdgtltPls tkdlTTy... .........
p2s.ugly   VrvWCPRPPR AvPYfgpGvD ykdg.ltPlp ekglTTy... .........
p3s.ugly   VrvWCPRPPR AvPYygpGvD yrn.nldPls ekglTTy... .........
cb3.ugly   VkaWiPRPPR lcqYekakn. vnfrssgvtt trqsiTtmtn tgaiwtti.
r14.ugly   VEaWiPRaPR AlPY.Tsigr tny..pknte pvikkrk.gd i.ksy....
r2.ugly    VkaWCPRPPR AleY.Trahr tnfkiedrsi qtaivTrpii ttagpsdmy
Consensus  VE-WCPRPPR AIPY-T-GRD K-KI--AP-- --KQTT---- ---------
Boxshade
Shade each letter of the alignment based on its match to the
consensus
– highlights conserved regions
– much more informative for protein alignments (shades
of grey for similar amino acids)
http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=boxshade
http://www.ch.embnet.org/software/BOX_form.html
Sequence Logos
http://weblogo.berkeley.edu/logo.cgi
http://weblogo.threeplusone.com/create.cgi
http://genome.tugraz.at/Logo/
T. D. Schneider and R. M. Stephens (1990). Sequence logos: a new way to display
consensus sequences. Nucleic Acids Research, Vol. 18, No 20, p. 6097-6100.
Seq Logos are based on
Information Theory

Height of the letter stack corresponds to the amount
of information present at that position in an
aligned region (motif); each letter's share of the stack
reflects its frequency at that position
– DNA has a max of 2 bits per position (log2 of 4 bases); protein has
a max of about 4.3 bits (log2 of 20 amino acids)

If many bases/amino acids are present at an
alignment position, there is very little
information
We will explore using motifs next week.
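A minimal sketch of the calculation behind a logo: information at a position is the maximum possible (log2 of the alphabet size) minus the observed Shannon entropy, and each letter's height is its share of that total. The function names are hypothetical, and the small-sample correction used by real logo software is left out as a simplifying assumption.

```python
from collections import Counter
from math import log2


def column_information(column, alphabet_size):
    """Information (bits) at one position: log2(alphabet size) minus entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * log2(n / total) for n in counts.values())
    return log2(alphabet_size) - entropy


def logo_heights(column, alphabet_size):
    """Height of each letter: its frequency times the column's information."""
    info = column_information(column, alphabet_size)
    total = len(column)
    return {letter: info * n / total for letter, n in Counter(column).items()}


for column in ("AAAA", "AACC", "ACGT"):
    print(column, column_information(column, 4), logo_heights(column, 4))
```

A fully conserved DNA column carries the full 2 bits, while a column with all four bases equally frequent carries none, so its letters have zero height.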
Summary






Understand the need for multiple alignment
methods in biology
Optimal methods (dynamic programming) are
not practical to align many sequences
Progressive pairwise approach
Profile alignments
Editing alignments
Sequence Logos
Next Lecture:
Sequence Motifs