Multiple Sequence Alignment

Definition
• Homology: related by descent
• Homologous sequence positions
Example: ATTGCGC and ATACGC aligned so that homologous positions share a column
ATTGCGC
AT-ACGC
Reasons for aligning sets of sequences
• Organise data to reflect sequence homology
• Estimate evolutionary distance
• Infer phylogenetic trees from homologous sites
• Highlight conserved sites/regions
• Highlight variable sites/regions
• Uncover changes in gene structure
• Look for evidence of selection
• Summarise information
Alignments help to
• Organise
• Visualise
• Analyse
sequence data
The process of aligning sequences
is a game involving playing off gaps
and mismatches
Ways of aligning multiple
sequences
• By hand
• Automated
• Combination
Definition
Optimality criteria: some kind of rule or
scoring scheme that helps you decide which
alignment you consider to be the best
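One common scoring scheme of this kind is a sum-of-pairs score. The sketch below is only an illustration: the function names and the match, mismatch and gap weights are assumptions chosen for the example, not values from any particular program.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned symbols; '-' marks a gap."""
    if a == "-" and b == "-":
        return 0  # gap against gap carries no information
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs(alignment, **weights):
    """Add pair_score over every column and every pair of rows."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b, **weights)
    return total

print(sum_of_pairs(["ATTGCGC", "AT-ACGC"]))
print(sum_of_pairs(["ATTGCGC", "ATA-CGC"]))
```

With these illustrative weights both gap placements receive the same score, which shows that an optimality criterion alone may not single out one alignment.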
Pairwise vs Multiple Sequences
• Pairs of sequences typically aligned using
exhaustive algorithms (dynamic
programming)
– complexity of exhaustive methods is O(2^n m^n)
n = number of sequences, m = sequence length
• Multiple sequence alignment usually
performed using heuristic methods
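For two sequences the exhaustive dynamic programming approach is small enough to write out. The following is a minimal sketch of global alignment (Needleman-Wunsch) with a linear gap cost; the scoring values are assumptions for illustration only.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment by dynamic programming (linear gap cost)."""
    def sub(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    a, b, i, j = [], [], len(s), len(t)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(a)), "".join(reversed(b))

best, row1, row2 = needleman_wunsch("ATTGCGC", "ATACGC")
print(best)
print(row1)
print(row2)
```

The table has (m+1) x (m+1) cells for two sequences of length m; extending the same idea to n sequences needs an n-dimensional table, which is why exhaustive methods become impractical for multiple alignment.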
The Correct Alignment
Two candidate alignments of ATTGCGC with ATACGC:
ATTGCGC
AT-ACGC
or
ATTGCGC
ATA-CGC
The Correct Alignment
                                           Exhaustive methods   Heuristic methods
Correct according to optimality criteria   Always               Not always
Correct according to homology              Not always           Not always
• Sequence alignment is easy with
sufficiently closely related sequences
• Below a certain level of identity sequence
alignment may become meaningless
– twilight zone for amino acid sequences: around 30% identity
• In the twilight zone it is good to make use
of additional information if possible (e.g.
structure)
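Percent identity can be computed directly from two rows of an alignment. This is a minimal sketch; the choice of denominator (columns where neither row has a gap) is an assumption, and other definitions are in use.

```python
def percent_identity(a, b, gap="-"):
    """Percent identical residues over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("rows must come from the same alignment")
    compared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not compared:
        return 0.0
    same = sum(1 for x, y in compared if x == y)
    return 100.0 * same / len(compared)

print(round(percent_identity("ATTGCGC", "AT-ACGC"), 1))
```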
Consensus Sequences
• Simplest Form:
A single sequence which represents the most
common amino acid/base in that position
[Figure: example alignment columns with a consensus row; columns without a single most common residue are written as alternatives, e.g. A/I and V/L]
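The simplest form of consensus described above can be sketched as follows. The example rows are hypothetical, chosen only so that two columns tie; ties are written with a slash, as in A/I.

```python
from collections import Counter

def consensus(alignment, gap="-"):
    """Most common residue per column; ties are joined with '/'."""
    result = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != gap)
        if not counts:
            result.append(gap)  # column contains only gaps
            continue
        top = max(counts.values())
        # Keep tied residues in order of first appearance in the column.
        winners = [r for r in dict.fromkeys(column) if counts.get(r) == top]
        result.append("/".join(winners))
    return result

rows = ["YDGAV", "YDGIL", "FEGIL", "YDGAV"]  # hypothetical example rows
print(" ".join(consensus(rows)))
```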
Multiple Alignment Formats
e.g. Clustal, Phylip, MSF, MEGA etc.
Clustal Format
CLUSTAL X (1.81) multiple sequence alignment

CAS1_BOVIN      MKLLILTCLVAVALARPKHPIKHQGLPQ--------EVLNEN
CAS1_SHEEP      MKLLILTCLVAVALARPKHPIKHQGLSP--------EVLNEN
CAS1_PIG        MKLLIFICLAAVALARPKPPLRHQEHLQNEPDSRE-------
CAS1_HUMAN      MRLLILTCLVAVALARPKLPLRYPERLQNPSESSE-------
CAS1_RABBIT     MKLLILTCLVATALARHKFHLGHLKLTQEQPESSEQEILKERK
CAS1_MOUSE      MKLLILTCLVAAAFAMPRLHSRNAVSSQTQ------QQHSSSE
CAS1_RAT        MKLLILTCLVAAALALPRAHRRNAVSSQTQ------------
                *:***: **.*.*:* : . :
Phylip Format (Interleaved)
 7 100
SOMA_BOVIN MMAAGPRTSL LLAFALLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
SOMA_SHEEP MMAAGPRTSL LLAFTLLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
SOMA_RAT_P -MAADSQTPW LLTFSLLCLL WPQEAGAFPA MPLSSLFANA VLRAQHLHQL
SOMA_MOUSE -MATDSRTSW LLTVSLLCLL WPQEASAFPA MPLSSLFSNA VLRAQHLHQL
SOMA_RABIT -MAAGSWTAG LLAFALLCLP WPQEASAFPA MPLSSLFANA VLRAQHLHQL
SOMA_PIG_P -MAAGPRTSA LLAFALLCLP WTREVGAFPA MPLSSLFANA VLRAQHLHQL
SOMA_HUMAN -MATGSRTSL LLAFGLLCLP WLQEGSAFPT IPLSRLFDNA MLRAHRLHQL

           AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
           AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDV
           AFDTYQEFEE AYIPKEQKYS FLQNPQTSLC FSESIPTPSN REETQQKSNL
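Reading the interleaved layout back into whole sequences is mostly bookkeeping. The sketch below assumes the strict form of the format (a 10-character name field in the first block, sequence only in later blocks); the function name is made up for this example.

```python
def parse_phylip_interleaved(text):
    """Parse interleaved PHYLIP text into a {name: sequence} dict.

    Assumes strict 10-character names in the first block; blank lines
    between blocks and spaces inside sequence data are ignored.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_seqs, length = map(int, lines[0].split())
    names, seqs = [], []
    for ln in lines[1:1 + n_seqs]:  # first block carries the names
        names.append(ln[:10].strip())
        seqs.append(ln[10:].replace(" ", ""))
    for k, ln in enumerate(lines[1 + n_seqs:]):  # later blocks: data only
        seqs[k % n_seqs] += ln.replace(" ", "")
    for name, seq in zip(names, seqs):
        if len(seq) != length:
            raise ValueError(f"{name}: expected {length}, got {len(seq)}")
    return dict(zip(names, seqs))

example = "\n".join([
    " 2 12",
    "Seq_A".ljust(10) + "ACGTAC",
    "Seq_B".ljust(10) + "ACG-AC",
    "",
    "GTTGCA",
    "GT-GCA",
])
print(parse_phylip_interleaved(example))
```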
Phylip Format (Sequential)
3 100
Rat
ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTTAATGGCCG
TGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTAA
Mouse
ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTTGCCT
TGGGGAAAGGTGAACTCCGATGAAGTTGGTGGTGAGGCCCTGGG
Rabbit
ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGCGGTCACTGC
TGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGG
Mega Format
#mega
TITLE: No title
#Rat       ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Mouse     ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Rabbit    ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
#Human     ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
#Oppossum  ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
#Chicken   ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
#Frog      ---ATGGGTTTGACAGCACATGATCGT---CAGCT
Progressive Multiple Alignment
• Heuristic
• Perform pairwise alignments
• Align sequences to alignments or
alignments to existing alignments (profile
alignments)
• Do the alignments in some sensible order
Progressive versus Simultaneous
• speed versus accuracy
• simultaneous methods are capable of
working out an ‘exact’ solution to the
problem of multiple sequence alignment
(e.g. NCBI’s MSA – user interface QAlign)
Iterative methods
• Several progressive alignment methods can
be iterated
– e.g. Barton-Sternberg, ClustalX
ClustalX Algorithm
• Perform pairwise alignments and calculate
distances for all pairs of sequences
• Construct guide tree (dendrogram) joining the
most similar sequences using Neighbour Joining
• Align sequences progressively, starting at the leaves
of the guide tree. This involves pairwise alignments as
well as aligning a single sequence to a
group of sequences (a profile)
• ClustalX is not optimal
• There are known areas in which ClustalX
performs badly e.g.
– errors introduced early cannot be corrected by
subsequent information
– alignments of sequences of differing lengths
cause strange guide trees and unpredictable
effects
– edges: ClustalX does not penalise gaps at edges
• There are alternatives to ClustalX available
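The guide-tree step of the ClustalX algorithm decides the order in which sequences and groups are aligned. The sketch below illustrates that idea only: it uses simple average-linkage clustering (UPGMA-style) rather than Neighbour Joining, and the names and distances are hypothetical.

```python
from itertools import combinations

def guide_order(names, dist):
    """Return the order in which clusters are merged (average linkage).

    names: list of sequence names
    dist:  {frozenset({a, b}): distance} for every pair of names
    """
    clusters = [(n,) for n in names]

    def avg(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        # Join the two closest clusters, as a progressive aligner would
        # align the most similar sequences or groups first.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg(*p))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

d = {frozenset({"A", "B"}): 0.1,   # hypothetical pairwise distances
     frozenset({"A", "C"}): 0.5,
     frozenset({"B", "C"}): 0.6}
for step in guide_order(["A", "B", "C"], d):
    print(step)
```

Because each merge is final, a poor early join is carried through every later step, which is the first weakness listed above.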
T-Coffee
• JMB 2000
• Also a progressive alignment method
• Designed to solve some of the problems
with Clustal (in particular Clustal's
inability to correct errors that
appear early in the process of alignment)
• Can consider global and local pair-wise
alignments
Using ClustalX
• Start with sequences in FASTA format (or
an existing alignment in Clustal format)
• [Do Alignment] on the alignment menu
ClustalX Parameters
• Scoring Matrix
• Gap opening penalty
• Gap extension penalty
• Protein gap parameters
• Additional algorithm parameters
• Secondary structure penalties
Score Matrices
• Pairwise matrices and multiple alignment
matrix series
• PAM (Dayhoff), BLOSUM (Henikoff),
GONNET (default), user defined
• Transition (A<->G, C<->T)/transversion
ratio – low for distantly related sequences
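The transition/transversion ratio can be estimated from a pair of aligned DNA sequences by counting the two kinds of difference. A minimal sketch, with hypothetical example sequences; gaps and ambiguity codes are simply skipped.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def ts_tv_counts(a, b):
    """Count transitions and transversions between two aligned DNA rows."""
    ts = tv = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue  # identical, gap or ambiguous position: not counted
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            ts += 1   # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            tv += 1   # purine <-> pyrimidine
    return ts, tv

ts, tv = ts_tv_counts("AACGT-", "GTCAT-")
print(ts, tv)
```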
Gap Penalties
• Linear gap penalties vs affine gap penalties
p = o + l·e (o = gap opening penalty, e = gap extension penalty, l = gap length)
• Gap opening
• Gap extension
• Protein specific penalties (on by default)
– Increase the probability of gaps associated with certain
residues
– Increase the chances of gaps in loop regions (> 5
hydrophilic residues)
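The effect of separating gap opening from gap extension is easiest to see with numbers. The sketch below follows the formula p = o + l·e; the penalty values are hypothetical and are not ClustalX defaults.

```python
def linear_gap_penalty(length, per_residue):
    """Every gapped position costs the same."""
    return length * per_residue

def affine_gap_penalty(length, opening, extension):
    """p = o + l*e: a one-off opening cost plus a cost per gapped position."""
    return opening + length * extension if length > 0 else 0

# One gap of length 6 versus six separate gaps of length 1.
one_long = affine_gap_penalty(6, opening=10, extension=1)
six_short = 6 * affine_gap_penalty(1, opening=10, extension=1)
print(one_long, six_short)
print(linear_gap_penalty(6, 2), 6 * linear_gap_penalty(1, 2))
```

Under the affine scheme one long gap is much cheaper than many short ones, whereas the linear scheme cannot tell the two cases apart.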
Algorithm parameters
• Slow-accurate pair-wise alignment
• Do alignment from guide tree
• Reset gaps before aligning (iteration)
• Delay Divergent sequences (%)
Additional displays
• Column Scores
• Low quality regions
• Exceptional residues
Multiple Alignment Tips
• Align pairs of sequences using an optimal method
• Use progressive alignment programs such as ClustalX
for multiple alignment
• Choose the representative sequences to align with care
• Choose sequences of comparable lengths
• Progressive alignment programs may be combined
• Review alignment by eye and edit
• If you have a choice align amino acid sequences
rather than nucleotides
Alignment of coding regions
• Nucleotide sequences are much harder to align
accurately than proteins
• Protein coding sequences can be aligned using the
protein sequences
– e.g. BioEdit: toggle translation to amino acid, call
ClustalW to align, edit alignment by hand, toggle back
to nucleotide
• In-frame nucleotide alignments can be used, e.g.
to determine non-synonymous and synonymous
distances separately
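Toggling back to nucleotides amounts to copying the gaps of the protein alignment onto the coding sequence, three positions at a time. A minimal sketch with a made-up function name; it assumes the coding sequence starts in frame and does not check that it really translates to the protein.

```python
def protein_to_codon_alignment(aligned_protein, cds, gap="-"):
    """Copy the gaps of an aligned protein row onto its coding sequence."""
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is shorter than 3 x protein length")
    out, pos = [], 0
    for aa in aligned_protein:
        if aa == gap:
            out.append(gap * 3)           # one protein gap = one codon gap
        else:
            out.append(cds[pos:pos + 3])  # next codon, kept in frame
            pos += 3
    return "".join(out)

print(protein_to_codon_alignment("MV-H", "ATGGTGCAC"))
```

Every gap in the result is a multiple of three, so the alignment stays in frame for codon-based analyses.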
Multiple Alignments and Phylogenetic Trees
– You can make a more accurate multiple
sequence alignment if you know the tree
already
– A phylogenetic tree is only as good as the
alignment from which it was produced
– The process of constructing a multiple
alignment (unlike pair-wise) needs to take
account of phylogenetic relationships
Editing a multiple sequence
alignment
• It is NOT fraud to edit a multiple sequence
alignment
• Incorporate additional knowledge if
possible
• Alignment editors help to keep the data
organised and help to prevent unwanted
mistakes
Alignment Editors
• e.g. GDE, BioEdit, Seaview, Jalview etc.
• Some alignment editors have begun to
function as sequence analysis platforms
(e.g. tools on BioEdit, GDE)
• Construct sub-sequences (GDE, Seaview)
• Annotate sequences (Seaview)
Aligning weakly similar
sequences
Sequence contains conserved
regions
• e.g. DIALIGN (Morgenstern, Dress, Werner)
http://bibiserv.techfak.uni-bielefeld.de/
– re-aligns regions between conserved blocks
– useful if sequences contain consistent conserved blocks
• Block Maker – searches for conserved words that
may be inconsistent http://blocks.fhcrc.org/
Profile Alignment
Gribskov et al. 1987
• Position specific scores
• Allows addition of extra sequence(s) to an
alignment
• Allows alignment of alignments
• Gaps introduced as whole columns in the separate
alignments
• Optimal alignment in time O(a^2 l^2)
a = alphabet size, l = sequence length
• Information about the degree of conservation of
sequence positions is included
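Position-specific scores can be illustrated with a small log-odds profile built from column frequencies. This is a simplified sketch, not the scoring of Gribskov et al.: the pseudocount, the uniform background and the ungapped scoring are all assumptions made for brevity.

```python
from collections import Counter
from math import log2

def build_profile(alignment, alphabet="ACGT", pseudocount=1.0):
    """One dict per column: residue -> log2(frequency / background)."""
    background = 1.0 / len(alphabet)
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({
            r: log2(((counts[r] + pseudocount) / total) / background)
            for r in alphabet
        })
    return profile

def score_against_profile(profile, seq):
    """Ungapped score of seq laid on the profile, column by column."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, seq))

profile = build_profile(["ACGT", "ACGT", "ACGA"])  # hypothetical alignment
print(round(score_against_profile(profile, "ACGT"), 2))
print(round(score_against_profile(profile, "TTTT"), 2))
```

A conserved column gives a high score to its usual residue and a low score to the others, while a variable column scores all residues more evenly; this is how the degree of conservation enters the alignment.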
Good reasons to use profile
alignments
– Adding a new sequence to an existing multiple
alignment that you want to keep fixed
(align sequence to profile)
– Searching a database for new members of your protein
family
(pfsearch)
– Searching a database of profiles to find out which one
your sequence belongs to
(pfscan)
– Combining two multiple sequence alignments
(profile to profile)
Profile Alignment Using
ClustalX
• Profile Alignment Mode
• Align sequence to profile
• Align profile 1 to profile 2
• Secondary structure parameters
Profile searching using PSI-BLAST
• Position-Specific Iterated BLAST
• Perform search – construct profile –
perform search
• Convergence (hopefully…)
• Increased sensitivity for distantly related
sequences
• Available on-line (NCBI)
Databases of Aligned Sequences
• Hovergen http://pbil.univ-lyon1.fr/databases/hovergen.html
(vertebrate alignments)
• Pfam http://www.sanger.ac.uk/Software/Pfam/
(protein domain alignments and profile HMMs)
• BLOCKS http://blocks.fhcrc.org/
• Ribosomal Database Project
http://rdp.cme.msu.edu/html/ alignments and trees
derived from rRNA sequences
• Interpro – combines information from other
sources
• Many more…
Probabilistic Models of Sequence
Alignment
• Hidden Markov Models
– sequence of states and associated symbol probabilities
• Produces a probabilistic model of a sequence
alignment
• Align a sequence to a Profile Hidden Markov
Model
– Algorithms exist to find the most probable path
through the model
Markov Chain: A chain of things. The
probability of the next thing depends only
on the current thing
Hidden Markov Model: A sequence of states
which form a Markov Chain. The states are
not observable. The observable characters
have “emission” probabilities which depend
on the current state.
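The definitions above can be made concrete with a toy model. The sketch below runs the Viterbi algorithm, which finds the most probable sequence of hidden states for an observed sequence. It is a plain two-state HMM rather than a profile HMM, and the state names and all probabilities are invented for the example (every probability must be non-zero because the code works with logarithms).

```python
from math import log

def viterbi(observed, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for an observed sequence (log space)."""
    best = [{s: log(start_p[s]) + log(emit_p[s][observed[0]]) for s in states}]
    back = [{}]
    for symbol in observed[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best previous state from which to enter state s.
            prev = max(states, key=lambda p: best[-1][p] + log(trans_p[p][s]))
            scores[s] = (best[-1][prev] + log(trans_p[prev][s])
                         + log(emit_p[s][symbol]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))

states = ["AT_rich", "GC_rich"]  # hypothetical hidden states
start = {"AT_rich": 0.5, "GC_rich": 0.5}
trans = {"AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
         "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9}}
emit = {"AT_rich": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
        "GC_rich": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}
print(viterbi("AATTGGCC", states, start, trans, emit))
```

The states are never observed directly; only the emitted characters are, and the path is inferred from them.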
Some more recent developments
• The need to align genomes
– alignment tools required that can align very
large regions of genomes
– poses a computational challenge
– programmes such as DIALIGN can be run in
parallel on multiprocessor machines
Some more recent developments
• MUSCLE
– Faster (uses k-mer frequencies to calculate the first pairwise distances)
– Progressive (repeats the MSA using the more accurate
Kimura distance between aligned amino acid sequences)
– Has a third optimisation stage that involves making
profile alignments of sub-trees and accepting the new
alignment if it improves the SP score.
• MuSiC - multiple sequence alignment with
constraints
– web server that allows a user to enter a set of
