Multiple Sequence Alignment

Definition
• Homology: related by descent
• Homologous sequence positions
[Figure: the sequences ATTGCGC and ATCCGC, aligned at homologous positions]
ATTGCGC
AT-CCGC
Reasons for aligning sets of
sequences
• Organise data to reflect sequence homology
• Infer phylogenetic trees from homologous
sites
• Highlight conserved sites/regions
• Highlight variable sites/regions
• Uncover changes in gene structure
• Summarise information
Alignments help to
Organise
Visualise
Analyze
Sequence Data
The process of aligning sequences
is a game involving playing off gaps
and mismatches
Ways of aligning multiple
sequences
• By hand
• Automated
• Combination
Definition
Optimality criteria: some kind of rule or
scoring scheme that helps you decide which
alignment you consider to be the best
Pairwise vs Multiple Sequences
• Pairs of sequences typically aligned using
exhaustive algorithms (dynamic
programming)
– complexity of exhaustive methods is O(2^n m^n)
n = number of sequences, m = sequence length
• Multiple sequence alignment using heuristic
methods
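A minimal sketch of the exhaustive pairwise case (Needleman-Wunsch global alignment by dynamic programming). The scoring scheme used here (match +1, mismatch -1, linear gap -1) is an assumption chosen for illustration, not a recommended optimality criterion.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align two residues
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ATTGCGC", "ATCCGC")
print(best)
print(top)
print(bottom)
```

Several alignments can share the same optimal score; which one the traceback returns depends on the tie-breaking order, so the result is optimal under the scoring scheme but not necessarily correct according to homology.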
The Correct Alignment
[Figure: two candidate alignments of ATTGCGC and ATCCGC]
ATTGCGC      ATTGCGC
AT-CCGC      ATC-CGC
The Correct Alignment
                                            Exhaustive methods   Heuristic methods
Correct according to optimality criteria    Always               Not always
Correct according to homology               Not always           Not always
• Sequence alignment is easy with
sufficiently closely related sequences
• Below a certain level of identity sequence
alignment may become meaningless
– twilight zone for amino acid sequences: ~30% identity
• In the twilight zone it is good to make use
of additional information if possible (e.g.
structure)
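As a rough illustration of what "level of identity" means, percent identity can be computed from an existing pairwise alignment. This is a sketch under one assumed convention: columns containing a gap are excluded from the denominator. Other conventions exist and give different numbers.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


print(round(percent_identity("ATTGCGC", "AT-CCGC"), 1))
```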
Consensus Sequences
• Simplest Form:
A single sequence which represents the most
common amino acid/base in that position
[Figure: columns of an example protein alignment with the consensus for each position; columns with two equally common residues are written as alternatives, e.g. A/I and V/L]
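The simplest form of consensus described above can be sketched as a per-column majority vote. The toy alignment below is hypothetical, and reporting ties as alternatives (e.g. A/I) is one possible convention; gap characters are counted like any other symbol here.

```python
from collections import Counter


def consensus(alignment):
    """Return the most common residue(s) for each column of an alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):
        counts = Counter(column)
        top = max(counts.values())
        # Ties are reported as sorted alternatives joined by "/".
        winners = sorted(res for res, c in counts.items() if c == top)
        result.append("/".join(winners))
    return result


toy = ["YDGAVEAL", "YDGILEAL", "FEGILQAL", "FDGAVQAV"]  # hypothetical data
print(" ".join(consensus(toy)))
```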
Multiple Alignment Formats
e.g. Clustal, Phylip, MSF, MEGA, etc.
Clustal Format
CLUSTAL X (1.81) multiple sequence alignment

CAS1_BOVIN      MKLLILTCLVAVALARPKHPIKHQGLPQ--------EVLNEN
CAS1_SHEEP      MKLLILTCLVAVALARPKHPIKHQGLSP--------EVLNEN
CAS1_PIG        MKLLIFICLAAVALARPKPPLRHQEHLQNEPDSRE-------
CAS1_HUMAN      MRLLILTCLVAVALARPKLPLRYPERLQNPSESSE-------
CAS1_RABBIT     MKLLILTCLVATALARHKFHLGHLKLTQEQPESSEQEILKERK
CAS1_MOUSE      MKLLILTCLVAAAFAMPRLHSRNAVSSQTQ------QQHSSSE
CAS1_RAT        MKLLILTCLVAAALALPRAHRRNAVSSQTQ------------
                *:***: **.*.*:* :
Phylip Format (Interleaved)
7 100
SOMA_BOVIN MMAAGPRTSL LLAFALLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
SOMA_SHEEP MMAAGPRTSL LLAFTLLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
SOMA_RAT_P -MAADSQTPW LLTFSLLCLL WPQEAGAFPA MPLSSLFANA VLRAQHLHQL
SOMA_MOUSE -MATDSRTSW LLTVSLLCLL WPQEASAFPA MPLSSLFSNA VLRAQHLHQL
SOMA_RABIT -MAAGSWTAG LLAFALLCLP WPQEASAFPA MPLSSLFANA VLRAQHLHQL
SOMA_PIG_P -MAAGPRTSA LLAFALLCLP WTREVGAFPA MPLSSLFANA VLRAQHLHQL
SOMA_HUMAN -MATGSRTSL LLAFGLLCLP WLQEGSAFPT IPLSRLFDNA MLRAHRLHQL

           AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
           AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDV
           AFDTYQEFEE AYIPKEQKYS FLQNPQTSLC FSESIPTPSN REETQQKSNL
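A minimal sketch of how an interleaved Phylip file could be read. It assumes the strict layout: a header with the number of sequences and sites, names in the first 10 columns of the first block, and later blocks listing the sequences in the same order without names. The function name and the toy input are hypothetical; relaxed Phylip variants would need a different parser.

```python
def parse_phylip_interleaved(text):
    """Parse strict interleaved Phylip text into a {name: sequence} dict."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_seqs, n_sites = (int(x) for x in lines[0].split())
    names, seqs = [], []
    for k, line in enumerate(lines[1:]):
        if k < n_seqs:
            # First block: 10-character name field, then sequence data.
            names.append(line[:10].strip())
            seqs.append("".join(line[10:].split()))
        else:
            # Later blocks: sequence data only, same order as the first block.
            seqs[k % n_seqs] += "".join(line.split())
    if any(len(s) != n_sites for s in seqs):
        raise ValueError("sequence length does not match header")
    return dict(zip(names, seqs))


example = "\n".join([
    "2 12",
    "SeqA".ljust(10) + " ATGCAT GCA",
    "SeqB".ljust(10) + " ATG-AT GCA",
    "",
    "           TTT",
    "           TTA",
])
print(parse_phylip_interleaved(example))
```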
Phylip Format (Sequential)
3 100
Rat
ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTTAATGGCCG
TGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTAA
Mouse
ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTTGCCT
TGGGGAAAGGTGAACTCCGATGAAGTTGGTGGTGAGGCCCTGGG
Rabbit
ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGCGGTCACTGC
TGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGG
Mega Format
#mega
TITLE: No title

#Rat       ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Mouse     ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Rabbit    ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
#Human     ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
#Oppossum  ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
#Chicken   ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
#Frog      ---ATGGGTTTGACAGCACATGATCGT---CAGCT
Progressive Multiple Alignment
• Heuristic
• Perform pairwise alignments
• Align sequences to alignments or
alignments to existing alignments (profile
alignments)
• Do the alignments in some sensible order
Iterative methods
• Several progressive alignment methods can
be iterated
– e.g. Barton-Sternberg, ClustalX
ClustalX Algorithm
• Perform alignments and calculate distances for all
pairs of sequences
• Construct guide tree (dendrogram) joining the
most similar sequences using Neighbour Joining
• Align sequences, starting at the leaves of the guide
tree. This involves the pair-wise comparisons as
well as comparison of single sequence with a
group of seqs (Profile)
• ClustalX is not optimal
• There are known areas in which ClustalX
performs badly e.g.
– errors introduced early cannot be corrected by
subsequent information
– alignments of sequences of differing lengths
cause strange guide trees and unpredictable
effects
– edges: ClustalX does not penalise gaps at edges
• There are alternatives to ClustalX available
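The guide-tree step of the algorithm above can be sketched as follows. Note the swap: this sketch uses simple average-linkage clustering rather than Neighbour Joining, because it is shorter, and the distances are made-up numbers, not values computed from real pairwise alignments. The nested tuples give the order in which sequences and profiles would be aligned.

```python
from itertools import combinations


def average_linkage_tree(names, dist):
    """Build a nested-tuple guide tree by repeatedly joining the closest clusters.

    dist maps frozenset({name_a, name_b}) to a pairwise distance.
    """
    # Each cluster is (subtree, list_of_leaf_names).
    clusters = [(name, [name]) for name in names]

    def cluster_distance(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(clusters[ij[0]],
                                                   clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical distances for three sequences.
d = {
    frozenset({"rat", "mouse"}): 0.1,
    frozenset({"rat", "rabbit"}): 0.4,
    frozenset({"mouse", "rabbit"}): 0.5,
}
print(average_linkage_tree(["rat", "mouse", "rabbit"], d))
```

In this toy case rat and mouse are joined first, and rabbit is then aligned to the rat/mouse profile. Any error made in that first pairwise alignment stays fixed, which is the weakness noted in the list above.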
Using ClustalX
• Start with sequences in FASTA format (or
an existing alignment in Clustal format)
• [Do Alignment] on the alignment menu
ClustalX Parameters
• Scoring Matrix
• Gap opening penalty
• Gap extension penalty
• Protein gap parameters
• Additional algorithm parameters
• Secondary structure penalties
Score Matrices
• Pairwise matrices and multiple alignment
matrix series
• PAM (Dayhoff), BLOSUM (Henikoff),
GONNET (default), user defined
• Transition (A<->G, C<->T) / Transversion (purine<->pyrimidine,
e.g. A<->C) ratio – low for distantly related sequences
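To make the transition/transversion distinction concrete, here is a small sketch that classifies the differences between two aligned nucleotide sequences. The function names are hypothetical, and this only counts observed differences; it is not how ClustalX applies its transition weight internally.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x, y):
    """Classify a pair of different bases as 'transition' or 'transversion'."""
    x, y = x.upper(), y.upper()
    if x not in PURINES | PYRIMIDINES or y not in PURINES | PYRIMIDINES:
        raise ValueError("expected bases A, C, G or T")
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"      # purine<->purine or pyrimidine<->pyrimidine
    return "transversion"        # purine<->pyrimidine


def ts_tv_counts(aligned_a, aligned_b):
    """Count (transitions, transversions), skipping gaps and identical columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    ts = tv = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-" or x == y:
            continue
        if substitution_type(x, y) == "transition":
            ts += 1
        else:
            tv += 1
    return ts, tv


print(ts_tv_counts("ACGT-A", "GCTTAA"))
```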
Gap Penalties
• Linear gap penalties – Affine gap penalties
p = o + l·e (o = gap opening penalty, e = gap extension penalty, l = gap length)
• Gap opening
• Gap extension
• Protein specific penalties (on by default)
– Increase the probability of gaps associated with certain
residues
– Increase the chances of gaps in loop regions (> 5
hydrophilic residues)
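A small sketch of the affine gap penalty formula, using the convention that a gap of length l costs the opening penalty plus l times the extension penalty. The numeric values are illustrative assumptions only, and some programs use (l - 1) extensions instead.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Penalty for a single gap of the given length: open + length * extend."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0
    return gap_open + length * gap_extend


# One gap of length 4 versus two separate gaps of length 2 (illustrative values).
one_long = affine_gap_penalty(4, gap_open=10.0, gap_extend=0.5)
two_short = 2 * affine_gap_penalty(2, gap_open=10.0, gap_extend=0.5)
print(one_long, two_short)
```

With a high opening penalty and a low extension penalty, one long gap is cheaper than several short ones, which favours alignments that group gaps together.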
Algorithm parameters
• Slow-accurate pair-wise alignment
• Do alignment from guide tree
• Reset gaps before aligning (iteration)
• Delay Divergent sequences (%)
Additional displays
• Column Scores
• Low quality regions
• Exceptional residues
Multiple Alignment Strategies
• Align pairs of sequences using an optimal method
• Choose representative sequences to align carefully
• Choose sequences of comparable lengths
• Progressive alignment programs such as ClustalX
for multiple alignment
• Progressive alignment programs may be combined
• Review alignment by eye and edit
Alignment of coding regions
• Nucleotide sequences are much harder to align
accurately than proteins
• Protein coding sequences can be aligned
using the protein sequences
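One way to use a protein alignment for coding sequences is to thread each unaligned nucleotide sequence onto its gapped protein sequence, one codon per residue. This is a minimal sketch; the function name is hypothetical, and it assumes the coding sequence has no stop codon and translates exactly to the ungapped protein (the translation itself is not checked).

```python
def thread_codons(gapped_protein, cds):
    """Insert '---' into a coding sequence wherever the protein has a gap."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(ch != "-" for ch in gapped_protein)
    if len(cds) % 3 != 0 or len(codons) != n_residues:
        raise ValueError("coding sequence does not match protein length")
    remaining = iter(codons)
    return "".join("---" if ch == "-" else next(remaining)
                   for ch in gapped_protein)


print(thread_codons("M-K", "ATGAAA"))
```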
Multiple Alignments and Phylogenetic Trees
– You can make a more accurate multiple
sequence alignment if you know the tree
already
– A good multiple sequence alignment is an
important starting point for drawing a tree
– The process of constructing a multiple
alignment (unlike pair-wise) needs to take
account of phylogenetic relationships
Editing a multiple sequence
alignment
• It is NOT fraud to edit a multiple sequence
alignment
• Incorporate additional knowledge if
possible
• Alignment editors help to keep the data
organised and help to prevent unwanted
mistakes
Alignment Editors
• e.g. GDE, BioEdit, Seaview, Jalview etc.
• Alignment editors can function as an
organisational tool (analysis tools in
BioEdit)
• Construct sub-sequences (GDE, Seaview)
• Annotate sequences (Seaview)
Aligning weakly similar
sequences
Sequence contains conserved
regions
• e.g. DIALIGN (Morgenstern, Dress, Werner)
– re-aligns regions between conserved blocks
– useful if sequences contain consistent conserved blocks
http://bibiserv.techfak.uni-bielefeld.de/
• Block Maker – searches for conserved words that
may be inconsistent http://blocks.fhcrc.org/
Profile Alignment
Gribskov et al. 1987
• Position specific scores
• Allows alignment of alignments
• Gaps introduced as whole columns in the separate
alignments
• Optimal alignment in time O(a^2 l^2)
a = alphabet size, l = sequence length
• Information about the degree of conservation of
sequence positions is included
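To illustrate position-specific scores, the sketch below builds a per-column residue frequency table from an alignment and scores an ungapped sequence of the same length against it. Using raw frequencies as scores is a simplifying assumption; real profile methods combine the column contents with a substitution matrix and position-specific gap penalties. The function names are hypothetical.

```python
from collections import Counter


def build_profile(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    n = len(alignment)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: c / n for res, c in counts.items()})
    return profile


def score_sequence(profile, seq):
    """Sum the column frequency of each residue (no gaps allowed)."""
    if len(seq) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    return sum(col.get(res, 0.0) for col, res in zip(profile, seq))


profile = build_profile(["AC", "AG"])
print(profile)
print(score_sequence(profile, "AC"), score_sequence(profile, "TG"))
```

A fully conserved column contributes more to the score than a variable one, which is how the degree of conservation enters the comparison.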
Good reasons to use profile
alignments
– Adding a new sequence to an existing multiple
alignment that you want to keep the same
(align sequence to profile)
– Searching a database for new members of your protein
family
(pfsearch)
– Searching a database of profiles to find out which one
your sequence belongs to
(pfscan)
– Combining two multiple sequence alignments
(profile to profile)
Profile Alignment Using
ClustalX
• Profile Alignment Mode
• Align sequence to profile
• Align profile 1 to profile 2
• Secondary structure parameters
Profile searching using PSI-BLAST
• Position Specific Iterative
• Perform search – construct profile –
perform search
• Convergence (hopefully…)
• Increased sensitivity for distantly related
sequences
• Available on-line (NCBI)
Databases of Aligned Sequences
• Hovergen http://pbil.univ-lyon1.fr/databases/hovergen.html (vertebrate
alignments)
• Pfam http://www.sanger.ac.uk/Software/Pfam/
(protein domain alignments and profile HMMs)
• BLOCKS http://blocks.fhcrc.org/
• Ribosomal Database Project
http://rdp.cme.msu.edu/html/ alignments and trees
derived from rRNA sequences
• Interpro – combines information from other
sources
• Many more…
Probabilistic Models of Sequence
Alignment
• Hidden Markov Models
– sequence of states and associated symbol probabilities
• Produces a probabilistic model of a sequence
alignment
• Align a sequence to a Profile Hidden Markov
Model
– Algorithms exist to find the most probable path
through the model
Markov Chain: A chain of things. The
probability of the next thing depends only
on the current thing
Hidden Markov Model: A sequence of states
which form a Markov Chain. The states are
not observable. The observable characters
have “emission” probabilities which depend
on the current state.
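The definitions above can be made concrete with a toy two-state Hidden Markov Model and the Viterbi algorithm, which finds the most probable sequence of hidden states for an observed sequence. The states "M" and "I" and all probabilities below are hypothetical values chosen for illustration; a real profile HMM has match, insert and delete states for every alignment column.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Return the most probable state path for a non-empty observation string."""
    # Work in log space to avoid underflow; all probabilities must be > 0.
    v = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for symbol in obs[1:]:
        column, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[-1][p] + math.log(trans[p][s]))
            column[s] = (v[-1][prev] + math.log(trans[prev][s])
                         + math.log(emit[s][symbol]))
            pointers[s] = prev
        v.append(column)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: v[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


states = ["M", "I"]
start = {"M": 0.5, "I": 0.5}
trans = {"M": {"M": 0.8, "I": 0.2}, "I": {"M": 0.2, "I": 0.8}}
emit = {"M": {"A": 0.9, "T": 0.1}, "I": {"A": 0.1, "T": 0.9}}
print(viterbi("AATT", states, start, trans, emit))
```

The observed characters (A, T) are visible, while the state path is inferred: here the model explains the two A's with state M and the two T's with state I.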