Editing Multiple Alignments
Multiple Sequence Alignments
Multiple Alignments
Generating multiple alignments
Web servers
Analyzing a multiple alignment
What makes a ‘good’ multiple alignment?
What can it tell us, and why is it useful?
Adjusting a multiple alignment
Alignment editors and HowTo
Demonstration and practice
What is a Multiple Alignment?
A comparison of sequences
“multiple sequence alignment”
A comparison of equivalents:
Structurally equivalent positions
Functionally equivalent residues
Secondary structure elements
Hydrophobic regions, polar residues
Generating multiple alignments
Pairwise sequence alignment is easy with
sufficiently closely related sequences.
Below a certain level of identity, sequence
alignment may become uncertain:
the twilight zone for amino acid sequences is around 30% identity.
In or below the twilight zone it is good to make
use of additional information, e.g. from evolution.
A multiple alignment of diverse sequences is
more informative than a pairwise alignment:
residues conserved over a longer period of time are under
stronger evolutionary constraints.
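To make the twilight-zone threshold concrete, here is a minimal sketch of a percent-identity calculation on two already-aligned sequences. The function names and the choice to count only columns where both sequences have a residue are assumptions made for illustration; other definitions of percent identity exist and give slightly different numbers.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where both sequences have a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns in which either sequence has a gap.
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def in_twilight_zone(aligned_a, aligned_b, threshold=30.0):
    """True if the pair is at or below the (approximate) 30% threshold."""
    return percent_identity(aligned_a, aligned_b) <= threshold


print(percent_identity("ACDE", "ACDF"))
print(in_twilight_zone("ACDEFGHIKL", "AXXXXXXXXX"))
```

A pair flagged by `in_twilight_zone` is a candidate for the extra evidence a multiple alignment can supply.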
Multiple Sequence Alignments
Algorithms
In practice, multiple sequence alignment relies on
heuristic methods:
with exact dynamic programming, computation
time grows exponentially as the number of
sequences increases.
Different methods/algorithms:
Segment-based (DiAlign, …).
Iterative (HMMs, DiAlign, PRRP, …).
Progressive (ClustalW, T-Coffee, MUSCLE, …).
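The growth in dynamic-programming cost can be illustrated with a small calculation. This sketch assumes every sequence has the same length and counts only the cells of the full alignment lattice; `dp_cells` is a name introduced here for illustration.

```python
def dp_cells(length, n_sequences):
    """Cells in a full dynamic-programming lattice for n sequences of equal
    length (each axis has length + 1 positions, including the empty prefix)."""
    return (length + 1) ** n_sequences


# Lattice size for sequences of 300 residues.
for n in (2, 3, 5, 10):
    print(n, dp_cells(300, n))
```

Two sequences need about ninety thousand cells; ten sequences need more cells than could ever be stored, which is why heuristics are used.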
Progressive Alignment
Step 1: Calculate all pairwise alignments and
calculate distances for all pairs of sequences.
Step 2: Construct guide tree joining the most
similar sequences using Neighbour Joining.
Step 1: pairwise distance matrix

     A   B   C   D   E
B    2
C    4   4
D    6   6   6
E    6   6   6   4
F    8   8   8   8   8

Step 2: [Figure: guide tree]
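Steps 1 and 2 can be sketched in a few lines. The distances below are the six-sequence example values as read from the slide's figure (treat them as an assumption, since the figure did not survive extraction cleanly). For brevity this sketch joins clusters with simple average linkage (UPGMA), not the Neighbour Joining method named on the slide; `upgma` and the nested-tuple tree format are illustrative choices.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a guide tree by repeatedly joining the closest pair of clusters.

    names: list of leaf labels
    dist:  dict mapping frozenset({a, b}) -> distance between leaves a and b
    Returns a nested tuple of labels.
    """
    clusters = {name: 1 for name in names}   # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair among the current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Average linkage: size-weighted mean of the old distances.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


names = ["A", "B", "C", "D", "E", "F"]
lower_triangle = {
    "B": [2],
    "C": [4, 4],
    "D": [6, 6, 6],
    "E": [6, 6, 6, 4],
    "F": [8, 8, 8, 8, 8],
}
dist = {
    frozenset((row, names[j])): value
    for row, values in lower_triangle.items()
    for j, value in enumerate(values)
}

tree = upgma(names, dist)
print(tree)
```

The closest pair (A and B) is joined first, and the most distant sequence (F) is attached last, which is the order in which a progressive aligner would then merge the sequences.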
Progressive Alignment
Step 3: From the tree assign weights for each
sequence:
We want to down-weight nearly identical sequences
and up-weight the most divergent ones.
Step 4: Align sequences, starting at the leaves
of the guide tree:
Pairwise comparisons as well as comparison of single
sequence with a group of sequences (Profile)
Caveat: errors introduced early cannot be
corrected by subsequent information
Web servers
ClustalW: http://www.ebi.ac.uk/Tools/clustalw2/
T-Coffee: http://www.ebi.ac.uk/Tools/t-coffee/
MUSCLE: http://www.ebi.ac.uk/Tools/muscle/
DiAlign: http://dialign.gobics.de/
... and more at
http://helix.nih.gov/apps/bioinfo/msa.html.
ClustalW features
Amino acid substitution matrices are varied at different
alignment stages according to the divergence of the
sequences to be aligned.
Reduced gap penalties in hydrophilic regions
encourage new gaps in potential loop regions rather
than regular secondary structure.
Insertions and deletions are
more common in loop regions
than in the core of the protein!
T-Coffee features
More accurate than ClustalW
Instead of amino acid substitution matrices,
uses consistency in a library of pairwise
alignments
Vertices represent positions in a protein
sequence. Edges represent pairwise
alignments between protein sequences.
If residues i and j have many common
neighbours, their consistency is high.
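The consistency idea can be sketched as counting how many third sequences support a given residue pair. This is a simplified, unweighted illustration, not T-Coffee's actual scoring: the `library` layout (a dict of position maps keyed by ordered sequence-name pairs) and the function name are assumptions made for this example.

```python
def consistency(library, a, i, b, j):
    """Count support for aligning residue i of sequence a with residue j of b.

    library maps an ordered pair of sequence names (x, y) to a dict
    {position in x: position in y} taken from their pairwise alignment.
    """
    # Direct evidence from the a-b pairwise alignment.
    support = 1 if library.get((a, b), {}).get(i) == j else 0
    # Indirect evidence: i aligns to some k in c, and k aligns to j in b.
    others = {name for pair in library for name in pair} - {a, b}
    for c in others:
        k = library.get((a, c), {}).get(i)
        if k is not None and library.get((c, b), {}).get(k) == j:
            support += 1
    return support


# Toy library for three sequences A, B and C (hypothetical alignments).
library = {
    ("A", "B"): {0: 0, 1: 1},
    ("A", "C"): {0: 0, 1: 2},
    ("C", "B"): {0: 0, 2: 1},
}
print(consistency(library, "A", 1, "B", 1))
print(consistency(library, "A", 0, "B", 1))
```

A pair supported both directly and through a common neighbour scores higher than one supported by neither, which is the signal used in place of a substitution matrix.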
MUSCLE
Fast implementation
Sometimes more accurate than ClustalW
or T-Coffee
Example
Let’s build a multiple alignment for the following
sequences:
>query
MKNTLLKLGVCVSLLGITPFVSTISSVQAERTVEHKVIKNETGTISISQLNKNVW
VHTELGYFSGEAVPSNGLVLNTSKGLVLVDSSWDDKLTKELIEMVEKKFKKRV
TDVIITHAHADRIGGMKTLKERGIKAHSTALTAELAKKNGYEEPLGDLQSVTNLK
FGNMKVETFYPGKGHTEDNIVVWLPQYQILAGGCLVKSASSKDLGNVADAYV
NEWSTSIENVLKRYGNINLVVPGHGEVGDRGLLLHTLDLLK
>gi|2984094
MGGFLFFFLLVLFSFSSEYPKHVKETLRKITDRIYGVFGVYEQVSYENRGFISNAY
FYVADDGVLVVDALSTYKLGKELIESIRSVTNKPIRFLVVTHYHTDHFYGAKAFR
EVGAEVIAHEWAFDYISQPSSYNFFLARKKILKEHLEGTELTPPTITLTKNLNVYLQ
VGKEYKRFEVLHLCRAHTNGDIVVWIPDEKVLFSGDIVFDGRLPFLGSGNSRTWL
VCLDEILKMKPRILLPGHGEALIGEKKIKEAVSWTRKYIKDLRETIRKLYEEGCDVE
CVRERINEELIKIDPSYAQVPVFFNVNPVNAYYVYFEIENEILMGE
>gi|115023|sp|P10425|
MKKNTLLKVGLCVSLLGTTQFVSTISSVQASQKVEQIVIKNETGTISISQLNKNVW
VHTELGYFNGEAVPSNGLVLNTSKGLVLVDSSWDNKLTKELIEMVEKKFQKRVTD
VIITHAHADRIGGITALKERGIKAHSTALTAELAKKSGYEEPLGDLQTVTNLKFGNTK
VETFYPGKGHTEDNIVVWLPQYQILAGGCLVKSAEAKNLGNVADAYVNEWSTSIE
NMLKRYRNINLVVPGHGKVGDKGLLLHTLDLLK
>gi|115030|sp|P25910|
MKTVFILISMLFPVAVMAQKSVKISDDISITQLSDKVYTYVSLAEIEGWGMVPSNGM
IVINNHQAALLDTPINDAQTEMLVNWVTDSLHAKVTTFIPNHWHGDCIGGLGYLQR
KGVQSYANQMTIDLAKEKGLPVPEHGFTDSLTVSLDGMPLQCYYLGGGHATDNIV
VWLPTENILFGGCMLKDNQATSIGNISDADVTAWPKTLDKVKAKFPSARYVVPGH
GDYGGTELIEHTKQIVNQYIESTSKP
>gi|282554|pir||S25844
MTVEVREVAEGVYAYEQAPGGWCVSNAGIVVGGDGALVVDTLSTIPRARRLAEWV
DKLAAGPGRTVVNTHFHGDHAFGNQVFAPGTRIIAHEDMRSAMVTTGLALTGLWP
RVDWGEIELRPPNVTFRDRLTLHVGERQVELICVGPAHTDHDVVVWLPEERVLFAGD
VVMSGVTPFALFGSVAGTLAALDRLAELEPEVVVGGHGPVAGPEVIDANRDYLRWV
QRLAADAVDRRLTPLQAARRADLGAFAGLLDAERLVANLHRAHEELLGGHVRDAM
EIFAELVAYNGGQLPTCLA
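Before pasting sequences like these into a web server, it can help to check that they parse cleanly. The following is a minimal FASTA reader, written as a sketch: `parse_fasta` is a name chosen here, and the toy input stands in for the sequences above.

```python
def parse_fasta(text):
    """Return {header: sequence}, joining wrapped lines and dropping
    any whitespace inside the sequence lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Remove stray spaces that sometimes creep into copied sequences.
            records[header].append("".join(line.split()))
    return {h: "".join(parts) for h, parts in records.items()}


toy = ">s1\nMKN TL\nLK\n>s2\nMGG\n"
sequences = parse_fasta(toy)
for name, seq in sequences.items():
    print(name, len(seq))
```

Printing the lengths is a quick sanity check that no sequence was truncated or merged with its neighbour.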
ClustalW at EBI
Many options:
CPU mode,
full/fast alignment,
window length in fast mode,
…
gap penalties.
ClustalW at EBI
Automatic display of:
Score table
Alignment (optional colouring)
Guide tree
Link to Jalview alignment editor!
A note on the example
It is atypical:
It uses only three sequences.
One should use more in order to extract reliable
information.
It illustrates a common mistake:
It uses sequences that are too closely related.
One should use sequences that are as divergent and diverse
as possible in order to extract relevant information.
A Good Multiple Alignment?
Difficult to define…
Good ones look pretty!
Aligned secondary structures
Strongly conserved residues / regions
Comparison with known structure helps
Bad ones look chaotic and random.
A Good Multiple Alignment?
[Figure: alignment with conservation, quality and consensus annotation tracks]
Multiple Alignment Features
Barton (1993)
“The position of insertions and deletions suggests
regions where surface loops exist…
Conserved glycine or proline suggests a β-turn…
Residues with hydrophobic properties conserved at i,
i+2, i+4 (etc) separated by unconserved or hydrophilic
residues suggests a surface β-strand…
A short run of hydrophobic amino acids (4 or 5 residues)
suggests a buried β-strand…
Pairs of conserved hydrophobic amino acids separated by
pairs of unconserved or hydrophilic residues suggests an
α-helix with one face packed in the protein core. Similarly,
an i, i+3, i+4, i+7 pattern of conserved residues.”
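The i, i+2, i+4 pattern for a surface β-strand can be searched for mechanically. This is a rough sketch only: the hydrophobic residue set is one common choice among several, a column counts as conserved-hydrophobic only if every residue in it is hydrophobic, and both function names are introduced here for illustration.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumption: classifications vary


def conserved_hydrophobic(columns):
    """True for each column made up only of hydrophobic residues."""
    return [all(res in HYDROPHOBIC for res in col) for col in columns]


def alternating_runs(flags, min_hits=3):
    """Start indices where flags are True at i, i+2, i+4, ... (min_hits
    times) and False at the positions in between."""
    span = 2 * min_hits - 1
    starts = []
    for i in range(len(flags) - span + 1):
        window = flags[i:i + span]
        if all(window[k] == (k % 2 == 0) for k in range(span)):
            starts.append(i)
    return starts


# Each string is one alignment column (toy data).
columns = ["LLV", "KDE", "IIL", "SNT", "VVI", "KRE", "DDN"]
flags = conserved_hydrophobic(columns)
print(alternating_runs(flags))
```

A hit is only a hint that a strand may be present; it should be checked against a known structure where one exists.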
Multiple Alignment Features
Cysteine is a rare amino acid, and is often
used in disulphide bonds (pairs of
conserved cysteines).
Charged residues (histidine, aspartate,
glutamate, lysine, arginine) and other polar
residues embedded in a conserved region
indicate functional importance.
Quality Assessment
Bad residues
Large distance from column consensus
Bad columns
Average distance from consensus is high
– “entropy”
Bad regions
Profile scores
Bad quality doesn’t always mean
badly aligned!
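One common way to put a number on column "entropy" is the Shannon entropy of the residues in that column. The sketch below assumes this definition and treats a gap as just another symbol; it is not necessarily the quality measure any particular editor displays.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def alignment_columns(sequences):
    """Turn a list of equal-length aligned sequences into column strings."""
    return ["".join(col) for col in zip(*sequences)]


toy = ["ACDA", "ACEC", "ACDG", "ACET"]
for index, column in enumerate(alignment_columns(toy)):
    print(index, column, column_entropy(column))
```

Fully conserved columns score zero and the most variable columns score highest, but, as noted above, a high score does not by itself prove a misalignment.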
Quality Assessment
Profiles
A profile holds scores for each residue type (plus gaps)
over every column of a multiple alignment
Concepts:
• Consensus sequence
• Amino acid similarity
Some multiple alignment programs use profiles to build or
add to an alignment
Any alignment, or even one sequence, can be a profile
(one sequence isn’t a very good one…)
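A stripped-down profile can be built by counting symbols per column. In this sketch the profile holds relative frequencies, not substitution-matrix scores as real profile methods typically use; `build_profile` and `consensus` are illustrative names.

```python
from collections import Counter


def build_profile(alignment):
    """Per-column relative frequencies of every symbol (residues and gaps)."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({sym: n / len(column) for sym, n in counts.items()})
    return profile


def consensus(profile):
    """Most frequent symbol in each column."""
    return "".join(max(col, key=col.get) for col in profile)


alignment = ["ACD-", "ACE-", "GCDK"]
profile = build_profile(alignment)
print(consensus(profile))
```

With a single input sequence every column has one symbol at frequency 1, which shows why a one-sequence profile carries little information.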
What can we do with a multiple
alignment?
Identify subgroups (phylogeny)
Intra-group sequence conservation
Evolutionary relatedness (view tree)
Identify motifs (functionality)
Evolutionary signals
Highly conserved residues indicate
functional or structural significance!
Widen search for related proteins
A multiple alignment (MA) is better than a single sequence
Consensus sequence / profile useful
RPDDWHLHLR
GGIDTHVHFI
GFTLTHEHIC
PFVEPHIHLD
PKVELHVHLD
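The conserved positions in the five aligned fragments above can be picked out programmatically. This is a minimal sketch that reports only columns where every sequence has the same residue; `conserved_columns` is a name introduced here.

```python
motif_block = [
    "RPDDWHLHLR",
    "GGIDTHVHFI",
    "GFTLTHEHIC",
    "PFVEPHIHLD",
    "PKVELHVHLD",
]


def conserved_columns(alignment):
    """0-based indices of columns in which all residues are identical."""
    return [i for i, col in enumerate(zip(*alignment)) if len(set(col)) == 1]


print(conserved_columns(motif_block))  # [5, 7]
```

The two invariant histidines, separated by one variable position, are the kind of signal that suggests functional or structural significance.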
What do we want to do?
Build a homology model?
Accuracy
Perform phylogenetic analysis?
Completeness
Functional analysis of a protein family?
Diversity
Building the initial alignment
Fetch related sequences and run alignment
ClustalW, DiAlign, T-Coffee, MUSCLE …
Fetch a multiple alignment from a database
and add sequences of interest
Pfam, ProDom, ADDA …
Start from a motif-finding procedure
MEME, Pratt, Gibbs Sampler …
Adjusting the alignment
1. Filter alignment:
Remove any redundancy
Remove unrelated sequences
Remove unwanted domains
Recalculate alignment if necessary
2. Look for conserved motifs, adjust any
misalignments. Try different colour schemes and
thresholds.
3. One step at a time…
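The redundancy-removal part of step 1 can be sketched as a greedy filter on pairwise identity. The 90% cut-off, the function names and the dict-based alignment format are all assumptions chosen for illustration; a suitable threshold depends on the purpose of the alignment.

```python
def identity(a, b, gap="-"):
    """Fraction of identical residues over columns with no gap in either."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    return sum(x == y for x, y in pairs) / len(pairs) if pairs else 0.0


def remove_redundant(alignment, max_identity=0.9):
    """Greedy filter: keep a sequence only if it is at most max_identity
    identical to every sequence already kept.
    alignment is {name: aligned_sequence}."""
    kept = {}
    for name, seq in alignment.items():
        if all(identity(seq, other) <= max_identity for other in kept.values()):
            kept[name] = seq
    return kept


toy = {
    "a": "ACDEFGHIKL",
    "b": "ACDEFGHIKL",   # identical to "a", so it is dropped
    "c": "ACDEYYYYYY",
}
print(list(remove_redundant(toy)))
```

After filtering, columns that now contain only gaps should be removed, or the alignment recalculated, as the list above suggests.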
Jalview Alignment Editor
Clamp, M., Cuff, J., Searle, S. M. and Barton, G. J. (2004), "The Jalview Java Alignment Editor", Bioinformatics, 20, 426-7.
Colouring your alignment
Hydrophobic / polar: hydrophobic vs. polar
Buried index: buried vs. surface
β-strand likelihood: probable vs. unlikely
Helix likelihood: probable vs. unlikely
Colouring your alignment
By conservation thresholds:
Colouring your alignment
Conservation index
Amino acid property classification schema, e.g. Livingstone & Barton (1993)
Sequence Features
Check PDB Structures
Load MA with sequence(s) for known PDB structure
View >> Feature Settings >> Fetch DAS Features (wait...)
OR
Right-click >> Associate Structure with Sequence >> Discover
PDB ids (quicker)
Right-click sequence name >> View PDB Entry
Structure opens in new window – residues acquire MA
colours
Highlight residues by hovering mouse over alignment or
structure
Label residues by clicking on structure
Compare Alignment to Structure
Crucial way of checking alignment!
Where are gaps / insertions / deletions?
In secondary structures: bad
In surface loops: okay
Where are our key / functional residues?
Are they in probable active site?
Check they are clustered
Check they are accessible, not buried
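The gap check can also be done outside the viewer if a per-column secondary-structure string is available for the known structure. This sketch assumes such a string has already been mapped onto the alignment columns (H for helix, E for strand, anything else for loop); the input format and the function name are hypothetical.

```python
def gaps_in_elements(aligned_seq, column_ss, gap="-", elements="HE"):
    """Column indices where the sequence has a gap inside a helix or strand
    of the reference structure."""
    if len(aligned_seq) != len(column_ss):
        raise ValueError("sequence and structure strings must be equal length")
    return [
        i
        for i, (res, ss) in enumerate(zip(aligned_seq, column_ss))
        if res == gap and ss in elements
    ]


# Toy example: C marks coil/loop columns.
print(gaps_in_elements("AC-D-FG-", "HHHHCCEE"))
```

Gaps reported here fall inside secondary-structure elements and deserve a closer look; gaps in loop columns are not reported.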
Demonstration and Practice
1. Start Jalview (click here)
2. Tools >> Preferences >>
Visual
select Maximise Window, unselect Quality, set Font Size to 8 or 9,
Colour >> Clustal, uncheck Open File
Editing
check Pad Gaps When Editing
3. File >> Input Alignment >> from URL (use this one)
4. Get used to the controls – selecting and deselecting
sequences/groups (drag mouse), dragging sequences/groups (use shift/ctrl),
selecting sequence regions, hiding sequences/groups, removing columns and
regions… Then explore menus and tools.
5. Now load this alignment – I’ve messed up a good
alignment, and now I’d like you to correct it! There are two
groups of sequences and one single sequence to adjust.
Demonstration and Practice
6. View >> Feature Settings >> DAS Settings
select Uniprot, dssp, cath, Pfam, PDBsum_ligands, PDBsum_DNAbinding,
then click ‘Save as default’
click Fetch DAS Features (then click yes at prompt) ...
Move mouse over alignment and read information about features
Move mouse over sequence names to check for PDB ids
7. Open a PDB structure (choose any)
8. View >> uncheck Show All Chains, then use up-arrow key
to increase structure size.
9. Hover mouse over structure (see how residues are
highlighted in the sequence), then do same for sequence.
Select residues in the structure by clicking them – a label
will appear. Click again to remove label.
10. Check position of insertions & deletions using this method.