Protein Analysis Workshop 2006
Pairwise and multiple
sequence alignments
Alain Schenkel
Tuomas Hätinen
Bioinformatics group
Institute of Biotechnology
University of Helsinki
Overview
 Motivation – Why alignments?
 Sequence comparison
   Dotplot
   The alignment problem
 Pairwise alignment algorithms
   Exact algorithms
   Heuristic algorithms
   Database searches
 Multiple sequence alignments
 Web tools:
   Build alignments using SRS or EBI server,
   Blast at NCBI, EBI,
   PairsDB,
  …
Motivation
 Proteins perform most of the functions required in biological systems:
   Signaling (kinases, ...)
   Enzymes (proteases, …)
   Structural (collagen, elastin, …)
   Immune system (antibodies, ...)
   Storage and transport (hemoglobin, …)
  …
 Large amounts of information are available in current databanks.
Goal: we want to extrapolate information about the function of a newly discovered sequence by comparing it to annotated sequences.
Does it make sense?
 All functional information is ultimately contained within the sequence.
 Proteins are evolutionarily related:
   Selective pressure is on function, and thus on residues with a functional role (eg: active site or structural key residues are conserved).
   Modular nature of proteins.
 Two sequences have the same structure if corresponding residues are similar enough on the physico-chemical level.
Application of sequence alignments
 Determining function of newly discovered genetic or protein
sequences.
 Identification of functional patterns/domains.
 Predicting structure of proteins.
 Determining evolutionary relationships among genes, proteins,
and entire species.
Aligning and comparing sequences, and searching
databases for similar sequences – a cornerstone of
bioinformatics!!
Sequence Comparison
• Alignment
• Dotplots
• The pairwise alignment problem
Pairwise alignment
Pairwise alignment = identification of residue-residue correspondence.
?????      101 AGVIGTILLISYGIRRLIKKSPSDVKP 115
               ||:||.|||::|..|||.|:.|:||.|
GLP_HORSE   60 AGIIGIILLLAYVSRRLRKRPPADVPP  86
For the alignment to be meaningful, the correspondence should
reflect the functional, or evolutionary, …, relationship (if any).
What criteria should we use to obtain biologically meaningful
alignments?
Some terminology
 Identity:
   percentage of pairs of identical residues between two aligned sequences.
 Similarity:
   percentage of pairs of similar residues between two aligned sequences.
   one must define what similar means. Eg:
    - as observed in well-studied, evolutionarily related protein families,
    - physico-chemical amino acid properties: hydropathy, size, …
 Homology:
   two sequences are homologous if and only if they have a common ancestor.
   it's either yes or no.
   not to be confused with similarity!
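The definitions of identity and similarity can be sketched in a few lines. This is only an illustration: the similarity groups are an assumption (borrowed from the Clustal 'strong' groups), the function names are hypothetical, and percentages are taken over the aligned, ungapped pairs, which is one of several conventions in use.

```python
# Hypothetical definition of "similar": both residues belong to one of these groups.
SIMILAR_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]


def is_similar(a, b):
    """True if the residues are identical or share a similarity group."""
    return a == b or any(a in group and b in group for group in SIMILAR_GROUPS)


def identity_and_similarity(row1, row2):
    """Return (% identity, % similarity) over the ungapped columns of an alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    pairs = [(a, b) for a, b in zip(row1, row2) if a != "-" and b != "-"]
    identical = sum(a == b for a, b in pairs)
    similar = sum(is_similar(a, b) for a, b in pairs)
    return 100.0 * identical / len(pairs), 100.0 * similar / len(pairs)


# The two aligned fragments from the pairwise alignment example.
pct_id, pct_sim = identity_and_similarity(
    "AGVIGTILLISYGIRRLIKKSPSDVKP",
    "AGIIGIILLLAYVSRRLRKRPPADVPP",
)
print(f"identity {pct_id:.1f}%, similarity {pct_sim:.1f}%")
```

Homology cannot be computed this way: it is a statement about ancestry, for which identity and similarity are only evidence.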
Dotplots
 The simplest way of comparing two sequences:
   A dot is placed where both sequence elements are identical.
 Gives an overview of all possible alignments.
 Each diagonal indicates a possible (ungapped) alignment.
[Figure: dotplot of Sequence 1 against Sequence 2; dots mark identical residues.]
One possible alignment:
ATCTTCGAT
   | ||||
---TACGAT
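As a rough sketch of the idea, a dotplot is just a boolean matrix. The function names below are illustrative, and the two sequences are the ones from the alignment example above.

```python
def dotplot(seq1, seq2):
    """One row per residue of seq2, one column per residue of seq1; True marks a dot."""
    return [[a == b for a in seq1] for b in seq2]


def render(seq1, seq2):
    """Draw the dotplot as text, with seq1 along the top and seq2 down the side."""
    lines = ["  " + seq1]
    for residue, row in zip(seq2, dotplot(seq1, seq2)):
        lines.append(residue + " " + "".join("*" if dot else "." for dot in row))
    return "\n".join(lines)


matrix = dotplot("ATCTTCGAT", "TACGAT")
print(render("ATCTTCGAT", "TACGAT"))
```

The alignment shown above corresponds to the diagonal that starts at column 4 of the first sequence: every dot on that diagonal is one of its matching positions.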
Filtering Out the Noise in Dotplots
 Dots may be scored according to a sliding window and a similarity
cutoff to reduce noise:
Window size = 5, Similarity cutoff = 3
LETVHKKLYAGQYQNAGQFCDDIWLMLDNA
| |   ||  ||||   |  || |||  |
LSTIKRKLDTGQYQEPWQYVDDVWLMFNN
[Figure: the corresponding dotplot, with LETVHK… along the horizontal axis and LSTIKR… along the vertical axis.]
 The smaller the window, the more noise.
 With large windows, the sensitivity for short sequences is reduced.
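A minimal sketch of window filtering, under the assumption that the "similarity" of a window is simply the number of identical residues in it (real tools such as Dotmatcher score windows with a substitution matrix). Names and defaults are illustrative.

```python
def windowed_dotplot(seq1, seq2, window=5, cutoff=3):
    """Place a dot at (row j, column i) if the windows starting at seq2[j]
    and seq1[i] share at least `cutoff` identical positions."""
    rows = len(seq2) - window + 1
    cols = len(seq1) - window + 1
    plot = []
    for j in range(rows):
        row = []
        for i in range(cols):
            matches = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            row.append(matches >= cutoff)
        plot.append(row)
    return plot


plot = windowed_dotplot(
    "LETVHKKLYAGQYQNAGQFCDDIWLMLDNA",
    "LSTIKRKLDTGQYQEPWQYVDDVWLMFNN",
    window=5,
    cutoff=3,
)
for row in plot:
    print("".join("*" if dot else "." for dot in row))
```

With these settings the window GQYQN/GQYQE (4 identities) produces a dot, while LETVH/LSTIK (2 identities) does not.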
Using Dotmatcher from SRS
 SRS at EBI: http://srs.ebi.ac.uk/
 SRS at EMBnet Austria: http://emb2.bcc.univie.ac.at:8080/srs/
 ... or any servers listed at http://downloads.lionbio.co.uk/publicsrs.html
Check out the SRS version (bottom of page): different versions index
different databases, so the search results might be different
depending on the version.
DotmatcherP (for proteins)
 Enter sequences in FASTA format!
 Advanced options: change default window size, threshold score and scoring matrix.
DotmatcherP
Comparing a protein with itself.
Eg: Drosophila melanogaster SLIT
 Identification of repeated protein domains
DotmatcherP
Comparing two different sequences:
 Identification of conserved protein domains.
 Using the default parameters window size = 10 and
threshold = 23:
DotmatcherP
 If we lower the window size and the threshold, we
observe lots of noise.
 Eg, with window size = 5, threshold = 10:
Another Dotplot server: Dotlet
 Has more options and provides more flexibility than Dotmatcher.
 Some very useful features:
   If only one sequence is entered, dotlet automatically compares it against itself (finding repeats, low complexity regions, etc.).
   Same application for both nucleic acid and protein sequences.
   When comparing nucleic acid to nucleic acid, dotlet will reverse complement one of the sequences and perform a second comparison. This makes it possible, eg, to see structures like stem-loops.
   It is possible to compare a protein to a nucleic acid sequence. The nucleic acid sequence is translated in the three forward frames and pixels are set to the highest of the scores. This makes it possible, eg, to detect introns/exons, frameshifts, etc.
Dotlet
At http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
Let´s find repeated domains
in the following sequence :
> SLIT_DROME (P24014):
MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIP
GGGVGVITEARCPRVCSCTGLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQR
LTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVITTVGRRVFKGAQSLRSLQLDNNQ
ITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSWLSRF
LRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGI
VDCREKSLTSVPVTLPDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDAL
SGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLLLNANEISCIRKDAFRDLHSLSLLSLYD
NNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCESPKRMHRR
RIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTT
ELLLNDNELGRISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEI
SNKMFLGLHQLKTLNLYDNQISCVMPGSFEHLNSLTSLNLASNPFNCNCHLAWFAECVRKK
SLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCTCTGTVVACSRNQ
LKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLS
TLIISYNKLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDC
GLKWFSDWIKLDYVEPGIARCAEPEQMKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQP
CQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNATCTVLEEGRFSCQCAPG
YTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCAN
GAKCMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKY
CEGHNMISMMYPQTSPCQNHECKHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHN
NSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAVELFNGRIRVSYDVGNHPVST
MYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDP
AQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQD
FMDETPHIKEEPVDPCLENKCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPT
VTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGNQCCAAKIVRRRKVRMVCSNNRKY
IKNLDIVRKCGCTKKCY
1. Enter sequence
2. Enter name for sequence (optional)
3. Repeat for second sequence (optional)
4. Select scoring matrix, window size and zoom
5. Click ”compute”!
Each pixel corresponds to a residue in the horizontal sequence and to a residue in the vertical sequence.
The pixel's colour depends on how similar the two sequences are around these two positions.
Tuning of the grayscale in order to make background noise disappear.
Residues that match well in the alignment are coloured blue.
Possible to scroll the dotplot here.
Possible to scroll the alignment here.
Dotlet reverse complements one of the sequences → stem-loops can be detected.
Dotplot - Summary
 Comparing a sequence with itself can be used to identify:
   Repeated domains,
   Regions of low complexity (eg, …GYCAAAAAAAAALK…).
 Comparing two protein sequences can be used to identify:
   Local regions of similarity,
   Conserved protein domains.
Dotplot - Summary
 Good:
   visual detection of feature/similarity,
   exploring the sequence organisation.
 Bad:
   resolving regions of low similarity,
   does not provide an alignment (no insertions/deletions).
To obtain an alignment, we need a method for lining up the diagonals in a dotplot.
[Figure: dotplot of GATCTA against GATCA and the resulting alignment:]
GATCTA
GATC_A
The Pairwise Alignment Problem
 Line up diagonals by edit operations:
   substitution (mutation)
   gap or indel (insertion/deletion)
seq1 IGTILLISYGIRRLIKKSPSDVKP----LPSPDTDVP
     || |||  |  ||| |  | || |     || |   |
seq2 IGIILLLAYVSRRLRKRPPADVPPPASTVPSADAPPP
In this alignment, mismatched columns are substitutions and the run of dashes is a gap (a deletion in seq1 or, equivalently, an insertion in seq2).
But there are many ways to align 2 sequences → we need to score alignments to decide which is the best.
Scoring the Edit Operations
 For example:
   identical: +10 (it's good)
   substitution: +2 for S-A, -1 for K-P, …
   gap: -3
PSDVKP--P
| || |  |
PADVPPPAP
Score: +50+2-1+2*(-3) = 45
Choosing an appropriate scoring scheme: this is where biological information is introduced (eg, reward the evolutionarily most likely alignment).
Standard notation:
 | for identical
 : for very similar (eg, size and hydropathy)
 . for somewhat similar (eg, size or hydropathy)
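The arithmetic of the scoring example can be reproduced with a short sketch. The function name and the `default_substitution` parameter are illustrative assumptions; only the three scores quoted above (+10, +2 for S-A, -1 for K-P, -3 per gap position) come from the example.

```python
def score_alignment(row1, row2, identical=10, gap=-3,
                    substitution=None, default_substitution=0):
    """Sum the scores of the edit operations, column by column."""
    substitution = substitution or {}
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap                      # gap (insertion/deletion)
        elif a == b:
            total += identical                # identical residues
        else:                                 # substitution, order-independent lookup
            total += substitution.get(frozenset((a, b)), default_substitution)
    return total


# Only the two substitution scores given in the example are defined here.
SUBSTITUTION = {frozenset("SA"): 2, frozenset("KP"): -1}

example_score = score_alignment("PSDVKP--P", "PADVPPPAP", substitution=SUBSTITUTION)
print(example_score)  # 5 identities, S-A, K-P and two gap positions: 50 + 2 - 1 - 6 = 45
```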
Gap penalty
 Few long gaps:
TIL--------LISYGIRRLIK
TILKKSPSDVKLISYGIRRLIK
is better than many small gaps:
IG-TI--LYDL-SYYAG---IR
IGKIIPRL--LVAY--VLIGSR
 Different scores for
   gap opening, eg: -5
   gap extension, eg: L(-1), with L = length of extension
 gap opening > gap extension
TIL--------LISYGIRRLIK
TILKKSPSDVKLISYGIRRLIK
gap score = -5 -6
Gap penalty
 One can also consider a special penalty for gaps at the end/beginning of the alignment (eg, zero penalty).
 Need to be careful in adjusting the gap score to the substitution score:
   too strong a penalty → no gaps,
   too weak a penalty → too many gaps.
 Insertions and deletions have been found to occur in nature at a significantly lower frequency than substitutions.
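A small sketch of why separate opening and extension penalties favour few long gaps. It assumes one particular convention, which is not necessarily the one used by any given server: each gap pays the opening penalty once plus the extension penalty for every gap position. Totals computed under other conventions will differ; the names are illustrative.

```python
import re


def gap_lengths(row):
    """Lengths of the runs of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", row)]


def gap_penalty(rows, opening=-5, extension=-1):
    """Total gap penalty, assuming opening + extension * length for each gap."""
    total = 0
    for row in rows:
        for length in gap_lengths(row):
            total += opening + extension * length
    return total


few_long = ("TIL--------LISYGIRRLIK", "TILKKSPSDVKLISYGIRRLIK")
many_small = ("IG-TI--LYDL-SYYAG---IR", "IGKIIPRL--LVAY--VLIGSR")

print(gap_penalty(few_long))    # one gap of length 8
print(gap_penalty(many_small))  # six gaps covering 11 positions
```

Under this assumed convention the six short gaps are penalised about three times as heavily as the single long one, even though they cover only a few more positions.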
Residue Substitution
 A substitution score for each aa pair → a substitution matrix.
 Most used: based on evolutionary relationship.
 Two types:
   PAM series,
   BLOSUM series.
PAM (Percent Accepted Mutation)
 PAM1: observed mutations in carefully selected sets of closely related proteins (1572 observed changes in 71 families). (1978)
 Idea: observed substitutions are the result of 1 mutation (not many).
 PAMn: iterate PAM1 n times to obtain the substitution rate between more divergent sequences.
[Figure: the PAM250 matrix.]
Use when:
PAM:        0    30   80   110  200  250
%identity:  100  75   60   50   25   20
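"Iterating PAM1 n times" amounts to raising a mutation probability matrix to the n-th power. The sketch below uses a made-up 3-letter alphabet with invented probabilities, purely to show the mechanics; it is not the real PAM1 data, and real PAM matrices are additionally converted to log-odds scores.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(matrix, n):
    """Apply the one-step mutation matrix n times."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


# Hypothetical one-step matrix: each row gives the probabilities that a residue
# stays the same (diagonal, 99%) or mutates into each of the other two.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.002, 0.008, 0.990],
]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print(["%.3f" % p for p in row])
```

After 250 steps the diagonal entries are much smaller than 0.99: most positions are expected to have changed, which is why high PAM numbers suit divergent sequences.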
BLOSUM (BLOck Substitution Matrix)
 Based on a larger data set than PAM.
 More recent than PAM. (1992)
 Different approach from PAM:
   not based on an explicit evolutionary model,
   observed aa substitutions in a set of conserved aa patterns called blocks.
 BLOSUMn: from blocks which are n% identical.
 BLOSUM62: empirically shown to be among the best at detecting weak similarity.
[Figure: the BLOSUM62 matrix.]
Tips for using substitution matrices
 Generally, BLOSUM matrices perform better than PAM for local similarity searches.
 For database searches, the most commonly used matrix is BLOSUM62.
 When comparing closely related proteins, one should use lower PAM or higher BLOSUM matrices; for distantly related proteins, higher PAM or lower BLOSUM matrices:
  Less divergent   →   More divergent
  BLOSUM 80     BLOSUM 62     BLOSUM 45
  PAM 1         PAM 120       PAM 250
 Caution: substitution matrices are statistical in nature. In a given alignment, a substitution may or may not correspond to an actual mutation.
Pairwise alignment algorithms
• Exact algorithms
• Heuristic algorithms
• Database scanning
Pairwise Alignment Algorithms
 Given a scoring scheme, an alignment algorithm tries to find the best alignment between 2 sequences according to that scheme.
 Exact algorithms:
   guaranteed to return an alignment with the best possible score.
 Heuristic algorithms:
   not guaranteed to return the best alignments,
   but they are quicker (and hopefully still return good alignments).
 Two types of alignment:
   Global: forced over the entire length of 2 sequences.
   Local: between substrings of 2 sequences.
Global vs Local Alignment
 Global alignments:
   are sensitive to gap penalties,
   do not take into account the modular nature of proteins,
   can be used to compare 2 proteins with the same function (in, eg, human/mouse).
 Local alignments: are sensitive to the modular nature of proteins. They can be used to:
   look for conserved domains or motifs in 2 proteins,
   search for local similarities in large sequences,
   search databases,
   scan an entire genome with a short sequence.
Exact Algorithms: Dynamic Programming
How can we find the best alignment between 2 sequences?
 Exhaustive search among all possible alignments is not possible (eg, for 2 sequences of 100 and 95 residues: 55 million alignments with 5 gaps).
 Problem solved by dynamic programming:
  1. initialize top row and left column,
  2. compute best local scores iteratively,
  3. keep track of where each best local score comes from,
  4. trace back from the best global score to obtain the best alignments.
 Several best solutions may exist: an alignment reported to you may be one among a number of possibilities.
Example of 2 best solutions:
ATTCTCTGA
-TAC--TGA

ATTCTCTGA
-TA--CTGA
The example is from www.pasteur.fr
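The four dynamic programming steps can be sketched as follows. This is a minimal global alignment in the Needleman-Wunsch style, assuming a deliberately simple scoring scheme (match +1, mismatch -1, linear gap -1) instead of a substitution matrix; names and defaults are illustrative. Rather than storing pointers (step 3), the traceback re-derives where each score came from, and it returns only one of the possibly many best alignments.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1):
    """Return (best global score, aligned s, aligned t)."""
    n, m = len(s), len(t)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Step 1: initialize left column and top row with leading gaps.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Step 2: each cell is the best of substitution, gap in t, gap in s.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Steps 3-4: trace back from the bottom-right cell.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if s[i - 1] == t[j - 1] else mismatch):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return score[n][m], "".join(reversed(a)), "".join(reversed(b))


best, row1, row2 = needleman_wunsch("GATCTA", "GATCA")
print(best)
print(row1)
print(row2)
```

The work grows with the product of the two sequence lengths, which is what makes the exact approach feasible for a pair of sequences but costly for a whole database.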
Global Alignment Servers (Exact Algorithm)
Use the Needleman-Wunsch algorithm (1970).
 Server at SRS: NeedleP (http://srs.ebi.ac.uk/ → Tools).
 Server at EBI: EMBOSS-Align.
   Let's submit the following sequences to http://www.ebi.ac.uk/emboss/align/index.html:
>uniprot|P35858|ALS_HUMAN Insulin-like growth factor-binding protein complex
MALRKGGLALALLLLSWVALGPRSLEGADPGTPGEAEGPACPAACVCSYDDDADELSVFC
SSRNLTRLPDGVPGGTQALWLDGNNLSSVPPAAFQNLSSLGFLNLQGGQLGSLEPQALLG
LENLCHLHLERNQLRSLALGTFAHTPALASLGLSNNRLSRLEDGLFEGLGSLWDLNLGWN
SLAVLPDAAFRGLGSLRELVLAGNRLAYLQPALFSGLAELRELDLSRNALRAIKANVFVQ
LPRLQKLYLDRNLIAAVAPGAFLGLKALRWLDLSHNRVAGLLEDTFPGLLGLRVLRLSHN
AIASLRPRTFKDLHFLEELQLGHNRIRQLAERSFEGLGQLEVLTLDHNQLQEVKAGAFLG
LTNVAVMNLSGNCLRNLPEQVFRGLGKLHSLHLEGSCLGRIRPHTFTGLSGLRRLFLKDN
GLVGIEEQSLWGLAELLELDLTSNQLTHLPHRLFQGLGKLEYLLLSRNRLAELPADALGP
LQRAFWLDVSHNRLEALPNSLLAPLGRLRYLSLRNNSLRTFTPQPPGLERLWLEGNPWDC
GCPLKALRDFALQNPSAVPRFVQAICEGDDCQPPAYTYNNITCASPPEVVGLDLRDLSEA
HFAPC
>uniprot|O08770|GPV_RAT Platelet glycoprotein V precursor (GPV) (CD42D).
MLRSVLLSAVLSLVGAQPFPCPKTCKCVVRDAVQCSGGSVAHIAELGLPTNLTHILLFRM
DRGVLQSHSFSGMTVLQRLMLSDSHISAIDPGTFNDLVKLKTLRLTRNKISHLPRAILDK
MVLLEQLFLDHNALRDLDQNLFQKLLNLRDLCLNQNQLSFLPANLFSSLGKLKVLDLSRN
NLTHLPQGLLGAQIKLEKLLLYSNRLMSLDSGLLANLGALTELRLERNHLRSIAPGAFDS
LGNLSTLTLSGNLLESLPPALFLHVSWLTRLTLFENPLEELPEVLFGEMAGLRELWLNGT
HLRTLPAAAFRNLSGLQTLGLTRNPLLSALPPGMFHGLTELRVLAVHTNALEELPEDALR
GLGRLRQVSLRHNRLRALPRTLFRNLSSLVTVQLEHNQLKTLPGDVFAALPQLTRVLLGH
NPWLCDCGLWPFLQWLRHHLELLGRDEPPQCNGPESRASLTFWELLQGDQWCPSSRGLPP
DPPTENALKAPDPTQRPNSSQSWAWVQLVARGESPDNRFYWNLYILLLIAQATIAGFIVF
AMIKIGQLFRTLIREELLFEAMGKSSN
[Screenshot: EMBOSS-Align form, with options for gap penalties and the choice of scoring matrix.]
NeedleP at SRS
[Screenshot: NeedleP form, with options for gap penalties and the choice of scoring matrix (optional).]
Local Alignment Servers (Exact Algorithm)
 Server at EMBnet: LALIGN, uses the SIM algorithm (1991).
   http://www.ch.embnet.org/software/LALIGN_form.html
 Server at SRS: http://srs.ebi.ac.uk/ → Tools.
   WaterP. Uses the Smith-Waterman algorithm (1981).
   MatcherP. Can be used to find various local alignments between 2 sequences. Slower than WaterP.
 Server at EBI (Smith-Waterman algorithm).
   http://www.ebi.ac.uk/emboss/align/index.html
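For comparison with the global case, here is a minimal sketch of local alignment in the Smith-Waterman style. It assumes a toy scoring scheme (match +2, mismatch -1, linear gap -2); names and defaults are illustrative and do not reflect what WaterP or LALIGN use. The two differences from global alignment are that scores never drop below zero and that the traceback starts at the highest-scoring cell and stops at a zero.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-2):
    """Return (best local score, aligned part of s, aligned part of t)."""
    n, m = len(s), len(t)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # A local alignment may start anywhere, hence the floor at 0.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score reaches 0.
    a, b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
        if score[i][j] == diag:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


local_score, part1, part2 = smith_waterman("TTTACGTTT", "GGACGGG")
print(local_score, part1, part2)
```

Only the shared substring is reported; the unrelated flanking residues are ignored rather than forced into the alignment.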
Heuristic Algorithms
 Motivations:
   Exact algorithms are exhaustive but computationally expensive.
   Exact algorithms are impractical for comparing a query sequence to millions of other sequences in a database (database scanning),
   and so database scanning requires faster alignment algorithms (at the cost of optimality).
Heuristic Algorithms
 Probing a database with a query is similar to aligning a query with a very long sequence.
→ need fast local alignment methods.
 Main idea:
   Use dynamic programming, but limited to (sub-)sequences which are likely to produce interesting alignments with the query.
   Heuristic part of the algorithm: eliminate uninteresting sequences from the search (need to make a guess).
 Algorithms:
   FASTA: Lipman-Pearson (1985).
   BLAST (Basic Local Alignment Search Tool): Altschul et al. (1990).
BLAST Overview
 Many versions for different query-database cases (→ and ← denote translation of a nucleotide sequence into protein):
   blastp:  protein - protein
   blastn:  nucleotide - nucleotide
   blastx:  nucleotide → protein - protein
   tblastn: protein - protein ← nucleotide
   tblastx: nucleotide → protein - protein ← nucleotide
 Comes in many flavours.
 Fast and reliable.
 Easy to use.
BLAST Overview
 BLAST computes “an alignment”, not necessarily the exact optimal alignment.
 Given the query and the database (long sequence):
   Find all words of length k (typical: k=4) that match the query with a high enough score.
   Look for subsequences in the database that contain these words.
   Extend subsequences to see if the match score can be increased.
   Compute total score when no more extensions are possible.
 Rank the alignments.
How should the different matched (sub-)sequences be ranked?
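The seed-and-extend idea can be sketched as below. This is a strongly simplified illustration, not BLAST itself: it assumes exact word matches only (BLAST also accepts similar words scoring above a threshold), ungapped extension with a toy +1/-1 score, and a single subject sequence instead of a database. All names are hypothetical, and k=3 is used only because the toy sequences are short.

```python
def _extend(query, subject, qi, si, step, match, mismatch):
    """Best ungapped extension in one direction: returns (score gain, length)."""
    gain = best_gain = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        gain += match if query[qi] == subject[si] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        qi += step
        si += step
    return best_gain, best_len


def seed_and_extend(query, subject, k=3, match=1, mismatch=-1):
    """Return hits as (score, query start, subject start, length), best first."""
    # Index every word of length k in the subject.
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    hits = set()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):       # seed: shared word
            right_gain, right_len = _extend(query, subject, i + k, j + k, 1, match, mismatch)
            left_gain, left_len = _extend(query, subject, i - 1, j - 1, -1, match, mismatch)
            hits.add((k * match + left_gain + right_gain,
                      i - left_len, j - left_len,
                      left_len + k + right_len))
    return sorted(hits, reverse=True)


hits = seed_and_extend("MKVLAAGIVG", "PPKVLAAGPP")
print(hits)
```

Here four different seed words (KVL, VLA, LAA, AAG) all extend to the same segment KVLAAG, so a single hit is reported. Dynamic programming is never run over the parts of the subject that share no word with the query, which is where the speed comes from.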
Significance of Alignments
 Scores cannot be used directly to rank alignments:
   a bad but long alignment may have a higher score than a good but short alignment.
 We need a normalized scoring scheme that allows us to compare alignments and evaluate their biological significance.
 Idea:
   Probe the database with random sequences.
   This gives a distribution of scores (it follows the extreme-value distribution).
   Establish a threshold for significance.
Extreme-Value Distribution
[Figure: score distribution for random sequences (x-axis: score), with the score of our query marked; the P-value is the probability that the score of our query is no better than random.]
Difficulty: finding a significance threshold.
Quantifying the Significance of Alignments
For an alignment with raw score S:
 P-value:
   The probability of an alignment occurring with score S or better if the aligned-against sequence is random.
   The lower the P-value, the more significant the alignment.
 E-value:
   Expected number of alignments with scores equivalent to or better than S to occur by chance only.
   The lower the E-value, the more significant the alignment.
   E-value = P-value * size of database.
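A rough sketch of how a P-value can be estimated empirically by scoring the query against shuffled versions of a sequence, and how the E-value follows from it. The score used here (the largest number of identities on any ungapped diagonal) is a stand-in chosen for brevity, and all names are illustrative; real programs fit an extreme-value distribution rather than counting.

```python
import random


def best_diagonal_score(query, subject):
    """Toy score: largest number of identities on any ungapped diagonal."""
    best = 0
    for offset in range(-len(query) + 1, len(subject)):
        matches = sum(
            1
            for i, residue in enumerate(query)
            if 0 <= i + offset < len(subject) and subject[i + offset] == residue
        )
        best = max(best, matches)
    return best


def shuffled_scores(query, subject, trials=200, seed=0):
    """Scores of the query against randomly shuffled copies of the subject."""
    rng = random.Random(seed)
    residues = list(subject)
    scores = []
    for _ in range(trials):
        rng.shuffle(residues)
        scores.append(best_diagonal_score(query, "".join(residues)))
    return scores


def empirical_p_value(observed, random_scores):
    """Fraction of random scores that are at least as good as the observed one."""
    return sum(score >= observed for score in random_scores) / len(random_scores)


def e_value(p_value, database_size):
    """E-value = P-value * size of database."""
    return p_value * database_size


observed = best_diagonal_score("ACGTAC", "TTACGTACTT")
p = empirical_p_value(observed, shuffled_scores("ACGTAC", "TTACGTACTT"))
print(observed, p, e_value(p, 1000))
```

The same P-value is less impressive in a larger database: the E-value grows in proportion to the number of sequences searched.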
Rough Guide for P-values and E-values
 P-value (reported by many programs): 0 ≤ P-val ≤ 1
  P ≤ 10^-100              Exact match
  10^-100 < P < 10^-50     Sequences very nearly identical, e.g.: alleles or SNPs
  10^-50 < P < 10^-10      Closely related sequences, homology certain
  10^-5 < P < 10^-1        Usually distant relatives
  P > 10^-1                Match probably insignificant
 E-value (reported by some programs, eg PSI-Blast): 0 ≤ E-val ≤ size of database
  E ≤ 0.02                 Sequences probably homologous
  0.02 ≤ E ≤ 1             Homology can’t be ruled out
  E > 1                    This match would be obtained by chance
Heuristic Algorithms Servers
 Pairwise alignment:
   BLAST: http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi
 Database screening:
   FASTA: http://www.ebi.ac.uk/fasta33/ , SRS, …
   BLAST:
    - SRS (at EBI or ...)
    - http://www.ncbi.nlm.nih.gov/BLAST/
    - http://www.ebi.ac.uk/blast/index.html
    - http://www.ch.embnet.org/software/bBLAST.html
    - http://www.ch.embnet.org/software/aBLAST.html
 Evaluating the significance of an alignment:
   PRSS: http://www.ch.embnet.org/software/PRSS_form.html
BLAST Servers
 Blast has many options:
   choice of database, substitution matrix, …
   basic or advanced section.
 BLAST interfaces are different:
   NCBI: excellent help pages and tutorial
   SRS: easy multiple alignment access
   EMBnet: simple text + graphical output.
Remark: there is a server with a powerful implementation of Smith-Waterman for database screening: http://www.ebi.ac.uk/MPsrch/. It runs about 50 times slower, but is more sensitive and returns fewer false positives than Blast.
BLAST at NCBI
Let´s submit the query sequence
>1IGR:A INSULIN-LIKE GROWTH FACTOR RECEPTOR
EICGPGIDIRNDYQQLKRLENCTVIEGYLHILLISKAEDYRSYR
FPKLTVITEYSLGDLFPNLTVIRGWKLFYNYALVIFEMTNLKDI
GLYNLRNITRGAIRIEKNADLCYLSTVDWSLILDAVSNNYIVGN
KPPKECGDLCPGTMEEKPMCEKTTINNEYNYRCWTTNRCQKMCP
STCGKRACTENNECCHPECLGSCSAPDNDTACVACRHYYYAGVC
VPACPPNTYRFEGWRCVDRDFCANILSAESSDSEGFVIHDGECM
QECPSGFIRNGSQSMYCIPCEGPCPKVCEEEKKTKTIDSVTSAQ
MLQGCTIFKGNLLINIRRGNNIASELENFMGLIEVVTGYVKIRH
SHALVSLSFLKNLRLILGEEQLEGNYSFYVLDNQNLQQLWDWDH
RNLTIKAGKMYFAFNPKLCVSEIYRMEEVTGTKGRQSKGDINTR
NNGERASCESDVDDDDKEQKLISEEDLN
at http://www.ncbi.nlm.nih.gov/BLAST/
We paste our sequence here and launch the search.
[Screenshots of the NCBI BLAST form and results: choice of substitution matrix; conserved domains; graphical overview of hits, coloured according to similarity; list of hits; alignment for each of the hits.]
Bit score: S’
The value S’ is derived from the raw alignment score S, but statistical properties of the scoring system have been taken into account. Because bit scores are normalised w.r.t. the scoring system, they can be used to compare alignment scores from different searches.
E value: Expectation value.
Expected number of alignments with scores equivalent to or better than S to occur by chance. The lower the E value, the more significant the score.
NCBI Blast output help:
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Blast_output.html
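The relation between raw score, bit score and E value follows the Karlin-Altschul statistics: S’ = (λS − ln K) / ln 2 and E = m · n · 2^(−S’), with m the query length and n the database length. The sketch below only illustrates these two formulas. λ and K depend on the scoring matrix and gap penalties, so the values used in the example are placeholders; the real ones are printed in the statistics section of a BLAST report. Function names are illustrative.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalised score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expected_hits(bits, query_length, database_length):
    """E value for a bit score: E = m * n * 2 ** (-S')."""
    return query_length * database_length * 2.0 ** (-bits)


# Placeholder statistical parameters; read the real ones from the BLAST output.
LAMBDA, K = 0.267, 0.041

bits = bit_score(100, LAMBDA, K)
print(round(bits, 1), expected_hits(bits, 300, 1_000_000))
```

Each additional bit halves the E value, independently of which matrix produced the raw score.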
BLAST at SRS EBI
SRS EBI: View results using BlastAlignment
Alignments are displayed
BLAST at EMBnet
Graphic output on/off
BLAST Variants
 PHI-Blast: Pattern-Hit Initiated Blast:
   Searches for proteins that contain a specified pattern AND are similar to the query sequence in the neighbourhood of the pattern.
   Patterns must follow the syntax of PROSITE.
 PSI-Blast: Position-Specific Iterated Blast:
   More sensitive, ie better at detecting distant relationships, than BLAST.
   Computes position-specific substitution matrices (PSSMs) to score matches between query and database sequences. (Blast uses precomputed substitution matrices, eg BLOSUM62.)
PSI-BLAST
 Repeatedly searches the target databases.
 At each round:
   compute a multiple alignment of high-scoring sequences to generate a new PSSM for the next round of searching.
 Iterates until no new sequences are found (or until a maximal number of iterations is reached).
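To make the PSSM idea concrete, here is a much-simplified sketch of how a position-specific matrix can be derived from a multiple alignment. It assumes a uniform background frequency, a fixed pseudocount and no sequence weighting, none of which matches what PSI-Blast actually does; the names are illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_rows, pseudocount=1.0):
    """One dict per alignment column, mapping each amino acid to a log-odds score."""
    background = 1.0 / len(AMINO_ACIDS)          # assumed uniform background
    pssm = []
    for column in zip(*aligned_rows):
        counts = Counter(residue for residue in column if residue in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Score an ungapped segment against the matrix, position by position."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


pssm = build_pssm(["ACD", "ACE", "ACD", "A-D"])
print(round(score_with_pssm(pssm, "ACD"), 2), round(score_with_pssm(pssm, "WWW"), 2))
```

Unlike BLOSUM62, the score of a residue here depends on the position: an A is rewarded in the first column only because the aligned sequences all have an A there.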
Rules of thumb for pairwise alignment
 Use server defaults in the absence of any other information.
 Adjust the substitution matrix to the expected divergence of the 2 sequences. Use BLOSUM62 if there is no a priori information.
 For distantly related sequences, use PSI-Blast rather than BLAST.
 There are many ways of aligning 2 sequences:
   A returned alignment is not the absolute truth.
   Inspect the alignment from the biologist's perspective.
PairsDB
 A database of pre-computed Blast and Psi-Blast
alignments.
 Continually updated.
 Source databases: Uniprot, PDB, EMBL, Worm
database, ENSEMBL, NCBI genomes, RefSeq.
PairsDB thus provides a quick and easy way to explore
protein sequences and their relationships.
PairsDB
[Figure: the PairsDB pipeline]
Seq databases (Uniprot, PDB, ...) → remove redundancy at 90% → NRDB90 → BlastP all-on-all → a set of alignments
NRDB80, NRDB70, ..., NRDB40, NRDB30 → Psi-Blast all-on-all → a set of alignments
NRDB90: non-redundant database at 90%, etc.
PairsDB: http://www.csc.fi/cgi-bin/pairsdb/pairsdb.cgi
PairsDB
Multiple sequence alignment
• Motivation
• Algorithms overview
• Clustalw
• Clustal-X
Multiple Sequence Alignment
 Given a set of N ≥ 3 sequences, we want to find the best
way of aligning these sequences simultaneously.
 A multiple alignment does not reflect the level of pairwise
similarity between pairs of sequences.
[Figure: excerpt of a multiple alignment of seven sequences containing long gapped regions.]
Motivations
 Pairwise sequence alignment is easy with sufficiently closely related sequences.
 Below a certain level of identity, sequence alignment may become uncertain:
   twilight zone for aa sequences ~ 30%.
 In or below the twilight zone it is good to make use of additional information, eg, from evolution.
Motivations
 A multiple alignment of diverse sequences is more informative than a pairwise alignment:
   residues conserved over a longer period of time are under stronger evolutionary constraints.
 Reasons for aligning sets of sequences:
   organize data to reflect sequence homology,
   estimate evolutionary distance,
   infer phylogenetic trees from homologous sites,
   highlight variable and conserved sites/regions,
   determine substitution frequencies,
   pattern/domain identification,
   helpful for protein structure prediction.
An alignment of 8 fragments of immunoglobulin:
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--
 Alignment highlights:
   Conserved residues: one of the cysteines forming the disulphide bridges, and the tryptophan.
   Conserved regions (e.g. Q.PG).
   Patterns (e.g.: dominance of hydrophobic residues at positions 1 and 3). The alternating hydrophobicity pattern is typical for a surface beta-strand at the beginning of each fragment.
Consensus Sequence
 Simplest form: a single sequence which represents the most common amino acid/base in each position.
Y    D    D    G    A    V    -    E    A    L
Y    D    G    G    -    -    -    E    A    L
F    E    G    G    I    L    V    E    A    L
F    D    -    G    I    L    V    Q    A    V
Y    E    G    G    A    V    V    Q    A    L
------------------------------------------------
Y    D    G    G    A/I  V/L  V    E    A    L
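The simplest form of consensus can be computed column by column. In this sketch, gaps are ignored when counting and ties are reported with a slash, as in the example above; both choices, and the function name, are assumptions made for illustration.

```python
from collections import Counter


def consensus(aligned_rows):
    """Most common residue in each column; ties are joined with '/'."""
    result = []
    for column in zip(*aligned_rows):
        counts = Counter(residue for residue in column if residue != "-")
        if not counts:                 # column made of gaps only
            result.append("-")
            continue
        top = max(counts.values())
        result.append("/".join(r for r, c in counts.items() if c == top))
    return result


rows = ["YDDGAV-EAL", "YDGG---EAL", "FEGGILVEAL", "FD-GILVQAV", "YEGGAVVQAL"]
print(" ".join(consensus(rows)))  # Y D G G A/I V/L V E A L
```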
Multiple Sequence Alignments Algorithms
 Multiple sequence alignment uses heuristic methods only:
   With dynamic programming, computational time quickly explodes as the number of sequences increases.
 Different methods/algorithms:
   Segment-based (DiAlign, T-Coffee…).
   Iterative (HMMs, SAGA, DiAlign, PRRP, …).
   Progressive (Clustalw, T-Coffee, PileUp, …).
 ClustalW:
   First described by D.G. Higgins and P.M. Sharp (1988).
   Can be used for nucleotide or amino acid sequences.
Clustalw Algorithm
 Step 1: Calculate all pairwise alignments and calculate distances for all pairs of sequences.
 Step 2: Construct a guide tree joining the most similar sequences using Neighbour Joining.
Step 1 (pairwise distances):
     A  B  C  D  E
  B  2
  C  4  4
  D  6  6  6
  E  6  6  6  4
  F  8  8  8  8  8
Step 2: [Figure: guide tree built from the distance matrix.]
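How a distance matrix turns into a guide tree can be sketched with the matrix of Step 1. Note the swap: Clustalw uses Neighbour Joining, whereas this sketch uses the simpler UPGMA (average-linkage) clustering, chosen only because it fits in a few lines. Names are illustrative.

```python
from itertools import combinations

NAMES = "ABCDEF"
# Lower triangle of the distance matrix from Step 1 (row: distances to A, B, ...).
ROWS = {"B": [2], "C": [4, 4], "D": [6, 6, 6], "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
DISTANCE = {
    frozenset((row, NAMES[col])): value
    for row, values in ROWS.items()
    for col, value in enumerate(values)
}


def average_distance(leaves_a, leaves_b, distance):
    """Mean distance between the leaves of two clusters."""
    total = sum(distance[frozenset((a, b))] for a in leaves_a for b in leaves_b)
    return total / (len(leaves_a) * len(leaves_b))


def upgma(names, distance):
    """Repeatedly join the two closest clusters; returns the tree as nested tuples."""
    clusters = [(name, (name,)) for name in names]    # (subtree, leaves)
    while len(clusters) > 1:
        (tree_a, leaves_a), (tree_b, leaves_b) = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(pair[0][1], pair[1][1], distance),
        )
        clusters = [c for c in clusters if c[1] not in (leaves_a, leaves_b)]
        clusters.append(((tree_a, tree_b), leaves_a + leaves_b))
    return clusters[0][0]


guide_tree = upgma(NAMES, DISTANCE)
print(guide_tree)
```

A and B (distance 2) are joined first, then C joins them and D joins E (distance 4), and F comes last (distance 8). A progressive aligner would align the sequences in that order.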
Clustalw Algorithm
 Step 3: From the tree assign weights for each sequence:
   We want to down-weight nearly identical sequences and up-weight the most divergent ones.
 Step 4: Align sequences, starting at the leaves of the guide tree:
   Pairwise comparisons as well as comparison of a single sequence with a group of sequences (profile).
Clustalw Algorithm
 Some features:
   Amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned.
   Reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure.
Insertions and deletions are more common in loop regions than in the core of the protein!
Clustalw
 Clustalw is not optimal.
 There are known areas in which Clustalw performs badly, for example:
   errors introduced early cannot be corrected by subsequent information,
   alignments of sequences of differing lengths cause strange guide trees and unpredictable effects.
 Also try other programs, which are slower but can be better depending on the situation:
   T-Coffee: http://www.ch.embnet.org/software/TCoffee.html
   DiAlign: http://dialign.gobics.de/
   POA: http://www.bioinformatics.ucla.edu/poa/
   SAGA
   ... and more at http://helix.nih.gov/apps/bioinfo/msa.html.
ClustalW Servers
 Servers:
   EBI: http://www.ebi.ac.uk/clustalw/
   SRS: eg, http://srs.ebi.ac.uk/ → tools → multiple alignments
   EMBnet: http://www.ch.embnet.org/software/ClustalW.html
 Let’s build a multiple alignment for the following sequences:
>query
MKNTLLKLGVCVSLLGITPFVSTISSVQAERTVEHKVIKNETGTISISQLNKNVW
VHTELGYFSGEAVPSNGLVLNTSKGLVLVDSSWDDKLTKELIEMVEKKFKKRV
TDVIITHAHADRIGGMKTLKERGIKAHSTALTAELAKKNGYEEPLGDLQSVTNLK
FGNMKVETFYPGKGHTEDNIVVWLPQYQILAGGCLVKSASSKDLGNVADAYV
NEWSTSIENVLKRYGNINLVVPGHGEVGDRGLLLHTLDLLK
>gi|2984094
MGGFLFFFLLVLFSFSSEYPKHVKETLRKITDRIYGVFGVYEQVSYENRGFISNAY
FYVADDGVLVVDALSTYKLGKELIESIRSVTNKPIRFLVVTHYHTDHFYGAKAFR
EVGAEVIAHEWAFDYISQPSSYNFFLARKKILKEHLEGTELTPPTITLTKNLNVYLQ
VGKEYKRFEVLHLCRAHTNGDIVVWIPDEKVLFSGDIVFDGRLPFLGSGNSRTWL
VCLDEILKMKPRILLPGHGEALIGEKKIKEAVSWTRKYIKDLRETIRKLYEEGCDVE
CVRERINEELIKIDPSYAQVPVFFNVNPVNAYYVYFEIENEILMGE
>gi|115023|sp|P10425|
MKKNTLLKVGLCVSLLGTTQFVSTISSVQASQKVEQIVIKNETGTISISQLNKNVW
VHTELGYFNGEAVPSNGLVLNTSKGLVLVDSSWDNKLTKELIEMVEKKFQKRVTD
VIITHAHADRIGGITALKERGIKAHSTALTAELAKKSGYEEPLGDLQTVTNLKFGNTK
VETFYPGKGHTEDNIVVWLPQYQILAGGCLVKSAEAKNLGNVADAYVNEWSTSIE
NMLKRYRNINLVVPGHGKVGDKGLLLHTLDLLK
>gi|115030|sp|P25910|
MKTVFILISMLFPVAVMAQKSVKISDDISITQLSDKVYTYVSLAEIEGWGMVPSNGM
IVINNHQAALLDTPINDAQTEMLVNWVTDSLHAKVTTFIPNHWHGDCIGGLGYLQR
KGVQSYANQMTIDLAKEKGLPVPEHGFTDSLTVSLDGMPLQCYYLGGGHATDNIV
VWLPTENILFGGCMLKDNQATSIGNISDADVTAWPKTLDKVKAKFPSARYVVPGH
GDYGGTELIEHTKQIVNQYIESTSKP
>gi|282554|pir||S25844
MTVEVREVAEGVYAYEQAPGGWCVSNAGIVVGGDGALVVDTLSTIPRARRLAEWV
DKLAAGPGRTVVNTHFHGDHAFGNQVFAPGTRIIAHEDMRSAMVTTGLALTGLWP
RVDWGEIELRPPNVTFRDRLTLHVGERQVELICVGPAHTDHDVVVWLPEERVLFAGD
VVMSGVTPFALFGSVAGTLAALDRLAELEPEVVVGGHGPVAGP EVIDANRDYLRWV
QRLAADAVDRRLTPLQAARRADLGAFAGLLDAERLVANLHRAHEELLGGHVRDAM
EIFAELVAYNGGQLPTCLA
ClustalW at EBI
 Many options:
   CPU mode,
   full/fast alignment,
   window length in fast mode,
  …
   gap penalties.
ClustalW at EBI
 Automatic display of:
   Score table
   Alignment (optional colouring)
   Guide tree
 Link to Jalview alignment editor!
(More on Jalview at end of week.)
Running Clustalw from SRS (Columbia University)
Running Clustalw from SRS
View results using:
*complete entries*
View results using:
ClustalwAli
Clustal-X
 Windows or Linux interface for the ClustalW multiple sequence alignment program.
 Integrated environment for performing multiple sequence and profile alignments and analyzing the results.
 A versatile coloring scheme:
   makes it possible to highlight conserved features in the alignment,
   fully customizable.
 Gap penalty options are not as versatile as on the servers.
 Start with sequences in FASTA format (or an existing alignment in Clustal format).
 [Do Alignment] on the alignment menu.
Clustal-X
Clustal-X
Using Clustal-X
 Clustal X input: can read FASTA format (and 6 others)
 Output: alignment (coloured) and consensus sequence:
   * indicates a single, fully conserved residue
   : indicates that one of the following ‘strong’ groups is fully conserved:
    STA, NEQK, NHQK, NDEQ, QHRK, MILV, MILF, HY, FYW
   . indicates that one of the following ‘weaker’ groups is fully conserved:
    CSA, ATV, SAG, STNK, STPA, SGND, SNDEQK, NDEQHK, NEQHRK, FVLIM, HFY
   Residues are coloured by type by default, but the colouring scheme is customizable.
Source: ClustalX help, http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html
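The meaning of the three symbols can be sketched directly from the group lists above. The treatment of columns containing a gap (left blank) is an assumption, and the function names are illustrative; this is not Clustal X code.

```python
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "                                   # assumed: gapped columns stay blank
    if len(residues) == 1:
        return "*"                                   # single, fully conserved residue
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"                                   # all residues within one strong group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."                                   # all residues within one weaker group
    return " "


def conservation_line(aligned_rows):
    """Symbol line to print under an alignment."""
    return "".join(column_symbol(column) for column in zip(*aligned_rows))


toy_alignment = ["MKSA", "MRAA", "MKCG"]
for row in toy_alignment:
    print(row)
print(conservation_line(toy_alignment))
```

In the toy alignment the first column is identical (*), K/R fall in the strong group QHRK (:), and S/A/C and A/G are covered only by the weaker groups CSA and SAG (.).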
Using Clustal-X with JalView
Proteins: 1MBD (myoglobin), 4HHB-B (hemoglobin), 1ECD (hemoglobin)
• Feed sequences to Clustal-X → compute alignments, trees, ...
• Feed an alignment to JalView → edit the alignment.
The most hydrophobic residues according to this table are coloured red and the
most hydrophilic ones are coloured blue. The colours of the in between residues
are varying shades of purple according to whereabouts they are on the scale.
A note on the example
 It is atypical:
   It uses only three sequences.
   One should use more in order to extract reliable information.
 It illustrates a common mistake:
   It uses too closely related sequences.
   One should use sequences that are as divergent and diverse as possible in order to extract relevant information.
References
 Tutorials:
   Blast: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html
   Clustal-X: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html
 Sequence analysis:
   D.W. Mount: Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, 2004 (2nd edition)
  …