BLAST Workshop - Tel Aviv University

BLAST Workshop
Maya Schushan
June 2009
Workshop OUTLINE
Part 1:
• Introduction and motivation
• How does BLAST work?
Part 2:
• BLAST programs
• Sequence databases
• Work Steps
• Extract and analyze results
Why BLAST?
Finding homologues
• Homology: similarity between sequences that results from a common ancestor.
• Sequences look alike → probably have the same function and structure.
• Sequences sharing more than 25% identity (proteins) or 70% identity (nucleotides) will be considered homologous.
• Use a sequence as a search query in order to find homologous sequences in a database.
• Save time! Exploit the knowledge you have about the homologues to draw conclusions about your query.
Why BLAST?
Finding homologues
Identify sequence motifs
Why BLAST?
Finding homologues
Find out which regions are evolutionarily conserved → important for function and/or structure
Why BLAST?
Finding homologues
Construct phylogenetic trees → understand the evolution of the sequence’s family
Why BLAST?
Finding homologues
Infer function for a novel sequence → learn from previous data available for homologous sequences
Why BLAST?
Finding homologues
Find out if your protein sequence has a structure (or a close homologue has one…)
How does BLAST work?
What Is An Alignment?
Before we can understand how
BLAST works, we first have to
understand the principles of
sequence alignment….
How does BLAST work?
What Is An Alignment?
• Comparing 2 (pairwise) or more (multiple) sequences.
• Searching for a series of identical or similar
characters in the sequences.
VLSPADKTNVKAAWAKVGAHAAGHG
||| |    |   |||| |  ||||
VLSEAEWQLVLHVWAKVEADVAGHG
How does BLAST work?
What Is An Alignment?
A process of lining-up 2 or more sequences to achieve
maximum level of identity, in order to find homologies.
TCATG
CATTG
?
TCATG
CATTG
or
TCATG
CATTG
How does BLAST work?
What Is An Alignment?
S = ACTG
T = AGT
S’ = AC_TG S’ = ACTG S’ = ACTG
T’ = A_GT_ T’ = AGT_ T’ = _AGT
Good: Identical characters- match.
Bad: Different characters- mismatch; gap (InDel).
• Each pair of characters gets a value, depending on its identity.
•The similarity score of the alignment is the sum of pair values.
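The sum-of-pairs idea can be sketched in a few lines of Python, scoring the three candidate alignments of S and T shown above. The pair values (+1 for a match, -1 for a mismatch, -1 for a gap) and the function name `alignment_score` are assumptions chosen for illustration; the slides do not fix them.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, gap_char="_"):
    """Sum the per-column values of two equal-length aligned strings.

    The default values are illustrative assumptions, not BLAST defaults.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == gap_char or y == gap_char:
            total += gap        # a gap (InDel) in either sequence
        elif x == y:
            total += match      # identical characters
        else:
            total += mismatch   # different characters
    return total


# The three candidate alignments of S = ACTG and T = AGT from the slide.
candidates = [("AC_TG", "A_GT_"), ("ACTG", "AGT_"), ("ACTG", "_AGT")]
scores = [alignment_score(s, t) for s, t in candidates]
print(scores)
```

Under these assumed values the second alignment scores highest, so it would be the preferred one; a different choice of pair values could change the ranking.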
General Alignment Methodology
How does BLAST work?
What Is An Alignment?
Example: Aligning Two Globins
Human Hemoglobin (HH):
VLSPADKTNVKAAWGKVGAHAGYEG
Sperm Whale Myoglobin (SWM):
VLSEGEWQLVLHVWAKVEADVAGHG
How does BLAST work?
What Is An Alignment?
Example: Aligning Two Globins
No gaps:
• Percent identity: 36
• Percent similarity: 40
(HH)  VLSPADKTNVKAAWGKVGAHAGYEG
(SWM) VLSEGEWQLVLHVWAKVEADVAGHG
How does BLAST work?
What Is An Alignment?
Example: Aligning Two Globins
With gaps:
• Gaps: 2
• Percent identity: 45.833 (instead of 36 without gaps)
• Percent similarity: 54.167 (instead of 40 without gaps)
(HH)  VLSPADKTNVKAAWGKVGAH-AGYEG
(SWM) VLSEGEWQLVLHVWAKVEADVAGH-G
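The identity figures for the two globin alignments can be checked with a short sketch. It assumes that percent identity is the number of identical columns divided by the number of columns without a gap; the slides do not state the formula, but this convention gives 36 and 45.833. Percent similarity is not computed here because it depends on which substitutions count as similar.

```python
def percent_identity(a, b, gap_char="-"):
    """Percent of gap-free columns in which both sequences carry the same letter."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap (assumed convention).
    columns = [(x, y) for x, y in zip(a, b) if x != gap_char and y != gap_char]
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


hh_ungapped = "VLSPADKTNVKAAWGKVGAHAGYEG"
swm_ungapped = "VLSEGEWQLVLHVWAKVEADVAGHG"
hh_gapped = "VLSPADKTNVKAAWGKVGAH-AGYEG"
swm_gapped = "VLSEGEWQLVLHVWAKVEADVAGH-G"

print(round(percent_identity(hh_ungapped, swm_ungapped), 3))
print(round(percent_identity(hh_gapped, swm_gapped), 3))
```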
How does BLAST work?
What Is An Alignment?
Alignment Scoring
1. Assume independent mutation model
2. Score at each position
– Positive if the same/similar
– Negative if different or gap
3. The score of an alignment is the sum of the position scores
How does BLAST work?
What Is An Alignment?
Scoring Matrix
• An n × n matrix: n = 4 for DNA, n = 20 for proteins
• Each matrix entry defines the score for observing the two letters aligned with each other
– Positive if the pairing is likely
– Negative otherwise
    A   G   C   T
A   1
G  -5   1
C  -5  -5   1
T  -5  -5  -5   1
How does BLAST work?
What Is An Alignment?
DNA scoring matrices
• Transitions – purine to purine or pyrimidine to pyrimidine (4 possibilities)
• Transversions – purine to pyrimidine or pyrimidine to purine (8 possibilities)
• By chance alone, transversions should occur twice as often as transitions.
• In practice, transitions are more frequent than transversions.
How does BLAST work?
What Is An Alignment?
DNA scoring matrices
From/To   A    G    C    T
A         2
G        -4    2
C        -6   -6    2
T        -6   -6   -4    2
Match: 2; transition: -4; transversion: -6
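A small sketch shows how such a matrix follows from the purine/pyrimidine classes. It uses the values from the slide (match 2, transition -4, transversion -6); the function names are illustrative. It also counts the possible substitutions, which gives the 4 transitions and 8 transversions mentioned earlier.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Values taken from the slide's DNA scoring matrix.
SCORES = {"match": 2, "transition": -4, "transversion": -6}


def substitution_kind(x, y):
    """Classify a pair of bases as match, transition, or transversion."""
    if x == y:
        return "match"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def dna_score(x, y):
    """Look up the score for aligning base x with base y."""
    return SCORES[substitution_kind(x, y)]


# Count the ordered substitutions of each kind among the 4 bases.
counts = {"transition": 0, "transversion": 0}
for x in "AGCT":
    for y in "AGCT":
        kind = substitution_kind(x, y)
        if kind != "match":
            counts[kind] += 1

print(counts)
print(dna_score("A", "G"), dna_score("A", "C"), dna_score("T", "T"))
```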
How does BLAST work?
What Is An Alignment?
Proteins scoring matrices
• Observation: some substitutions
are more frequent than others,
e.g., chemically similar amino acids
• As for DNA, protein matrices
define the probabilities of change
between the different amino acids
• Popular matrices are based on
empirical data: PAM & BLOSUM
How does BLAST work?
What Is An Alignment?
PAM Matrices
• PAM matrices are based on sequences with 85% identity.
• The changes are “accepted” by natural selection
• 1 PAM unit:
the probability of 1 point mutation per 100 residues.
• Multiplying PAM1 by itself gives higher PAMs matrices that
are suitable for larger evolutionary distance.
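The "multiply PAM1 by itself" step can be illustrated with a toy example. The 3-letter matrix below is hypothetical (a real PAM1 is a 20 × 20 amino acid mutation probability matrix), and the helper names are assumptions. The sketch only shows that repeated multiplication yields a valid probability matrix in which each letter is less likely to stay unchanged.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Multiply m by itself n times, starting from the identity matrix."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical 3-letter "PAM1": each letter changes with probability 1%.
toy_pam1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]
toy_pam250 = mat_power(toy_pam1, 250)

for row in toy_pam250:
    print([round(p, 3) for p in row])
```

In the toy PAM250 the diagonal entries are much smaller than in the toy PAM1, reflecting a larger evolutionary distance. The published PAM scoring matrices are log-odds scores derived from such probabilities; that step is omitted here.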
How does BLAST work?
What Is An Alignment?
BLOSUM Matrices
• Based on the BLOCKS database
• Low BLOSUM numbers for distant sequences, high BLOSUM numbers for similar sequences
• BLOSUMn is based on sequences that share at least n percent identity. Generally:
BLOSUM62 for general use
BLOSUM80 for close relations
BLOSUM45 for distant relations
How does BLAST work?
What Is An Alignment?
Proteins scoring matrices
Closer sequences
PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45
Distant sequences
How does BLAST work?
How do we calculate gap scores?
- The same substitution scores are applied to gapped and ungapped local alignments.
- Appropriate gap scores have been selected over the years by trial and error → default gap scores.
- If you wish to apply a different scoring matrix, there is no guarantee that the gap scores will remain appropriate!
- A large penalty for opening a gap and a much smaller one for extending it are most effective.
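The opening/extension scheme (an affine gap penalty) can be written as a one-line formula. The sketch below assumes the convention that a gap of length k costs `gap_open + k * gap_extend`, with 11 and 1 as example values; both the convention and the numbers are assumptions for illustration, so check the settings of the program you actually use.

```python
def affine_gap_penalty(length, gap_open=11, gap_extend=1):
    """Cost of a single gap of the given length under an affine scheme.

    Assumed convention: cost = gap_open + length * gap_extend.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


one_long_gap = affine_gap_penalty(4)          # one gap spanning 4 positions
four_short_gaps = 4 * affine_gap_penalty(1)   # four separate 1-position gaps
print(one_long_gap, four_short_gaps)
```

With a large opening cost and a small extension cost, one long gap is much cheaper than several scattered short ones, which is the behaviour the last point describes.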
How does BLAST work?
What Is An Alignment?
Scoring
• The final score of the alignment is the sum of the positive scores and penalty scores:
Scoring matrix:
+ Number of identities
+ Number of similarities
Gap penalties:
- Number of gap insertions
- Number of gap extensions
= Alignment score
How does BLAST work?
BLAST
(Basic Local Alignment Search Tool)
• Goal: a fast search for homologues in a huge database
• The underlying hypothesis: when two sequences are similar, there are short ungapped regions of high similarity between them
• The heuristic:
1. Discard irrelevant sequences
2. Perform exact local alignment only with the remaining sequences
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) “Basic local alignment search tool.” J. Mol. Biol. 215: 403-410
How does BLAST work?
Searching a sequence database
•Idea:
In order to find homologous sequences to a sequence of
interest, one should compute its pairwise alignment against
all known sequences in a database, and detect the best
scoring significant homologs
•Query sequence - the sequence with which we are searching
•Hit – a sequence found in the database, suspected as
homologous
How does BLAST work?
The parameters
W : Word size – find W-mers in target/query (2-3 for aa, 6-11 for nucleotides)
T : Threshold – focus on pairs scoring >T (usually 11-13)
X : Drop-off – stop extending when loss >X
S : Score – the final score of segment pair
How does BLAST work?
The algorithm:
1. Align a query sequence with the database.
2. Find “hits”: short word pairs of length W with an ungapped alignment score of at least T.
3. Extend alignments until the score drops more than X below the best score so far. This step consumes most of the processing time (>90%).
How does BLAST work?
How do we discard irrelevant
sequences quickly?
• Divide the database into words of length w (default:
w = 3 for protein and w = 7 for DNA)
• Save the words in a look-up table that can be
searched quickly
WTDFGYPAILKGGTAC
WTD
TDF
DFG
FGY
GYP …
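A look-up table of this kind can be sketched with a dictionary that maps each word to the positions where it occurs. The function name `build_word_index` is an assumption for illustration; the sequence is the one from the slide.

```python
from collections import defaultdict


def build_word_index(sequence, w=3):
    """Map every overlapping word of length w to its start positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - w + 1):
        index[sequence[i:i + w]].append(i)
    return index


record = "WTDFGYPAILKGGTAC"
index = build_word_index(record, w=3)

# The first words, in order of position: WTD, TDF, DFG, FGY, GYP, ...
first_words = [record[i:i + 3] for i in range(5)]
print(first_words)
print(index["GYP"])
```

Looking up a word is then a single dictionary access, independent of the size of the database.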
How does BLAST work?
BLAST: discarding sequences
• When the user enters a query sequence, it is also divided
into words
• Search the database for consecutive neighboring words
• neighbor words are defined according to a scoring matrix
(e.g., BLOSUM62 for proteins) with a certain cutoff level
GFC (20)
GFB
GPC (11)
WAC (5)
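Neighbor-word generation can be sketched as follows: enumerate all words of the same length and keep those whose score against the query word reaches the cutoff. The scoring function below is a toy assumption (+6 for identical letters, -1 otherwise), not BLOSUM62, so the scores it produces do not match the numbers on the slide; the function names and the cutoff of 11 are also illustrative.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_pair_score(x, y):
    """Toy stand-in for a substitution matrix entry (assumption, not BLOSUM62)."""
    return 6 if x == y else -1


def neighbor_words(word, threshold, pair_score=toy_pair_score,
                   alphabet=AMINO_ACIDS):
    """Return {neighbor: score} for all words scoring at least `threshold`."""
    neighbors = {}
    for letters in product(alphabet, repeat=len(word)):
        score = sum(pair_score(a, b) for a, b in zip(word, letters))
        if score >= threshold:
            neighbors["".join(letters)] = score
    return neighbors


neighbors = neighbor_words("GFC", threshold=11)
print(len(neighbors), neighbors["GFC"], neighbors["GPC"], "WAC" in neighbors)
```

With a real substitution matrix, passed in through `pair_score`, chemically similar substitutions would score higher than dissimilar ones, so the neighbor list would no longer be simply "all words with at most one mismatch".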
How does BLAST work?
Look for a seed: hits on the same diagonal which can be connected
• At least 2 hits on the same diagonal, separated by a distance smaller than a predetermined cutoff
• This is the filtering stage: many unrelated hits are filtered out, saving lots of time!
[Figure labels: Query, Database record, Neighbor word]
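The two-hit filter can be sketched by grouping hits by diagonal, where the diagonal of a hit is its database position minus its query position. The hit list, the cutoff, and the function name below are hypothetical; details of the real implementation, such as requiring non-overlapping words, are left out.

```python
from collections import defaultdict


def two_hit_seeds(hits, max_distance):
    """Find pairs of adjacent hits on one diagonal closer than max_distance.

    `hits` is a list of (query_pos, db_pos) word matches.
    Returns a list of (diagonal, first_query_pos, second_query_pos).
    """
    by_diagonal = defaultdict(list)
    for query_pos, db_pos in hits:
        by_diagonal[db_pos - query_pos].append(query_pos)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for left, right in zip(positions, positions[1:]):
            if 0 < right - left < max_distance:
                seeds.append((diagonal, left, right))
    return seeds


# Hypothetical hits: three lie on diagonal 8, two of them close together.
hits = [(2, 10), (9, 17), (30, 5), (40, 48), (12, 50)]
seeds = two_hit_seeds(hits, max_distance=10)
print(seeds)
```

Only the pair at query positions 2 and 9 survives; the isolated hits are discarded without any extension being attempted.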
How does BLAST work?
Try to extend the alignment
• Stop extending when the score of the alignment
drops X beneath the maximal score obtained so far
• Discard segments with score < S
ASKIOPLLWLAASFLHNEQAPALSDAN
JWQEOPLWPLAASOIHLFACNSIFYAS
Score=15
Score=17
Score=14
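The drop-off rule can be sketched as an ungapped extension to the right of a seed. The scoring used here (+1 for a match, -1 for a mismatch) is an assumption for illustration, so the result does not reproduce the scores shown on the slide, which come from a real substitution matrix. The function name and the seed position are also illustrative.

```python
def extend_right(query, subject, q_start, s_start, x_drop,
                 match=1, mismatch=-1):
    """Extend without gaps until the score falls more than x_drop below the best.

    Returns (best_score, length_of_best_extension).
    """
    score = best = best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        same = query[q_start + i] == subject[s_start + i]
        score += match if same else mismatch
        i += 1
        if score > best:
            best, best_len = score, i      # new maximum: remember where it ends
        elif best - score > x_drop:
            break                          # dropped too far below the maximum
    return best, best_len


query = "ASKIOPLLWLAASFLHNEQAPALSDAN"
subject = "JWQEOPLWPLAASOIHLFACNSIFYAS"

# Start at the shared word "OPL" (position 4 in both sequences).
best, length = extend_right(query, subject, 4, 4, x_drop=2)
print(best, query[4:4 + length], subject[4:4 + length])
```

A larger X lets the extension bridge the two mismatches after "OPL" and reach "LAAS"; with `x_drop=1` the extension stops early and keeps only "OPL".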
How does BLAST work?
Two-Hit Gapped BLAST
The new gapped BLAST algorithm:
1. Start with the two-hit method:
(a) Find two hits with a score higher than T, within a distance A.
(b) Invoke an ungapped extension on the second hit.
2. If the HSP generated has a high enough score:
(a) Trigger a gapped extension.
(b) If the final score has a significant E-value, report the gapped alignment.
How does BLAST work?
The result – local alignment
• The result of BLAST will be a series of local alignments
between the query and the different hits found
How does BLAST work?
The scoring system
• By default, BLAST uses BLOSUM62 as the scoring matrix to perform the alignment.
How does BLAST work?
E-value
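• To assess the bit score, we calculate the E-value:
E-value = the expected number of HSPs with a score of at least S:
$E = K M N e^{-\lambda S}$
where M and N are the lengths of the query and the database, and K and λ are statistical parameters that depend on the scoring system.
• For each score S there is a specific E-value.
Small E-value → better score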
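The relationship between score and E-value can be explored numerically. The sketch below evaluates E = K·M·N·e^(-λS) for a few scores; the values of K and λ, the sequence lengths, and the function name are placeholders chosen for illustration, not parameters taken from the slides.

```python
import math


def expected_hits(score, query_len, db_len, k, lam):
    """E-value: expected number of HSPs scoring at least `score` by chance."""
    return k * query_len * db_len * math.exp(-lam * score)


# Placeholder parameters (assumptions for illustration only).
K, LAMBDA = 0.134, 0.318
QUERY_LEN, DB_LEN = 250, 50_000_000

for s in (40, 60, 80):
    print(s, expected_hits(s, QUERY_LEN, DB_LEN, K, LAMBDA))
```

The E-value falls exponentially as the score grows, and it grows linearly with the database size, so the same score is less significant in a larger database.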
How does BLAST work?
E-value
Theoretically, we could trust any result with an
E-value ≤ 1
In practice – BLAST uses estimations.
• E-values of 10^-4 and lower indicate a significant homology.
• E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous).
• E-values between 10^-2 and 1 do not indicate a good homology.
How does BLAST work?
PSI-BLAST
Step 1:
1. Run a standard protein-protein BLAST search (BLOSUM62)
2. Build a position-specific scoring matrix (PSSM) according to the MSA of the alignment results with low E-values.
Step 2:
1. Run a BLAST search using the PSSM to evaluate the alignment: PSSM vs. DB instead of seq vs. DB
2. Update the PSSM according to the new results
3. Go back to the beginning of step 2 or stop.
How does BLAST work?
PSI-BLAST
The difference
     1   2   3   4   5   6   7   8   9
A   .1  .3  .3  .2  .3  .1  .8   0  .3
D   .3  .3  .6  .2  .4  .2  .2  .1  .3
L   .6  .4  .1  .6  .3  .7   0  .9  .4
• The score for aligning a letter with a pattern position is given by the matrix itself!
• The matrix has the length of the original sequence (L × 20)
• No theory for deriving gap costs → gap scores are the same as in the 1st iteration
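Position-specific scoring can be sketched with the 3-letter, 9-position table from the slide. Summing the table entries along a sequence is an assumption made for illustration (real PSSMs typically hold log-odds scores over all 20 amino acids), and the function names are hypothetical.

```python
# Per-position values for the letters A, D, L, copied from the slide's table.
PSSM = {
    "A": [0.1, 0.3, 0.3, 0.2, 0.3, 0.1, 0.8, 0.0, 0.3],
    "D": [0.3, 0.3, 0.6, 0.2, 0.4, 0.2, 0.2, 0.1, 0.3],
    "L": [0.6, 0.4, 0.1, 0.6, 0.3, 0.7, 0.0, 0.9, 0.4],
}
LENGTH = 9


def profile_score(sequence, pssm=PSSM):
    """Sum the matrix entries for each letter at its own position."""
    if len(sequence) != LENGTH:
        raise ValueError("sequence must have the same length as the profile")
    return sum(pssm[letter][i] for i, letter in enumerate(sequence))


def consensus(pssm=PSSM):
    """Pick the highest-valued letter at every position."""
    return "".join(max(pssm, key=lambda letter: pssm[letter][i])
                   for i in range(LENGTH))


best = consensus()
print(best, round(profile_score(best), 2), round(profile_score("AAAAAAAAA"), 2))
```

The same letter contributes differently depending on where it sits: A adds 0.8 at position 7 but 0 at position 8, which a single substitution matrix cannot express.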
How does BLAST work?
The power of PSI-BLAST:
1. A much more sensitive scoring system: each position has its own pattern probabilities.
2. Different weights for conserved positions.
3. Important motifs are bounded.
4. Lowers the level of random noise.
5. Finds distant relatives.
How does BLAST work?
Let's sum up…
- BLAST is a fast way to find homologues
- There is no analytic theory that estimates the statistical significance of gapped alignments
- Gap scores have been selected by trial and error; applying a different scoring matrix → no guarantee for the gap scores
- PSI-BLAST finds weak homologues fast