Transcript Chen-BLAST

BLAST: Database Search
Heuristic Algorithm
Some slides courtesy of Dr. Pevsner and Dr. Dirk
Husmeier
What is BLAST?
BLAST (Basic Local Alignment Search Tool)
allows rapid sequence comparison of a query
sequence against a database.
Smith-Waterman is rigorous and it is guaranteed
to find an optimal alignment.
But also time and space consuming. It is especially
inefficient in database searches.
BLAST provides a rapid alternative.o S-W
Why do we use BLAST?
To understanding the relatedness of any protein
or DNA sequence (query sequence) to other
known sequences (database)




Identify sequences with a common ancestor
(orthologs) and paralogs
Discover new genes or proteins
Explore protein structure and function
…
The BLAST Algorithm
S. F. Altschul, et al., 1997, Nucleic Acids
Research, 25:3389
“The central idea of the BLAST
algorithm is to confine attention
to segment pairs that contain a
word pair of length w with a score
of at least T.”
Altschul et al. (1990)
How the original BLAST algorithm works:
Step 1. size w words
in the query sequence
Look at the query sequence by a moving
window of size w
Example: for a human RBP query
…FSGTWYA… (query word is in yellow)
The moving window of words:
FSG SGT GTW TWY WYA
page 101
Step 1: compile a list of words
scoring at least T with query word
word hits
> threshold
(T=11)
word hits
< threshold
GTW
ASW
ATW
NTW
GTY
GNW
GAW
6,5,11
6,1,11
0,5,11
0,5,11
6,5,2
22
18
16
16
13
10
9
2. Scan the database for entries that contains any word from
the compiled hit list.
3. Extend: when you manage to find a “hit”
extend the hit in either direction.
Keep track of the score (use a scoring matrix)
Stop when the score drops below some cutoff.
KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)
MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)
extend
Hit!
extend
Alignment Score
It is important to assess the statistical significance
of search results.
For local alignments (including BLAST search results),
the scores follow an extreme value distribution (EVD).
E-value is closely related to the analysis of the
distribution of alignment score
Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the
statistical significance of molecular sequence features by using
general scoring schemes." Proc. Natl. Acad. Sci. USA 87:22642268.(PubMed)
Alignment score as a random walk
Max Score in an Excursion
Pi : frequency of residue i in 1st seq.
Pk’ : frequency of residue k in 2nd seq.
Protein Scoring
Dist’n of any excursion Y
an exponential distribution
The Max of n variables


Y1, Y2, …, Yn are identical &
independently distributed
Ymax is the max of the above all. Then
Prob(Ymax<y)=(Prob(Y<y))n
The Max of n Exponential
variables


Y1, Y2, …, Yn are independent
exponential variables
Ymax is the max of the above all. Then
Prob(Ymax<y)=(Prob(Y<y))n
=(1-e-λy)n
Prob(Ymax>y)=1-(Prob(Y<y))n
=1-(1-e-λy)n
In a database of n seq.s





Number of sequences: n
Y1, Y2,…, Yn are i.i.d. exponential
What happens when n is large?
Using a widely used rule: (1+x/n)nexp(x)
1-exp(-nCeλy)
Probability of scores/excursion higher than y
The distribution of Ymax follows an extreme value dist’n
the sum of a large number of independent
identically distributed (i.i.d) random variables
tends to a normal distribution,
0.40
0.35
probability
0.30
0.25
0.20
normal
distribution
0.15
0.10
0.05
0
-5
-4
-3
-2
-1
0
x
1
2
3
4
5
the maximum of a large number of i.i.d. random
variables tends to an extreme value distribution
0.40
0.35
probability
0.30
0.25
0.20
normal
distribution
extreme
value
distribution
0.15
0.10
0.05
0
-5
-4
-3
-2
-1
0
x
1
2
3
4
5
Expected number of better
scores/higher excursions
E-value
1-exp(-nCeλy) =
p-value
E = Kmn
-lS
e
E is the number of hits you would expect from your search
with scores greater than S where:
K is a constant
m is the size of the query
n is the size of the database being searched
l scales for the specific scoring matrix used (decay constant
from the extreme value distribution)
How to interpret BLAST: E values and p values
Very small E values are very similar to p values.
E values of about 1 to 10 are far easier to interpret
than corresponding p values.
E
10
5
2
1
0.1
0.05
0.001
0.0001
p=1-exp(-E)
0.99995460
0.99326205
0.86466472
0.63212056
0.09516258 (about 0.1)
0.04877058 (about 0.05)
0.00099950 (about 0.001)
0.0001000

End
Extreme value distribution

The distribution:

The area to the right of S:

Yev  exp  x  e  x


PS  x   1  exp  e  x


Scaling to a particular type of score:

PS  x   1  exp  e l  x  

where μ is the mode and λ is a scale
factor.
Extreme value distribution
Compute this

The distribution:
value for x=0.

Yev  exp  x  e  x


The area to the right of S:

Scaling to a particular type of score:

PS  x   1  exp  e  x


PS  x   1  exp  e l  x  

where μ is the mode and λ is a scale
factor.
Extreme value distribution
Compute this value for x = 0.

The distribution:
Solution: exp[-1] = 0.368

Yev  exp  x  e  x


The area to the right of S:

Scaling to PaS particular
type
l    of score:

 x   1  exp  e

PS  x   1  exp  e  x


x
where μ is the mode and λ is a scale
factor.
An example
You run BLAST and get a score of 45.
You then run BLAST on a shuffled
version of the database, and fit an
extreme value distribution to the
resulting empirical distribution. The
parameters of the EVD are μ = 25 and
λ = 0.693. What is the p-value
associated with 45?
An example
You run BLAST and get a score of 45. You then run BLAST on a
shuffled version of the database, and fit an extreme value distribution
to the resulting empirical distribution. The parameters of the EVD are
μ = 25 and λ = 0.693. What is the p-value associated with 45?
0.693 45 25



 1  exp  e
 1  exp  9.565 10 
PS  45  1  exp  e
13.86
7
 1  0.999999043
 9.565 10 7
Another example
You run BLAST and get a score of 23. You then run BLAST on a
shuffled version of the database, and fit an extreme value distribution
to the resulting empirical distribution. The parameters of the EVD are
μ = 20 and λ = 0.744. What is the p-value associated with 23?
Another example
You run BLAST and get a score of 23. You then run BLAST on a
shuffled version of the database, and fit an extreme value distribution
to the resulting empirical distribution. The parameters of the EVD are
μ = 20 and λ = 0.744. What is the p-value associated with 23?
0.744 23 20 

 1  exp  e
PS  23  1  exp  e
 2.232

 1  exp  0.10731
 1  0.8982
 0.1017

BLAST: optional parameters
You can...
• choose the organism to search
• turn filtering on/off
• change the substitution matrix
• change the expect (e) value
• change the word size
• change the output format
Choosing Gap Penalty
1.
2.
3.
4.
5.
Choice must be made corresponding to each type of scoring
system to place gaps where they will increase the overall
alignment score.
There are no hard and fast rules for choosing gap penalties.
Both g (opening penalty) and r (extension penalty) should be
non-zero.
The value of g + r should be greater than the maximum
score used for a match if insertions and deletions are
considered to be rarer than nucleotide substitutions.
The value of g strongly influences the number of gaps
introduced into a region separating two closely matching
regions.
Comparing Scoring Matrix
PAM



Homologous seq.s during
evolution
Based on extrapolation of a
small evol. Period
Track evolutionary origins
BLOSUM



Conserved blocks
Based on a range of
evol. Periods
Find conserved domains
Another way to compare

perform a search of a sequence
database with a known member of a
protein family and to find how many
members of the family are found. When
gap penalty was not considered, the
BLOSUM62 matrix outperformed the
PAM250 matrix in finding more
members of 504 different families on
the Prosite database.
BLAST Phase 3
Extension:
In the original (1990) implementation of BLAST,
hits were extended in either direction.
In a 1997 refinement of BLAST, two independent
hits are required. The hits must occur in close
proximity to each other. With this modification,
only one seventh as many extensions occur,
greatly speeding the time required for a search.
How a BLAST search works: threshold
You can modify the threshold parameter.
The default value for blastp is 11.
To change it, enter “-f 16” or “-f 5” in the
advanced options.
How to interpret a BLAST search: expect value
The expect value E is the number of alignments
with scores greater than or equal to score S
that are expected to occur by chance in a
database search.
An E value is related to a probability value p.
The key equation describing an E value is:
E = Kmn e-lS
E = Kmn e-lS
This equation is derived from a description
of the extreme value distribution
S = the score
E = the expect value = the expected number
of HSPs with a score >= S
m, n = the length of two sequences
l, K = Karlin Altschul statistics
Some properties of the equation E = Kmn e-lS
• The value of E decreases exponentially with increasing S
(higher S values correspond to better alignments). Very
high scores correspond to very low E values.
•The E value for aligning a pair of random sequences must
be negative! Otherwise, long random alignments would
acquire great scores
• Parameter K describes the search space (database).
• For E=1, one match with a similar score is expected to
occur by chance. For a very much larger or smaller
database, you would expect E to vary accordingly
From raw scores to bit scores
• There are two kinds of scores:
raw scores (calculated from a substitution matrix) and
bit scores (normalized scores)
• Bit scores are comparable between different searches
because they are normalized to account for the use
of different scoring matrices and different database sizes
S’ = bit score = (lS - lnK) / ln2
The E value corresponding to a given bit score is:
E = mn 2 -S’
Bit scores allow you to compare results between different
database searches, even using different scoring matrices.
How to interpret BLAST: E values and p values
The expect value E is the number of alignments
with scores greater than or equal to score S
that are expected to occur by chance in a
database search. A p value is a different way of
representing the significance of an alignment.
p = 1 - e -E