Lecture4 - Department of Computer Science

Download Report

Transcript Lecture4 - Department of Computer Science

BLAST and searching sequence
databases
Dr Alexei Drummond
Department of Computer Science
[email protected]
BIOSCI 359, Semester 2, 2006
Sequence database searching
Sequence Homology
• Homologous protein or DNA
sequences share common
ancestry
– A statement of homology is
therefore an evolutionary
hypothesis
• Homology need not imply similar
function
• Homology is a binary property, a
pair of sequences are either
homologous or not homologous.
x
a, b
homologous
t
a
b
x
y
a, b not
homologous
– No such thing as degree of
homology
• Homology is often inferred by
sequence similarity
a
b
2
Sequence database searching
Orthology and paralogy
"Where the homology is a result of gene duplication so
that both copies have descended side by side during
the history of an organism, (for example, alpha and
beta hemoglobin) the genes should be called
paralogous (para=in parallel). Where the homology is
the result of speciation so that the history of the gene
reflects the history of the species (for example alpha
hemoglobin in man and mouse) the genes should be
called orthologous (ortho=exact). "
Fitch WM. Distinguishing homologous from analogous
proteins. Systematic Zoology 1970 Jun;19(2):99-113.
3
Sequence database searching
Orthology and paralogy
4
Sequence database searching
Orthology, paralogy and multigene families
Reproduced from NCBI education website
5
Sequence database searching
What are good scores for searching
databases?
• We want these scores to distinguish the related
sequences from the unrelated sequences
• So we select alignment parameters for
database searching that give us the best
distinguishing scores
• These may not be the parameters that will give
us the most accurate alignment
6
Sequence database searching
Database searching is an experiment
• “Database searching is the application of
knowledge gained from previous experiments
to the problem of discovering the
biochemistry and physiology of a newly
discovered gene or its protein.”
• It demands the same careful thought and
execution as your bench or laboratory
investigations!
• Garbage in, garbage out
7
Sequence database searching
Sources of previous knowledge
• Similarity scores - PAM, Blosum similarity
matrices
– Rescue us from having to assume that all amino
acid changes are equally likely and equally harmful
– Different similarity matrices are appropriate for
different degrees of evolutionary divergence
– More later
8
Sequence database searching
Second source of previous knowledge
• Computer Algorithm
– Dynamic programming is most sensitive and the
least selective
• Local, global, repeats, overlap
– BLAST and FASTA are much faster and more
selective which can be an advantage
– No program is always best at finding distantly
related sequences for all gene or protein
families, but dynamic programming is guaranteed to
give optimal alignment for given scores
9
Sequence database searching
Third source of previous knowledge
• Database itself
– Large store of previously acquired knowledge
– Making the best use of this knowledge can save
you many months of expensive laboratory
experimentation
– The size of this potential gain is the determining
factor in deciding how much effort to devote to any
particular database search
10
Sequence database searching
Database search assumptions
• The sequences sought have an evolutionary
ancestral sequence in common with the “query
sequence”
• All substitutions are not equally likely and
should be weighted to account for this
• Insertions and deletions are less likely than
substitutions and should be weighted to
account for this.
11
Sequence database
FASTA searching
FASTA
• 2 step algorithm
– (1) word search, using a specific word size, finds
regions with a high number of identical word
matches
– (2) Smith-Waterman alignment centered on these
regions and bounded by a window size which
limits the number of insertions or deletions one
sequence can accumulate with respect to the
other sequence
12
FASTA Heuristics
Sequence database
FASTA searching
• Heuristic approximation to Smith-Waterman
– Runs faster
– Loses some sensitivity
• Restrictions on the model of sequence
evolution
– First Heuristic - Word size parameter -usually 2 for
proteins and 6 for nucleic acids - FASTA constrains
the evolution between a pair of sequences to
preserve a number of unchanged dipeptides or
hexanucleotides
13
Sequence database
FASTA searching
FASTA Algorithm
• divide query sequence into its constituent
overlapping words of length two for proteins or
six for nucleic acids
• each sequence in the database is also broken
up in the same way
• Two word lists are compared to find all identical
words in both sequences
• An initial score is computed based on how
many identities are based within a small region
of the dot plot.
14
Sequence database
FASTA searching
FASTA Algorithm cont.
• If the initial score is high enough, a second score is
computed by evaluating which initial identities can be
joined into a consistent alignment using only gaps of
less then the window size.
• If the secondary score is high enough, then a SmithWaterman alignment is performed within the same
region of the dot plot defined by the concentrated
identities and using the same window-size.
• This third score is reported as the optimal score.
15
Sequence database
FASTA searching
Creating a Word List
g
g
c
c
c
t
t
t
t
g
g
g
g
g
g
g
g
g
g
g
a
a
a
a
a
a
a
a
g
g
c
a
t
a
a
a
a
a
a
g
g
g
g
g
g
g
g
g
c
c
c
a
a
t
16
Sequence database
FASTA searching
First Pathological Example
• Two proteins that share 50% identity - but the
proper alignment consists of alternating match
and mismatches.
• With a word size of two, there would be no
matches along the main diagonal of the dot plot
and the proper alignment would not be found.
17
Sequence database
FASTA searching
Second Pathological Example
• Two proteins that are almost identical, except
the second protein has a 20 residue insertion
into the middle of the sequence.
• If the window size is 15, then the SmithWaterman alignment phase of FASTA will align
the protein to either the sequence prior to or
following the insertion, thus missing the fact
that the proteins were basically identical (with
only one long insertion).
18
Sequence database
FASTA searching
Second Heuristic
• Window size is the second heuristic used by FASTA.
• Its effect is more variable then word size.
• If the best alignment, as defined by a full SmithWaterman analysis, goes outside the window then a
lower scoring alignment will be found by FASTA. This
will lead users to conclude the sequences are not
homologous when in fact they are and the homology
could have been inferred from a full Smith-Waterman
alignment.
• In practice these pathological cases are very unlikely.
However similar cases do occur and loss of sensitivity
caused by the use of these heuristics will be seen.
19
Sequence database
BLAST searching
BLAST
• Approximates a simplification of SmithWaterman known as the maximal segment
pairs (MSP) algorithm.
• MSP alignments do not allow gaps and are
specified by an equation that includes only
the first and fourth terms of the SmithWaterman equation.
• MSP alignment’s statistics are well
understood and so we can compute a
significance probability.
20
Sequence database
BLAST searching
Significance Probabilities
• Thus the evolutionary model requires that there be a
fairly long stretch of sequence that has evolved without
insertions or deletions, or at least with a complimentary
pattern of insertions and deletions that do not
significantly disrupt the alignment
• Recent advances in MSP statistics allow the use of
several independent segment alignments to be used in
evaluating significance probability.
21
Sequence database
BLAST searching
Brief Comparison of BLAST and FASTA
• BLAST is less sensitive than Smith-Waterman but
therefore more selective.
• For proteins BLAST is more sensitive than FASTA even
though BLAST uses a word size of 3 for proteins while
FASTA uses a word size of 2.
• BLAST uses a word size of 11 for nucleic acids. The
recent modifications which make it more sensitive for
proteins do not seem to work for nucleic acids.
• So therefore FASTA should be used instead of BLAST
when searching for nucleic acids.
22
Sequence database
BLAST searching
BLAST Algorithm
• It creates a word list (same as FASTA).
• Then it expands this list in order to recover sensitivity
lost by only using exact matches.
• Any word that scores at least a minimum threshold
(T) when aligned with any of the initial list of words is
added to the list.
• BLAST than examines the database for words that
exactly match any word in the expanded word list.
• Equivalent to looking for gapless alignments of score
at least T in the database.
23
Sequence database
BLAST searching
BLAST example
• The example shows an expanded list of 47 words from
the original 7. The expanded list contains any word that
scores at least eight when aligned with the initial word
and scored with the PAM 120 similarity table.
q
q
l
l
l
n
n
n
f
f
f
s
s
s
a
a
a
g
g
g
w
w
Initial
Word
ql
ln
nf
fs
sa
ag
gw
Expanded List
ql,qm, hl,zl
ln,lb
nf,af,ny ,df,qf,ef, gf,hf,kf,sf,tf,bf,zf
fs,fa,fn,fd,fg,fp,ft,fb,ys
no words score more than 8 including sa
ag
gw,aw,rw,nw,dw ,qw,ew,hw ,iw,kw,mw,
pw,sw,tw,vw ,bw,zw,xw
24
Sequence database
BLAST searching
BLAST Paradox
• Notice that there is no word that scores 8 or
more when aligned with the initial word “sa”,
even the word “sa” itself.
• This situation does occur in actual BLAST
searches.
• The user has the option to force the initial word
into the final list.
25
Sequence database
BLAST searching
BLAST default
• The default is to not include such low scoring
words because they contain so little information
that they are unlikely to be critical in finding a
maximal segment pair alignment.
• BLAST has a word length of 3 for protein
searches with a threshold score of T=13 using
the Blosum62 similarity scoring matrix.
26
Sequence database
BLAST searching
Blast final step
• The occurrence of a word hit is followed by an attempt
to find a locally optimal ungapped alignment.
• This is accomplished by accumulating the score as the
alignment is extended in both directions.
• When a run of mostly negative scores is encountered,
the cumulative score will drop substantially. When this
happens it is unlikely that the score will rebound.
• This observation provides the basis for an additional
heuristic whereby the extension of a hit is terminated
when the reduction in score exceeds a dropoff
threshold.
• The local alignment with the highest score is returned.
27
Sequence database
BLAST searching
Improvements to BLAST
• The growth of the sequence database has
raised the minimum score and hence the
length of alignment that must be found by
BLAST for a match to be significant.
• Speed and sensitivity can be improved by
requiring the algorithm find two matches
above some (lower) threshold rather that one
match. Both matches must be on the same
diagonal.
28
Sequence database
BLAST searching
New BLAST settings
• The increase in speed results from fewer
sequences which are completely evaluated.
• BLAST now looks for 2 words of length 3, that
each score at least 11 using Blosum62. The
matches must be within 40 amino acids on the
same diagonal.
• As the database grows new techniques will
need to be constantly devised.
29
Sequence database
BLAST searching
Gapped BLAST
• Builds the alignment out from a central high scoring pair
of aligned amino acids analogous to the way BLAST
extends the initial maximal segment pair alignment.
• The initial pair of amino acids is chosen as the middle
pair of the highest scoring window of 11 amino acids.
• Smith-Waterman alignment is extended in all directions
in the path graph until it falls below a fixed percentage
of the highest score yet computed in the SmithWaterman phase.
30
Sequence database
BLAST searching
Guarantees on Gapped Blast
• Will find the best scoring Smith-Waterman
alignment if:
– The calculation is extended until a score of 0 is
reached. Stopping earlier accepts a small risk of
not finding the complete alignment in return for a
very large savings in computer resources.
– The initial pair of amino acids selected as the
midpoint must actually be part of the alignment the
would be reported from a full Smith-Waterman
alignment.
31
Sequence database
BLAST searching
BLAST Warning
• Before publishing an alignment: prudent to do a
complete Smith-Waterman analysis.
• Further it is suggested to make use of the
Waterman-Eggert extensions to SmithWaterman (MaxSegs algorithm) in order to look
at the best several independent local
alignments and to examine each sequence for
repeated motifs.
32
Sequence database
BLAST searching
Finding Distant Homologies
• Many functionally and evolutionary important protein
similarities are recognizable only through comparison
of three-dimensional structures.
• When not available, patterns of conservation
identified from the alignment of related sequences
can aid the recognition of distant similarities
• These patterns are called motifs, profiles, positionspecific score matrices, and Hidden Markov Models.
33
Sequence database
BLAST searching
PSI-BLAST
• Position-Specific Iterative BLAST
• Designed to detect weak relationships by using a profile
that is constructed automatically from the multiple
alignment of the highest scoring hits in the initial BLAST
search.
• The profile is created by calculating position-specific
scores for ever amino acid at every position in the
alignment.
34
Sequence database
BLAST searching
How it works
• If a residue is highly conserved at a particular
position, it will receive a high score, and others
will be assigned high negative scores.
• At weakly conserved positions all residues
receive scores near 0.
• Position specific scores can also be assigned to
potential insertions and deletions
35
Sequence database
BLAST searching
Iteration
• The power of profile methods can be further
enhanced through iteration of the search procedure.
• After a profile is run against a database, new similar
sequences can be detected. In each iteration:
– A new multiple alignment, which includes these new
sequences can be constructed.
– A new profile abstracted.
– A new database search performed.
• The procedure can be iterated as often as desired or
until convergence (when no new statistically
significant sequences are detected).
36
Sequence database
BLAST searching
Design Goals of PSI-BLAST
• Speed, simplicity, automatic operation
• Unlike most profile-based search methods, PSIBLAST runs one program starting with a single
protein sequence as input, and the intermediate
steps of multiple alignment and profile
construction are invisible to the user.
37
Sequence database
BLAST searching
PSI-BLAST Details
• It uses the gapped BLAST program for the database
searches. A PSI-BLAST query is identical to a
Gapped BLAST query with the addition of an
expectation value cut-off for inclusion of a match in
an iteration.
• The E-value cut-off can be over-ridden by the user on
a case-by-case basis if a sequence hit of interest is
worse then the threshold. (default is 0.001)
• The multiple alignment and profile will have lengths
identical to that of the query
38
Sequence database
BLAST searching
Notes on using PSI-BLAST
• The WWW version requires the user to decide after each iteration
whether to continue. It has the advantage that the user can handpick sequences used for each profile construction, regardless of Evalue, by checking boxes next to the sequences descriptions.
• A stand-alone version of PSI-BLAST, obtainable from NCBI, allows
the user to run the program for a chosen number of iterations or
until convergence.
• This version also allows the user to save the profile produced and
use it subsequently to search another database.
39
Sequence database
BLAST searching
Warnings on using PSI-BLAST
• PSI-BLAST is a powerful tool and it requires
caution.
• The sources of error are the same as for
standard BLAST, but are easily amplified by
iteration!
40
Sequence database
BLAST searching
Sources of Errors
• The major source of deceptive alignments is the
presence within proteins of regions with highly biased
amino acid composition - low complexity.
• If such a region is included during production of a
profile, otherwise unrelated sequences containing
similarly biased regions will creep into subsequent
iterations, rendering the search nearly worthless.
41
Sequence database
BLAST searching
How to stop bias
• PSI-BLAST filters out biased regions of query sequences by
default, using the SEG program.
• SEG parameters are set to avoid masking potentially important
regions, so some bias may still persist. So PSI-BLAST can still
generate compositionally rooted artifacts.
• These cases can usually be identified by inspection - especially
when sequences that have a known bias, such as myosins or
collagens, are retrieved.
• SEG can also be set to eliminate nearly all biased regions, or
filtering procedures, such as COILS, can be used before
submitting the appropriately masked sequence to PSI-BLAST.
42
Sequence database
BLAST searching
PHI-BLAST
• Pattern Hit Initiated BLAST searches for particular
patterns in protein queries
• It takes a protein query and a pattern contained in that
sequence as input.
• It searches the database for protein sequences that
– contain the input pattern and also
– have significant similarity to the query sequence in the region of
the pattern occurrences.
43
Sequence database
BLAST searching
PHI-BLAST and PSI-BLAST
• The statistical significance of PHI-BLAST is reported
using E-values like other forms of BLAST, but the
method for computing the E-values is different.
• PHI-BLAST is integrated with PSI-BLAST so the results
of a PHI-BLAST can be used to initiate one or more
iterations of PSI-BLAST searching.
• PHI-BLAST is under development and may change
substantially over time.
44
searching
database
Sequence
Matrices
Similarity
Which Similarity Matrix to Use?
• Database searches or sequence alignments perform much better if
the similarity matrix is based on replacement patterns that
correspond to the degree of divergence of the sequences being
aligned or discovered.
• In database searching, a PAM or Blosum matrix corresponding to
an inappropriate degree of divergence can cause you to fail to
discover homologous sequences that are present in the database.
• Therefore a thorough database search will involve using at least 2
and most likely 3 different matrices.
• Using different matrices usually has a higher payoff than using
different programs and search algorithms.
45
searching
database
Sequence
Matrices
Similarity
Comparable Blosum and PAM Matrices
Blosum
Matrix
Blosum 90
Blosum 80
Blosum 60
Blosum 52
Blosum 45
Entropy PAM
Matrix
1.18
PAM 100
0.99
PAM 120
0.66
PAM 160
0.52
PAM 200
0.38
PAM 250
Entropy Sequenc e Identit y
PAM Matrix
1.18
43
0.98
38
0.70
30
0.51
25
0.36
20
46
searching
database
Sequence
Matrices
Similarity
What the Comparability Table Means
• The comparability is based on matrix entropy. Entropy
is defined by information theory as the average amount
of information per position in a sequence alignment that
is available to determine whether or not a sequence is
homologous.
• This amount of information is available only if the matrix
used in the database search is matched for the
appropriate degree of sequence divergence.
• As will be shown later this can be used to get a rough
indication of whether or not a specific database search
result is significant.
47
searching
database
Sequence
Matrices
Similarity
Scores for nucleic acids
• If possible use an amino acid sequence for a database
search because:
– There is redundancy in the genetic code with up to 6 codons
translated as the same amino acid so there is more information
in amino acid sequences once the sequences have diverged
beyond about 50 PAMs (~60% identical)
– Compositional bias found in many organisms and organelles
– Some nucleic acid sequences are derived from messengers
while others are genomic DNA with exons and the introns may
be too short to give a significant alignment with a messenger
derived sequence.
48
searching
database
Sequence
Matrices
Similarity
Search with Nucleic Acids
• There are circumstances when there is no choice but to
search with nucleic acid sequences. (for 98.9% of the
human genome!)
• BLAST uses a very long word size, 11, for nucleic acids
and the modifications to the heuristic to improve
sensitivity for protein sequences do not work as well for
nucleic acids because they have only a four letter
alphabet and the similarity scores are usually calculated
with equal rates of replacement for all of the
nucleotides.
• Thus FASTA is more sensitive than BLAST for nucleic
acid sequences and should be used instead of BLAST
if you want to use one of the faster searching programs.
49
searching
database
Sequence
Matrices
Similarity
Nucleic Acid Matrices
• There are matrix of replacements for nucleic acids just
as has been recommended for proteins.
• Commonly used are PAM 47 scores assuming equal
rates of transitions and transversions. This assumption
leaves us with only two scores, 5 for identities or
matches and -4 for nonidentities or mismatches.
• It is possible to create nucleic acid scores that do not
assume equal rates of transitions and transversions.
– For example assumes a three to one transition to transversion
ratio might be more appropriate than the defaults
50
searching
database
Sequence
and deletions
Insertions
Scoring Insertions and Deletions
• In most alignment and search programs, the gap
penalty consists of two terms, the cost to open the gap
and the cost to extend the gap.
• This form of gap penalty has been shown to give better
results in searching and alignments then a form of
penalty that includes only one of the terms.
• The selection of appropriate scores for insertions and
deletions, the gap penalties, is as important as
selecting the similarity scores for the success of a
database search.
51
searching
database
Sequence
and deletions
Insertions
Purpose of Gap Penalties
• The intuitive choice of gap penalties is to use one designed to
create the most accurate possible alignment between your query
sequence and any homologues in the database.
• So we may need to accommodate several small gaps in highly
variable regions of the sequence which implies small penalties for
opening a gap.
• But more knowledge, experience, and thought about the
experiment being performed in a database search shows that the
most accurate alignment is not the correct goal.
• A little thought shows that the appropriate goal is to distinguish any
homologous sequences from the vast majority of nonhomolgous
sequences in the database.
52
searching
database
Sequence
and deletions
Insertions
Expected Maximum Scores
• The score for homologous sequences will be
dominated by relatively long sections of the
alignment uninterrupted by gaps. Thus
changing the gap penalty will only have a minor
effect on the score for an alignment.
• Gap penalties, however, are critical in
determining the expected maximum score
for random sequences being compared with
your query sequence.
53
searching
database
Sequence
and deletions
Insertions
High Non-homologous Scores
• The number of nonhomologous sequences that achieve
a particular score decreases exponentially as the
score increases as long as the gap penalties are
sufficiently large relative to the similarity scores used to
compare the sequences.
• If the gap penalties are too small relative to the
similarity scores the number of sequences achieving a
particular score decrease as the square of the number
of sequences.
• This leads to an increase in the number of relatively
high scoring non-homologous sequences.
54
searching
database
Sequence
and deletions
Insertions
Decay:
Exponential versus Square
700
600
500
400
Exponential
Square
300
200
100
0
10
30
50
70
90
55
searching
database
Sequence
and deletions
Insertions
Obscure Homologous Sequences
• Unfortunately there is no analogous effect on
the scores of the homologous sequences.
• Thus choosing gap penalties that are too
small leads to matches with homologous
sequences being obscured by spurious
matches with nonhomologous sequences.
56
searching
database
Sequence
and deletions
Insertions
Length of Gaps
• While we want to force alignments to have relatively few
gaps we need to allow them to be fairly long.
• This allows a few gaps to accommodate a larger number of
insertion and deletion events in the evolutionary history of the
sequences.
• So we should match a large penalty for opening a new gap with
a small penalty for lengthening or extending a gap.
• In this context “large” will typically be two to three times the
largest negative value in the similarity matrix and at least as
large as the largest positive value.
• For an integer similarity matrix the gap extension penalty will
typically be 1-4 and should not be 0.
57
searching
database
Sequence
and deletions
Insertions
Statistical Significance
• A Significance probability is an answer to the question:
– How often will an event at least as extreme as this one just
observed happen if these events are the result of a well
defined, specific, random process?
• For database searches the random process is evolution
from independent, unrelated ancestral sequences rather
than a common ancestral sequence.
• With respect to any newly determined query sequence
most of the sequences in the database are such
randomly evolved sequences.
58
searching
database
Sequence
and deletions
Insertions
Random Sequences
• We expect the scores for random sequences
aligned with our query sequence to decay
exponentially.
• For a group of sequences that achieve at least some
minimum score we expect only some constant fraction
of them to achieve a given higher score.
• If the gap penalty is high enough this exponential decay
is observed in practice.
59
searching
database
Sequence
and deletions
Insertions
Predicting Good Scores
• This exponential decay process gives a straight line
when we plot the log of the number of sequences that
achieve a particular score against the score.
Extrapolating this line to its intercept with the score axis
gives us a score higher than we might reasonably
expect to see from a group of random sequences the
size of the database.
• Thus any appreciably higher score is a statistically
rare event deserving careful evaluation.
60
searching
database
Sequence
and deletions
Insertions
Problems if Gap to low
• Empirically this depends on the gap penalty being high
enough.
• If the gap penalty is too low, the scores will decay as
the square of the number of sequences achieving at
least a given score.
• This slower decay obscures most homologous matches
(as discussed before) and also invalidates the fitting of
the scores to a logarithmic function and yields a much
too high extrapolated score.
61
searching
database
Sequence
and deletions
Insertions
Difference between Logs
10
9
8
7
6
Log(Exponential)
5
Log(Square)
4
3
2
1
0
10 20 30 40 50
60 70 80 90 100
62
searching
database
Sequence
testing
Statistical
Information Theory
• Every alignment has a certain amount of information
per position about whether or not the aligned
sequences are homologous. This is called entropy.
• Every similarity matrix has an associated amount
of entropy.
• This amount of entropy is a maximum and is
achieved only in alignments between sequences that
have diverged to the degree appropriate to the
matrix.
63
searching
database
Sequence
testing
Statistical
Entropy
• It is currently impossible to assign an entropy value to alignment
gaps corresponding to insertions or deletions.
• Thus entropy can be computed only for regions of the
alignment uninterrupted by gaps.
• The amount of information (or entropy) associated with an
uninterrupted region of the alignment is given by the length of the
alignment in sequence residues multiplied by the entropy value for
the similarity matrix.
• It is measured in units called bits, which is the amount of
information need to answer a yes no question.
64
searching
database
Sequence
testing
Statistical
Entropy of Random Sequences
• In order to make use of the information of an alignment we
need to know how much information we expect to find in
alignments between random sequences.
• The answer obviously depends on:
– Length of the query sequence
– Size of the database
• Without going into the details of the derivation, this is the
amount of information required to repeatedly divide the
database into the remaining half containing the best
scoring segment.
65
searching
database
Sequence
testing
Statistical
Test for Statistical Significance
• S ≈ log2(mn)
– The minimum amount of information (in bits) for a significant
match is given by the logarithm (base 2) of the product of the
length of the query sequence and the number of residues in
the database.
• If there is more than this amount of information in the
alignment it is statistically significant and should be
investigated further.
• This is the basis of the statistical significance tests in
BLAST - now used by the other algorithms as well.
• E = mn 2-S
– is the E-value formula where m is the length of the
alignment, n is the size of the data base and S is the bit
score of entropy.
66
searching
database
Sequence
testing
Statistical
E Value
• The E-value is the chance number of alignments with a
certain score, s.
– E=Kmn e-s
– where s is the similarity score, m and n are as above, and K
and  can be thought of as natural scales for the search space
size and scoring system respectively.
• This makes intuitive sense since doubling either the
size of the database or the size of the alignment should
double the number of random strings attaining that
score. Also for us to attain a score twice as good, 2x, it
must attain the score x twice in a row, so one expects E
to decrease exponentially with score.
67
searching
database
Sequence
testing
Statistical
Converting raw scores to bit scores
S
s  ln K
ln 2
• S is bit score, s is alignment score
 and translation parameters)
• , K (scale
– Can be estimated empirically for a given database
and score matrix (even including gaps) or calculated
exactly assuming a random ungapped model and
background amino acid frequencies
68
searching
database
Sequence
testing
Statistical
Usefulness of Statistical Measures
• The usefulness of statistical measures is dependent on:
– Whether the assumptions of the underlying statistical model are
correct
– The kinds of errors that one is willing to accept when using the
measure to draw a conclusion
• The threshold for statistical significance will vary
depending on which we are more concerned with:
– Labeling a sequence as related when it is not - misidentifying a
non-homologue (False Positive or Type I error)
– Labeling a sequence as non-homologous when a high scoring
homologue has been found - missing a likely homologue (False
Negative or Type II error)
69
searching
database
Sequence
testing
Statistical
Problems with Statistical Models
• However, with biological sequences (as
opposed to “fair” coins), the assumptions
underlying the statistical model may not be
met.
• In general inaccurate statistical estimates are
cause either by :
– Incorrect gap penalties
– Low complexity regions - runs of simple amino
acid composition
70
searching
database
Sequence
testing
Statistical
Additional Statistical Checks
• Prss3 in the FASTA collection evaluates the significance of a
sequence similarity score by comparing two sequences and
calculating optimal scores and then repeatedly shuffling the second
sequence and getting optimal similarity scores using the SmithWaterman algorithm. You must use 500-2000 shuffles and
remember to normalize the statistical significance to the size of the
database originally searched. - permutation test, randomization
testing
• Randseq also in FASTA produces a random sequence with the
same length and amino acid composition as a query sequence. In
general in a database search, the highest scoring match to a
random query sequence should have an E value of 1.
71
Sequence database searching
References
• http://www.psc.edu/biomed/TUTORIALS/SEQUENCE/D
BSEARCH/tutorial.html
• http://www.ncbi.nlm.nih.gov/BLAST/
• Flexible Sequence Similarity Searching with the
FASTA3 program package, William Pearson
• Bioinformatics A Practical Guide the Analysis of Genes
and Proteins Baxevanis & Ouellette
72