Transcript Slide 1

Significance in protein
analysis
Swapan ‘Shop’ Mallick
Bioinformatics Group
Institute of Biotechnology
University of Helsinki
Overview
The need for statistics
Example: BLOSUM
What do the scores mean?
How can you compare two scores?
Example: BLAST
Problems with BLAST
Review of Distributions
Distribution of random BLAST results
P-values and e-values
Statistics of BLAST
Summary and Conclusion
Exercise
The need for statistics
• Statistics is very important for bioinformatics.
– It is very easy to have a computer analyze the data
and give you back a result.
– The problem is to decide whether the answer the computer
gives you is any good at all.
• Questions:
– How statistically significant is the answer?
– What is the probability that this answer could have
been obtained at random? What does this depend on?
Basics
[Figure: a population (N) and a sample drawn from it (n, X, S), labelled "Descriptive statistics" and "Probability"]
Example: BLOSUM
The BLOSUM matrix assigns a probability
score for each residue pair in an alignment
based on:
the frequency with which that pairing is known to
occur within conserved blocks of related proteins.
Simple since size of population = size of sample
BLOSUM matrices are constructed from
observations which lead to observed
probabilities
BLOSUM substitution matrices
BLOSUM matrices are used in
‘log-odds’ form based on
actually observed substitutions.
This is because:
Ease of use: ‘Scores’ can be just
added (the raw probabilities would
have to be multiplied)
Ease of interpretation:
S=0 : substitution is observed just as
often as expected at random
S<0 : substitution is observed less
often than expected at random
S>0 : substitution is observed more
often than expected at random
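The "ease of use" point can be illustrated with a minimal sketch. The three observed/expected ratios below are made-up values, not taken from any real matrix; the sketch only shows that adding log-odds scores is equivalent to multiplying the raw ratios.

```python
import math

# Hypothetical observed/expected ratios for three aligned residue pairs.
odds_ratios = [2.0, 0.5, 4.0]

# Working with raw ratios means multiplying along the alignment.
product = math.prod(odds_ratios)

# Working with log-odds scores means simply adding them.
log_odds = [math.log(r) for r in odds_ratios]
total_score = sum(log_odds)

# Both routes carry the same information: exp(sum of logs) equals the product.
print(product, math.exp(total_score))
```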
Substitution matrices
Score of amino acid a with amino acid b:
S(a,b) = \frac{1}{\lambda} \ln \frac{p_{ab}}{f_a f_b}
p_ab is the observed frequency that
residues a and b are correlated because
of homology
f_a f_b is the expected frequency of seeing residues a and b paired
together, which is just the product of the frequency of residue
a multiplied by the frequency of residue b
λ is a scaling factor equal to 0.347, set
so that the scores can be rounded off to sensible
integers
Source: Where did the BLOSUM62 alignment score matrix come from?
Eddy S., Nat. Biotech. 22 Aug 2004
Substitution matrices
The score can be turned back into an observed/expected (O/E) ratio:
\frac{p_{ab}}{f_a f_b} = e^{\lambda S}
where p_ab is the observed frequency of the pair, f_a f_b is the
expected frequency, and λ = 0.347 is the scaling factor.
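As a rough sketch of this relation, the function below converts an integer score back into its observed/expected ratio. The function name `oe_ratio` is made up for illustration; the only number taken from the slide is λ = 0.347.

```python
import math

LAMBDA = 0.347  # scaling factor quoted on the slide

def oe_ratio(score, lam=LAMBDA):
    """Observed/expected ratio implied by a log-odds score: exp(lam * score)."""
    return math.exp(lam * score)

for s in (0, 5, 10, -10):
    print(s, round(oe_ratio(s), 3))

# The ratio of two O/E ratios depends only on the score difference.
s1, s2 = 10, 5
ratio_of_ratios = oe_ratio(s1) / oe_ratio(s2)
print(ratio_of_ratios, oe_ratio(s1 - s2))
```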
i) S=0 : O/E ratio = 1
ii) Compare S=5 and S=10. Ratio is based on exponential function:
S=5 : O/E ratio = 5.7
S=10 : O/E ratio = 32.1
iii) S=-10 : O/E ratio = 0.031 ≈ 1/32
iv) Ratio of scores S1, S2 in terms of probabilities of observed/random =
e^{\lambda S_1} / e^{\lambda S_2} = e^{\lambda (S_1 - S_2)}
Example: BLAST
Motivations
Exact algorithms are exhaustive but
computationally expensive.
Exact algorithms are impractical for comparing
a query sequence to millions of other sequences
in a database (database scanning),
and so database scanning requires a heuristic
alignment algorithm (at the cost of optimality).
Interpret BLAST results - Description
ID (GI #, RefSeq #, DB-specific ID #): click to access the record in GenBank
Gene/sequence definition
Bit score: higher is better. Click to access the pairwise alignment
Links
Expect value: lower is better. It estimates how many hits with a score
this good would be expected by chance
Problems with BLAST
Why do results change?
How can you compare results from different
BLAST tools which may report different types of
values?
How are results (e.g. E-value) affected by query
There are _many_ values reported in the output –
what do they mean?
Example: Importance of Blast statistics
But, first a review.
Review
What is a distribution?
A plot showing the frequency of a given variable or
observation.
Features of a Normal Distribution
Symmetric Distribution
Has an average or mean
value at the centre
Has a characteristic width
called the standard deviation
(S.D. = σ)
Most common type of
distribution known
μ = mean
Standard Deviations (Z-score)
Range           Fraction within    Upper tail         Fraction above
μ ± 1.0 S.D.    0.683              > μ + 1.0 S.D.     0.159
μ ± 2.0 S.D.    0.954              > μ + 2.0 S.D.     0.023
μ ± 3.0 S.D.    0.9973             > μ + 3.0 S.D.     0.0014
μ ± 4.0 S.D.    0.99994            > μ + 4.0 S.D.     0.00003
μ ± 5.0 S.D.    0.9999994          > μ + 5.0 S.D.     0.0000003
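The fractions in this table can be recomputed from the error function. This is a minimal sketch using only the Python standard library; the helper names `within` and `upper_tail` are made up for illustration.

```python
import math

def within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

def upper_tail(k):
    """Fraction of a normal distribution more than k standard deviations above the mean."""
    return 0.5 * math.erfc(k / math.sqrt(2))

for k in (1, 2, 3, 4, 5):
    print(f"{k} S.D.: within = {within(k):.7f}, upper tail = {upper_tail(k):.7f}")
```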
Mean, Median & Mode
[Figure: a distribution with the mode, median and mean marked]
Mean, Median, Mode
In a Normal Distribution the mean, mode and
median are all equal
In skewed distributions they are unequal
Mean - average value, affected by extreme values
in the distribution
Median - the “middlemost” value, usually lying
between the mode and the mean
Mode - most common value
Different Distributions
Unimodal
Bimodal
Other Distributions
Binomial Distribution
Poisson Distribution
Extreme Value Distribution
Binomial Distribution
P(x) = (p + q)^n
Coefficients of the expansion for n = 0 to 5 (Pascal's triangle):
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
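A short sketch of how the triangle relates to binomial probabilities: row n holds the coefficients of the expansion of (p + q)^n, and each term of the expansion is the probability of one outcome. The function name and the choice p = 0.3 are illustrative assumptions.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials with success probability p."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

# Row 5 of Pascal's triangle: the coefficients of (p + q)^5.
row5 = [comb(5, x) for x in range(6)]
print(row5)

# The terms of the expansion sum to (p + q)^5 = 1.
total = sum(binomial_pmf(x, 5, 0.3) for x in range(6))
print(total)
```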
Poisson Distribution
P(x) = \frac{\mu^x e^{-\mu}}{x!}
[Figure: Poisson distributions for μ = 0.1, 1, 2, 3 and 10; x-axis: x, y-axis: proportion of samples, P(x)]
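The Poisson formula can be evaluated directly. In this sketch the mean parameter is written `mu` (an assumption about the symbol lost from the slide), and the five means are the ones listed for the plotted curves.

```python
import math

def poisson_pmf(x, mu):
    """Probability of observing exactly x events when the mean number of events is mu."""
    return mu**x * math.exp(-mu) / math.factorial(x)

# Probabilities of x = 0..5 for each of the plotted means.
for mu in (0.1, 1, 2, 3, 10):
    probs = [round(poisson_pmf(x, mu), 3) for x in range(6)]
    print(mu, probs)
```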
Review
What is a distribution?
A plot showing the frequency of a given variable or observation.
What is a null hypothesis?
A statistician’s way of characterizing “chance.”
Generally, a mathematical model of randomness with respect to
a particular set of observations.
The purpose of most statistical tests is to determine whether the
observed data can be explained by the null hypothesis.
Review
Examples of null hypotheses:
Sequence comparison using shuffled sequences.
A normal distribution of log ratios from a microarray
experiment.
LOD scores from genetic linkage analysis when the
relevant loci are randomly sprinkled throughout the
genome.
Empirical score distribution
The picture shows a
distribution of scores
from a real database
search using BLAST.
This distribution
contains scores from
non-homologous and
homologous pairs.
High scores from homology.
Empirical null score distribution
This distribution is
similar to the previous
one, but generated
using a randomized
sequence database.
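One way to build such an empirical null distribution is to score the query against shuffled copies of a target, since shuffling keeps the composition but destroys any homology. The sketch below is only illustrative: the two sequences are made up, and the score is a toy identity count over an ungapped comparison, not a BLAST score.

```python
import random

def identity_score(a, b):
    """Toy score: number of identical positions in an ungapped, end-to-end comparison."""
    return sum(x == y for x, y in zip(a, b))

def null_scores(query, target, n_shuffles=1000, seed=0):
    """Scores of the query against shuffled copies of the target (same composition, no homology)."""
    rng = random.Random(seed)
    residues = list(target)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        scores.append(identity_score(query, "".join(residues)))
    return scores

# Made-up sequences that differ at two positions.
query = "MKTAYIAKQRQISFVKSHFSRQ"
target = "MKTAYIAKQRNISFVKEHFSRQ"

real = identity_score(query, target)
null = null_scores(query, target)
print("real score:", real, "highest shuffled score:", max(null))
```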
Review
What is a p-value?
The probability of observing an effect as strong or
stronger than you observed, given the null hypothesis.
I.e., “How likely is this effect to occur by chance?”
Pr(x > S|null)
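Given a set of scores generated under the null hypothesis, this probability can be estimated by counting. The sketch below uses made-up null scores; the add-one correction is an assumption of this sketch (a common way to avoid reporting p = 0 from a finite sample), not something stated on the slides.

```python
def empirical_p_value(observed, null_scores):
    """Estimated probability of a null score at least as large as the observed score."""
    hits = sum(s >= observed for s in null_scores)
    # Add-one correction: a finite sample can never support p = 0.
    return (hits + 1) / (len(null_scores) + 1)

# Made-up scores from a randomized search.
null = [12, 15, 9, 20, 14, 11, 18, 13, 10, 16]
p = empirical_p_value(18, null)
print(p)
```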
Review
What is the name of the
distribution created by
sequence similarity scores,
and what does it look
like?
Extreme value distribution, or
Gumbel distribution.
It looks similar to a normal
distribution, but it has a
larger tail on the right.
[Figure: histogram of sequence similarity scores; x-axis: score bins from <20 to >120, y-axis: counts from 0 to 8000]
Statistics
BLAST scores (and also other local alignment scores, e.g. Smith-Waterman and BLAT)
between random, unrelated sequences follow the Gumbel Extreme
Value Distribution (EVD)
Pr(s > S) = 1 - \exp(-K m n e^{-\lambda S})
This is the probability of randomly encountering a score greater than S.
S: alignment score
m, n: query sequence length and length of database, respectively
K, λ: parameters depending on scoring scheme and sequence composition
Bit score: S' = \frac{\lambda S - \ln K}{\ln 2}
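These two formulas can be sketched in a few lines. The values of K and λ below are illustrative assumptions, not values from the slides; in practice they are taken from the footer of the BLAST report, because they depend on the scoring scheme.

```python
import math

def evd_p_value(S, m, n, K, lam):
    """Pr(s > S) = 1 - exp(-K*m*n*exp(-lam*S)) for a raw score S."""
    expected_hits = K * m * n * math.exp(-lam * S)
    # expm1 keeps precision when expected_hits is very small.
    return -math.expm1(-expected_hits)

def bit_score(S, K, lam):
    """Normalized score S' = (lam*S - ln K) / ln 2."""
    return (lam * S - math.log(K)) / math.log(2)

# Assumed, illustrative parameters and sizes.
K, lam = 0.041, 0.267
m, n = 300, 1_000_000
S = 120

print("bit score:", round(bit_score(S, K, lam), 1))
print("p-value:", evd_p_value(S, m, n, K, lam))
```

Because K and λ are absorbed into S', bit scores from searches with different scoring schemes can be compared directly.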
BLAST output revisited
[Figure: BLAST output (from: Expasy BLAST) annotated with S', S, λ, K, E, n and m]
Review
EVD for random blast
Upper tail behaviour:
Pr(s > S) ~ K m n e^{-\lambda S}
This is the EXPECT value = E-value
[Figure: histogram of scores; x-axis: score bins from <20 to >120, y-axis: counts from 0 to 8000]
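The link between the E-value and the p-value follows from the EVD formula, p = 1 - exp(-E). The sketch below (function name made up for illustration) shows that the two are nearly identical for small E and diverge once E approaches 1.

```python
import math

def p_from_e(e_value):
    """p-value implied by an E-value under the EVD: p = 1 - exp(-E)."""
    return -math.expm1(-e_value)

for e in (1e-10, 0.01, 1.0, 10.0):
    print(e, p_from_e(e))
```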
Summary
Want to be able to compare scores in
sequences of different compositions or
different scoring schemes
Score: S = sum(match) - sum(gap costs)
Bit score: S' = \frac{\lambda S - \ln K}{\ln 2}
E-value of bit score: E = m n 2^{-S'}
Score and bit score grow linearly with
the length of the alignment
E-value shrinks really fast as bit
score grows
E-value grows linearly with the
product of target and query sizes.
Doubling target set size and doubling query
length have the same effect on e-value
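The scaling behaviour in this summary can be checked with a short sketch. The bit score and the sizes below are made-up example values.

```python
def e_value(bit_score, m, n):
    """E = m * n * 2^(-S') for query length m and database length n."""
    return m * n * 2.0 ** (-bit_score)

base = e_value(50, m=300, n=1_000_000)
double_query = e_value(50, m=600, n=1_000_000)
double_target = e_value(50, m=300, n=2_000_000)
one_more_bit = e_value(51, m=300, n=1_000_000)

# Doubling either size doubles E; one extra bit of score halves it.
print(base, double_query, double_target, one_more_bit)
```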
Conclusion
You should now be able to compare BLAST results from different
databases, converting values if they are reported differently (which
happens frequently)
You should now know why BLAST results might change from one day to
the next, even on the same server
You should also understand how the E-value depends on query length.
Statistical rankings are reported for (almost) every database search tool.
When making comparisons between databases or between sequences, it is
useful to know how the statistics are derived, to know if the comparisons are
meaningful.
THE END
Supplemental
Section
Look through: Patterns in sequences (Searching
for information within sequences) - Some
common problems and their solutions:
http://lepo.it.da.ut.ee./~mremm/kurs/pattern.htm
What is the structure of my sequence?
http://speedy.embl-heidelberg.de/gtsp/flowchart2.html
(clickable!)