Transcript PowerPoint

(Alignment) Score Statistics
cbio course, spring 2005, Hebrew University
Motivation
• Reminder:
• Basic motivation: we want to check if 2
sequences are “related” or not
• We align 2 sequences and get a score (s) which
measures how similar they are
• Given s, do we accept or reject the hypothesis that the 2 are
related?
How high should s be so that we “believe”
they are related?
Motivation (2)
• We need a “rigorous way” to decide on a threshold s* so
that for s > s*, we call the sequences “related”
• Note:
• s* should obviously be s*(n,m), where n and m are the
lengths of the 2 aligned sequences
• When we try matching sequence x against a D.B of N
(N>>1) sequences, we need to account for the fact we
might see high scores “just by chance”
• We can make 2 kinds of mistakes in our calls:
• FP
• FN
→ We want our “rigorous way” to control FP and FN mistakes
Motivation (3)
• The problem of assigning statistical significance to
scores and controlling our FP and FN mistakes is of
general interest.
• Examples:
• Similarity between a protein sequence and a profile
HMM
• Log ratio scores when searching DNA sequence
motifs
• ….
• The methods we develop now will be of general
use
Reminder
• In the last lesson we talked about 2 ways to
analyze alignment scores and their significance:
• Bayesian
• Classical EVD approach
• We reviewed how the number of FP mistakes can
be controlled using each of these approaches
• We reviewed the Karlin & Altschul (1990) results
Review First Approach – Bayesian
Assume we have two states in our world:
M (Model = related sequences)
R (Random = unrelated sequences)
Given a fixed alignment of two sequences (x,y), we ask: “which
state did it come from, M or R?”
We saw:
Where:
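The equations on this slide were lost in the transcript; what follows is a sketch of the standard log-odds formulation this setup typically refers to (here p_{ab} is the joint probability of aligning a to b under M, q_a the background frequency of a letter under R, and σ the logistic function):

```latex
S = \log \frac{P(x,y \mid M)}{P(x,y \mid R)}
  = \sum_{i} \log \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}},
\qquad
P(M \mid x,y) = \sigma\!\left(S + \log \frac{P(M)}{P(R)}\right),
\quad \sigma(z) = \frac{1}{1 + e^{-z}}
```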
Review Bayesian Approach cont.
• We saw that in order to control the expected
number of false identifications, when testing scores
which came from R, we need the threshold over the
scores S* to have S* ~ log(number of trials * K )
Where:
The number of trials for scoring a sequence of length m in local alignment against N
sequences of length n is nmN
K in [0,1] is a correlation factor compensating for the fact that the trials are correlated.
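A minimal numeric sketch of this threshold rule (the values of K, n, m, and N below are made up for illustration); the key point is that S* grows only logarithmically in the number of trials:

```python
import math

def score_threshold(n, m, N, K):
    """Threshold S* ~ log(number of trials * K): the number of trials
    for local alignment of a length-m query against N database
    sequences of length n is n*m*N, and K in (0,1] compensates for
    the fact that the trials are correlated."""
    trials = n * m * N
    return math.log(K * trials)

# Doubling the database size N raises the threshold by only log(2):
s1 = score_threshold(n=300, m=300, N=100_000, K=0.1)
s2 = score_threshold(n=300, m=300, N=200_000, K=0.1)
```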
Review EVD Approach
• In the EVD approach we are interested in the question: “given a score s for
aligning x and y, if this s came from a distribution of scores for unrelated
sequences (like R in the Bayesian approach), what is the probability of seeing a
score as good as s by chance, simply because I tried so many matches of
sequences against x?”
• R here is the null hypothesis we are testing against.
• If P(score >= s | we tried N scores) < Threshold (say 0.01) then we “reject” the
null hypothesis (R)
• NOTE:
• There is no “second” hypothesis here.
• We are guarding against Type I errors (FP)
• No control or assumptions are made about FN here !!
• This setting is appropriate for the problem we have at hand (D.B search)
Toy Problem
• Let s,t be two randomly chosen DNA sequences of length n sampled from the
uniform distribution over the DNA alphabet.
• Align s versus t with no gaps (i.e. s[1] is aligned to t[1] until s[n] is aligned to t[n].)
• What is the probability that there are k matches (not necessarily continuous
ones) between s and t?
• Suppose you are a researcher and you have two main hypotheses:
either these two sequences are totally unrelated, or there was a common
ancestor to both of them (there is no indel option here).
• How would you use the number of matches to decide between the two options
and attach a statistical confidence measure to this decision?
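Since each of the n positions matches independently with probability p = 1/4 under the uniform model, the number of matches is Binomial(n, 1/4). A short sketch of the resulting probabilities and the p-value of a score:

```python
from math import comb

def p_k_matches(n, k, p=0.25):
    """P(exactly k matches) for two uniform random DNA sequences of
    length n aligned without gaps: Binomial(n, p) with p = 1/4."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def pvalue(n, s, p=0.25):
    """P(at least s matches) -- the p-value of observing score s
    under the 'totally unrelated' null hypothesis."""
    return sum(p_k_matches(n, k, p) for k in range(s, n + 1))
```

To decide between the two options, one would reject the "unrelated" null hypothesis when `pvalue(n, k)` for the observed match count k falls below a chosen significance level.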
[Figure: Empirical distribution over scores for p = 0.25, using M = 100K samples; one curve for n = 20 and one for n = 100 (x-axis: score 0–100, y-axis: probability 0–0.25). The p-value for score = 30 at n = 100 is marked.]
NOTE: As in our “real” problem, the p-value of a score depends on n
EVD for our problem
In the EVD approach, we are interested in the question:
“what is the probability of seeing a score as good as
S*, only from matches to unrelated sequences, if I tried N
such matches?”
Compute:
P{ Max(S1 … SN) >= S* } = 1 - [1 - pvalue(S*)]^N
If we want to guarantee that P{ Max(S1 … SN) >= S* } < 0.05, where the Si are
scores of matches against unrelated sequences sampled i.i.d., then:
[1 - pvalue(S*)]^N > 0.95
i.e.
pvalue(S*) < 1 - 0.95^(1/N)
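A minimal simulation sketch of this relation, using the toy Binomial score distribution (n = 100, p = 0.25; all numbers here are illustrative): the chance that the maximum of N i.i.d. null scores reaches S* is 1 - [1 - pvalue(S*)]^N.

```python
import random
from math import comb

random.seed(0)
n, p, N, s_star = 100, 0.25, 10, 35  # toy values, chosen for illustration

def pvalue(s):
    """P(score >= s) under the Binomial(n, p) null distribution."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(s, n + 1))

# P{ Max(S1 .. SN) >= S* } = 1 - [1 - pvalue(S*)]^N for i.i.d. null scores.
exact = 1 - (1 - pvalue(s_star)) ** N

# Empirical check: repeatedly draw the max of N binomial null scores.
def draw_score():
    return sum(random.random() < p for _ in range(n))

trials = 5_000
hits = sum(max(draw_score() for _ in range(N)) >= s_star for _ in range(trials))
empirical = hits / trials
```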
[Figure: Comparison of the original distribution over scores to its EVD with N = 10 (original's parameters: n = 100, p = 0.25, M = 100K); x-axis: Score 0–100, y-axis: Prob 0–0.18.]
Guarding against mistakes & evaluating
performance
• In the EVD we kept guarding against FP mistakes. This is very important when
doing tasks where many tests are performed, as in our case of D.B search
• Sometimes we are not able to compute the EVD and we still want to control the FPR. A
very strict and simple solution is the “Bonferroni corrected p-value” = p-value * N,
where N is the number of tests performed.
Note: The relation to the “union bound” is clear
Problem: Bonferroni controls the FWER (family-wise error rate), i.e. the probability of
seeing even 1 mistake (a FP) among the results we report as significant.
It does so with essentially no assumptions on the distribution, the relations
between the hypotheses tested, etc., and still guarantees control over the FWER.
The price to pay is in FN…
Bonferroni example for our case
• We saw:
If we want to guarantee that P{ Max(S1 … SN) >= S* } < 0.05,
where the Si are scores of matches against unrelated sequences sampled i.i.d., then:
[1 - pvalue(S*)]^N > 0.95
i.e.
pvalue(S*) < 1 - 0.95^(1/N)
Compare, for N = 10, the result of this equation: 0.005116
to the Bonferroni-corrected p-value threshold: 0.05/N = 0.005
For N = 20: 0.002561 vs. 0.0025
etc…
If we used the strict Bonferroni correction for the same guarantee level
we wanted, we might have rejected some “good” results.
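The two corrections can be put side by side in a few lines (a sketch; the exact per-test cutoff for a family-wise level of 0.05 is 1 - 0.95^(1/N)):

```python
def exact_threshold(alpha, N):
    """Per-test p-value cutoff so that P(any of N i.i.d. null tests
    passes) stays below alpha: solves (1 - t)^N = 1 - alpha."""
    return 1 - (1 - alpha) ** (1 / N)

def bonferroni_threshold(alpha, N):
    """The stricter Bonferroni cutoff alpha / N (union bound)."""
    return alpha / N

for N in (10, 20):
    t_exact = exact_threshold(0.05, N)
    t_bonf = bonferroni_threshold(0.05, N)
    # N=10 gives 0.005116 vs. 0.005000; N=20 gives 0.002561 vs. 0.002500
    print(f"N={N}: exact {t_exact:.6f} vs. Bonferroni {t_bonf:.6f}")
```

The Bonferroni cutoff is always slightly stricter, which is where the extra FN cost mentioned above comes from.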
How to estimate performance?
• Say you have a method with a score (in our case: “method” = scoring
local alignment with affine gaps and scoring matrix “Sigma”, e.g. Sigma
= PAM1)
• You set a threshold over the scores based on some criterion (e.g. an EVD
estimate of the scores in random matches)
• You want to evaluate your method's performance on some “test” data set.
The data set would typically contain some true and some false examples.
• Assumption: you KNOW the answers for this test set!
Aim: You want to see the tradeoff you get for using various thresholds on
the scores, in terms of FP and FN on the data set.
ROC curves
• ROC = Receiver Operating Characteristic curve
Sensitivity = TP / (TP + FN) =
“What ratio of the true ones we capture”
FPR = False Positive Rate =
Empirical p-value =
FP / (FP + TN) =
FP / (“real negatives”) =
“What ratio of the bad ones we pass”
[Figure: ROC curve with Sensitivity (0%–100%) on the y-axis and FPR (0%–100%) on the x-axis; the best performance lies toward the top-left corner.]
ROC curves
• NOTE: Each point on the ROC matches a certain threshold over the method’s scores
• Each method gives a different curve
• We can now compare methods’ performance:
• At a certain point on the graph
• Via the total area under the graph
[Figure: the same ROC axes (Sensitivity 0%–100% vs. FPR 0%–100%); a specific operating point at FPR = 2% is marked.]
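A small sketch of turning labeled scores into an ROC curve and its area under the curve (the toy scores below are made up):

```python
def roc_points(pos_scores, neg_scores):
    """Sweep the threshold over all observed scores; each threshold
    yields one (FPR, sensitivity) point on the ROC curve."""
    pts = []
    for t in sorted(set(pos_scores) | set(neg_scores), reverse=True):
        tp = sum(s >= t for s in pos_scores)
        fp = sum(s >= t for s in neg_scores)
        sens = tp / len(pos_scores)   # TP / (TP + FN)
        fpr = fp / len(neg_scores)    # FP / (FP + TN)
        pts.append((fpr, sens))
    return [(0.0, 0.0)] + pts

def auc(points):
    """Area under the curve by the trapezoid rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Toy scores: "related" pairs tend to score higher than "unrelated" ones.
pos = [40, 38, 35, 33, 30]
neg = [31, 28, 27, 25, 22]
curve = roc_points(pos, neg)
```

Comparing two methods "via the total area under the graph" then amounts to comparing their `auc` values on the same test set.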
Back to our Toy Problem
Assume the data we need to handle came from two sources,
as in the Bayesian approach:
R – unrelated sequences, p(a,a) = 0.25
M – related sequences, p(a,a) = 0.4, p(a,b) = 0.2
Delta scoring matrix, i.e. S(a,a) = 1, S(a,b) = 0
[Figure: Distribution over scores, p = 0.25 vs. p = 0.4, M = 100K samples; x-axis: Score 0–100, y-axis: Prob 0–0.1.]
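With the delta scoring matrix the score is just the number of matches, so it is Binomial(n, 0.25) under R and Binomial(n, 0.4) under M. A sketch comparing the two (n = 100 as in the plot; the threshold value 33 is an arbitrary illustration):

```python
from math import comb

def binom_pmf(n, k, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 100
# Score distribution under R (p = 0.25) and under M (p = 0.4).
pR = [binom_pmf(n, k, 0.25) for k in range(n + 1)]
pM = [binom_pmf(n, k, 0.40) for k in range(n + 1)]

# For a threshold t, FPR = P_R(score >= t) and sensitivity = P_M(score >= t),
# so sweeping t over these two curves traces out exactly the ROC curve.
def fpr(t): return sum(pR[t:])
def sens(t): return sum(pM[t:])
```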
Finish with a Thought…
• In our toy problem – what’s the relation between the
graph of the last slide and the ROC curve we talked
about?
• How does the relative amount of samples from M
and R in our data set affect the ROC? What should
the total distribution over the scores look like?