Lecture 1/17/2006
Back to basics – Probability, Conditional Probability and
Independence
• The probability of an outcome in an experiment is the proportion of times
that this particular outcome would occur in a very large ("infinite")
number of replicated experiments
• A random variable is a mapping assigning real numbers to the set of all
possible experimental outcomes – often equivalent to the experimental
outcome itself
• A probability distribution describes the probability of any outcome, or of
any particular value of the corresponding random variable, in an
experiment
• If we have two different experiments, the probability of any combination
of outcomes is the joint probability, and the joint probability
distribution describes the probabilities of observing any combination of
outcomes
• If the outcome of one experiment does not affect the probability
distribution of the other, we say that the outcomes are independent
• An event is a set of one or more possible outcomes
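The long-run-proportion definition above can be illustrated with a short simulation (a sketch in Python; the fair-coin setup and trial count are illustrative, not from the lecture):

```python
import random

random.seed(0)

# Probability as a long-run proportion: simulate N replicated
# coin-flip experiments and estimate P(heads) by its relative frequency.
N = 100_000
heads = sum(random.random() < 0.5 for _ in range(N))
p_hat = heads / N  # approaches the true probability 0.5 as N grows
```

As N increases, the estimate p_hat converges to the true probability, which is exactly the "infinite number of replicated experiments" idea.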
1-17-06
1
Back to basics – Probability, Conditional Probability and
Independence
• Let N be the very large number of trials of an experiment, and n_i the number of times
that the ith outcome (o_i), out of the possibly infinitely many possible outcomes, has been observed
• p_i = n_i/N is the probability of the ith outcome
• Properties of probabilities following from this definition:
1) p_i ≥ 0
2) p_i ≤ 1
3) Σ_i p_i = Σ_i n_i/N = N/N = 1
4) For any set of mutually exclusive events (events that don't have any outcomes
in common) e_1 = {o_11, o_21, o_31, ...}, e_2 = {o_12, o_22, o_32, ...}, ...
p(∪_i e_i) = p({o_11, o_21, o_31, ..., o_12, o_22, o_32, ...}) = Σ_i p(e_i)
5) p(NOT e) = 1 − p(e) for any event e
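The five properties can be checked numerically from a table of outcome counts (a minimal Python sketch; the counts are made up for illustration):

```python
# Hypothetical outcome counts n_i from N = 1000 trials; p_i = n_i / N
counts = {"o1": 410, "o2": 350, "o3": 240}
N = sum(counts.values())
probs = {o: n / N for o, n in counts.items()}

# Properties 1-3: every p_i lies in [0, 1] and the p_i sum to 1
assert all(0 <= p <= 1 for p in probs.values())
assert abs(sum(probs.values()) - 1.0) < 1e-12

# Property 4: mutually exclusive events add; let e = {o1, o2}
p_e = probs["o1"] + probs["o2"]

# Property 5: p(NOT e) = 1 - p(e); here NOT e = {o3}
assert abs(probs["o3"] - (1 - p_e)) < 1e-12
```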
Conditional Probabilities and Independence
• Suppose you have a set of N DNA sequences. Let the random variable
X denote the identity of the first nucleotide and the random variable Y
the identity of the second nucleotide.
P(X=x, Y=y) = n_xy/N,  x, y ∈ {A, C, G, T}
P(X=x) = n_x/N
P(Y=y) = n_y/N
• The probability that a randomly selected DNA sequence from this set has
the xy dinucleotide at the beginning is equal to P(X=x, Y=y)
• Suppose now that you have randomly selected a DNA sequence from
this set and looked at the first nucleotide but not the second. Question:
what is the probability of a particular second nucleotide y given that
you know that the first nucleotide is x*?
P(Y=y | X=x*) = n_x*y / n_x* = (n_x*y / N) / (n_x* / N) = P(X=x*, Y=y) / P(X=x*)
• P(Y=y|X=x*) is the conditional probability of Y=y given that X=x*
• X and Y are independent if P(Y=y|X=x)=P(Y=y)
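The dinucleotide calculation can be written directly from the counts n_xy and n_x (a Python sketch; the six example sequences are invented):

```python
from collections import Counter

# A toy stand-in for the set of N DNA sequences
seqs = ["ATGC", "ACGT", "AAGT", "CTGA", "ATTA", "AGCC"]

pair_counts = Counter(s[:2] for s in seqs)   # n_xy: first two nucleotides
first_counts = Counter(s[0] for s in seqs)   # n_x: first nucleotide

def cond_prob(y, x):
    """P(Y=y | X=x) = n_xy / n_x, estimated from the counts."""
    return pair_counts[x + y] / first_counts[x]

p = cond_prob("T", "A")  # P(second nucleotide is T | first is A)
```

Here two of the five sequences starting with A have T in second position, so the conditional probability is 2/5.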
Conditional Probabilities – Another Example
• Measuring differences between expression levels under two different
experimental conditions for two genes (1 and 2) in many replicated
experiments
• The outcomes of each experiment are:
• X=1 if the difference for gene 1 is greater than 2, and 0 otherwise
• Y=1 if the difference for gene 2 is greater than 2, and 0 otherwise
• The joint probability of the differences for both genes being greater than 2
in any single experiment is P(X=1, Y=1)
P(X=1) = n_1./N,  P(X=1, Y=1) = n_11/N,  P(Y=1) = n_.1/N
where n_1. = Σ_y n_1y is the number of experiments with X=1, and n_.1 = Σ_x n_x1 the number with Y=1
• Suppose now that in one experiment we look at gene 1 and know that
X=0. Question: what is the probability of Y=1 knowing that X=0?
• P(Y=1|X=0) is the conditional probability of Y=1 given that X=0
P(Y=1 | X=0) = n_01 / n_0. = (n_01/N) / (n_0./N) = P(X=0, Y=1) / P(X=0)
• X and Y are independent if P(Y=y|X=x)=P(Y=y) for any x and y
Conditional Probabilities and Independence
• If X and Y are independent, then from p(Y | X) = p(X, Y) / p(X) and p(Y | X) = p(Y) it follows that
p(X, Y) = p(X and Y) = p(X) p(Y)
• The probability of two independent events occurring together is equal to the product of their
probabilities
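The product rule can be verified mechanically: build a joint distribution from two marginals and confirm that conditioning recovers the marginal (Python sketch with made-up marginals):

```python
# Marginals for two binary random variables, independent by construction
px = {0: 0.3, 1: 0.7}
py = {0: 0.6, 1: 0.4}
joint = {(x, y): px[x] * py[y] for x in px for y in py}

# p(Y=y | X=x) = p(X=x, Y=y) / p(X=x) equals p(Y=y) for every x, y,
# which is exactly the definition of independence above
for x in px:
    for y in py:
        assert abs(joint[(x, y)] / px[x] - py[y]) < 1e-12
```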
Identifying Differentially Expressed Genes
• Suppose we have T genes which we measured under two experimental conditions (Ctl and
Nic) in n replicated experiments
• t_i* and p_i are the t-statistic and the corresponding p-value for the ith gene, i=1,...,T
• The p-value is the probability of observing as extreme or more extreme a value of the t-statistic
under the "null-distribution" (i.e. the distribution assuming that μ_iCtl = μ_iNic) than the one
calculated from the data (t_i*)
• The ith gene is "differentially expressed" if we can reject the ith null hypothesis μ_iCtl = μ_iNic
and conclude that μ_iCtl ≠ μ_iNic at a significance level α (i.e. if p_i < α)
• A Type I error is committed when a null hypothesis is falsely rejected
• A Type II error is committed when a null hypothesis is not rejected but it is false
• An Experiment-wise Type I Error is committed if any of a set of T null hypotheses is falsely
rejected
• If the significance level α is chosen prior to conducting the experiment, we know that by
following the hypothesis-testing procedure, the probability of falsely concluding that any
one gene is differentially expressed (i.e. of falsely rejecting its null hypothesis) is equal to α
• What is the probability of committing a Family-wise Type I Error?
• Assuming that all null hypotheses are true, what is the probability that we would reject at
least one of them?
Experiment-wise error rate
Assuming that the individual tests of hypothesis are independent and that all null hypotheses are true:
p(Not Committing The Experiment-Wise Error) =
p(Not Rejecting H01 AND Not Rejecting H02 AND ... AND Not Rejecting H0T) =
(1−α)(1−α)...(1−α) = (1−α)^T
p(Committing The Experiment-Wise Error) = 1 − (1−α)^T
[Figure: Family-Wise Type I Error Rate (0 to 1) vs Number of Hypotheses (0 to 15000), for α = 0.05, 0.01, 0.001 and 0.0001]
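The formula 1 − (1−α)^T is easy to evaluate directly; the sketch below (Python, with illustrative T values) shows how quickly the family-wise error rate saturates:

```python
# FWER for T independent tests when every null hypothesis is true
def fwer(alpha, T):
    return 1 - (1 - alpha) ** T

low = fwer(0.05, 1)       # one test: just alpha
mid = fwer(0.05, 100)     # already near-certain to reject something
high = fwer(0.05, 5000)   # essentially 1 at genome scale
```

This is exactly why the curves in the figure climb to 1 long before T reaches microarray-sized numbers of hypotheses.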
Experiment-wise error rate
If we want to keep the FWER at level α:
Sidak's adjustment: α_a = 1 − (1−α)^(1/T)
FWER = 1 − (1−α_a)^T = 1 − (1 − [1 − (1−α)^(1/T)])^T = 1 − ((1−α)^(1/T))^T = 1 − (1−α) = α
For FWER = 0.05, α_a = 0.000003
[Figure: Family-Wise Type I Error Rate (0 to 1) vs Number of Hypotheses (0 to 15000), for α = 0.05, 0.01, 0.001, 0.0001 and 0.000003]
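Sidak's adjustment can be checked by substituting α_a back into the FWER formula (Python sketch; T = 5000 is an illustrative choice, matching the example used later in the lecture):

```python
# Sidak-adjusted per-test level: alpha_a = 1 - (1 - alpha)^(1/T)
def sidak_alpha(alpha, T):
    return 1 - (1 - alpha) ** (1 / T)

alpha, T = 0.05, 5000
a = sidak_alpha(alpha, T)

# Substituting alpha_a back in recovers the target FWER exactly,
# mirroring the algebra on the slide
recovered_fwer = 1 - (1 - a) ** T
```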
Experiment-wise error rate
Another adjustment:
p(Committing The Experiment-Wise Error) =
p(Rejecting H01 OR Rejecting H02 OR ... OR Rejecting H0T) ≤ Tα
(Homework: how does that follow from the probability properties?)
Bonferroni adjustment: α_b = α/T
• Generally α_b < α_a, so the Bonferroni adjustment is more conservative
• Sidak's adjustment assumes independence – likely not to be satisfied
• If tests are not independent, Sidak's adjustment is most likely
conservative, but it could be liberal
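A quick numerical comparison of the two cutoffs (Python sketch, same illustrative α and T as above):

```python
# Bonferroni vs Sidak per-test significance levels
alpha, T = 0.05, 5000
alpha_b = alpha / T                     # Bonferroni
alpha_a = 1 - (1 - alpha) ** (1 / T)    # Sidak

# Bonferroni gives the smaller (more conservative) cutoff
assert alpha_b < alpha_a
```

The two levels are very close in practice; Bonferroni's appeal is that its validity does not rest on independence.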
Adjusting p-value
Individual hypotheses:
H0i: μ_iW = μ_iC
p_i = p(t_{n-1} > t_i*), i=1,...,T
"Composite" hypothesis:
H0: {μ_iW = μ_iC, i=1,...,T}
p = min{p_i, i=1,...,T}
• The composite null hypothesis is rejected if even a single individual hypothesis
is rejected
• Consequently the p-value for the composite hypothesis is equal to the minimum
of the individual p-values
• If all tests have the same reference distribution, this is equivalent to
p = p(t_{n-1} > t*_max)
• We can consider a p-value to be itself the outcome of the experiment
• What is the "null" probability distribution of the p-value for individual tests of
hypothesis?
• What is the "null" probability distribution for the composite p-value?
Null distribution of the p-value
Given that the null hypothesis is true, the probability of observing a
p-value smaller than a fixed number a between 0 and 1 is
p(p_i < a) = p(|t*| > t_a) = a,
i.e. under the null hypothesis the p-value is uniformly distributed on (0, 1).
[Figure: left – the null distribution of the t-statistic t*, with the rejection regions beyond −t_a and t_a marked; right – the null distribution of p_i, which is flat (uniform) over the interval (0, 1).]
Null distribution of the composite p-value
p(p < a) = p(min{p_i, i=1,...,T} < a) =
= 1 − p(min{p_i, i=1,...,T} > a) =
= 1 − p(p_1 > a AND p_2 > a AND ... AND p_T > a) =
(assuming independence between the different tests)
= 1 − [p(p_1 > a) p(p_2 > a) ... p(p_T > a)] =
= 1 − [1 − p(p_1 < a)][1 − p(p_2 < a)] ... [1 − p(p_T < a)] =
= 1 − [1−a]^T
Instead of adjusting the significance level, we can adjust all p-values:
p_i^a = 1 − [1 − p_i]^T
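The 1 − (1−a)^T result can be checked by simulation: under the composite null every p_i is Uniform(0, 1), so draw T uniforms, take the minimum, and compare the empirical frequency with the formula (Python sketch; T, a and the replicate count are arbitrary):

```python
import random

random.seed(1)

# Distribution of the minimum of T independent Uniform(0,1) p-values
T, reps, a = 10, 20_000, 0.1
hits = sum(
    min(random.random() for _ in range(T)) < a
    for _ in range(reps)
)
empirical = hits / reps
theoretical = 1 - (1 - a) ** T  # = p(min p_i < a), as derived above
```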
Null distribution of the composite p-value
[Figure: the null density f(P-value) of the composite p-value, shown for 1, 10 and 30000 tests.]
Seems simple
• Applying a conservative p-value adjustment will take care of
false positives
• How about false negatives?
• A Type II Error arises when we fail to reject H0 although it is
false
Power = p(Rejecting H0 when μ_W − μ_C ≠ 0)
= p(t* > t_α | μ_W − μ_C ≠ 0) = p(p < α | μ_W − μ_C ≠ 0)
• Depends on various things (α, df, σ, μ_W − μ_C)
• The probability distribution of t* is then non-central t
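Power under the non-central t alternative can be estimated by simulation (a stdlib-only Python sketch of a two-sided one-sample t-test; α, n, the effect size and σ here are illustrative choices, not the lecture's numbers):

```python
import math
import random

random.seed(2)

# Monte Carlo power of a two-sided one-sample t-test
alpha, n = 0.05, 10
effect, sigma = 1.0, 1.5     # true mean difference and noise sd
t_crit = 2.262               # two-sided 5% critical value of t with df = 9

reps, rejections = 5000, 0
for _ in range(reps):
    xs = [random.gauss(effect, sigma) for _ in range(n)]
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    t = m / (s / math.sqrt(n))   # t* follows a non-central t here
    rejections += abs(t) > t_crit
power = rejections / reps
```

Tightening α to a multiple-comparison-adjusted level shrinks this rejection frequency, which is the power loss the next slide illustrates.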
Effects of multiple comparison adjustments on power
http://homepages.uc.edu/%7Emedvedm/documents/Sample%20Size%20for%20arrays%20experiments.pdf
T = 5000, α = 0.05, α_a = 0.0001, μ_W − μ_C = 10, σ = 1.5
[Figure: null and non-central t densities with significance cutoffs –
n = 5: significance cutoff 27.6; t_4: green dashed line; t_{4,nc=6.1}: green solid line;
n = 10: significance cutoff 8.8; t_9: red dashed line; t_{9,nc=8.6}: red solid line]
This is not good enough
• Traditional statistical approaches to multiple comparison
adjustments which strictly control the experiment-wise error
rates are not optimal
• Need a balance between the false positive and false negative
rates
• Benjamini Y and Hochberg Y (1995) Controlling the False
Discovery Rate: a Practical and Powerful Approach to Multiple
Testing. Journal of the Royal Statistical Society B 57:289-300.
• Instead of controlling the probability of generating a single false
positive, we control the proportion of false positives
• A consequence is that some of the implicated genes are likely to be
false positives
False Discovery Rate
• FDR = E(V/R), where V is the number of falsely rejected null hypotheses
and R is the total number of rejected hypotheses
• If all null hypotheses are true (the composite null), this is equivalent
to the Family-wise error rate
False Discovery Rate
Alternatively, adjust the ordered p-values as
p_(i)^fdr = min{ (m/j) p_(j) : j ≥ i },  i = 1,...,m
where p_(1) ≤ ... ≤ p_(m) are the ordered p-values
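The step-up formula above is what R's p.adjust(..., method="fdr") computes; a Python sketch of the same adjustment (the four p-values are invented):

```python
# Benjamini-Hochberg adjusted p-values:
#   p_(i)^fdr = min over j >= i of (m / j) * p_(j), capped at 1
def fdr_adjust(pvals):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

adj = fdr_adjust([0.01, 0.02, 0.03, 0.5])
```

Walking from the largest p-value down and carrying the running minimum implements the min over j ≥ i directly.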
Effects
> FDRpvalue<-p.adjust(TPvalue,method="fdr")
> BONFpvalue<-p.adjust(TPvalue,method="bonferroni")
[Figure: three volcano-plot panels of −log10(p-value) vs Mean Difference (−4 to 4) – Unadjusted, FDR and Bonferroni adjusted p-values]