
Statistical significance for genomewide studies
John D. Storey and Robert Tibshirani
Saurabh Paliwal
Topics in Bioinformatics class presentation
11/14/06
Outline of presentation

- (Multiple) Hypothesis testing
  - Hypothesis testing: terminology
  - Type I and II errors
  - Problem of multiple hypothesis testing
  - Choice of threshold
  - Motivating biological examples
- Methodology of paper:
  - False discovery rates, q values
  - Results
  - Discussion
Hypothesis testing

Usual steps in statistical hypothesis testing:

1. Formulate the null hypothesis (H0) and the alternative hypothesis (H1):
   - Null hypothesis: the statistical hypothesis that is tested for possible rejection under the assumption that it is true.
   - Alternative hypothesis: the observations are the result of a real effect (i.e. the data are distributed differently from the distribution under the null hypothesis).
2. Identify a test statistic that can be used to assess the truth of the null hypothesis.
3. Compute the p-value: the probability that a test statistic at least as extreme as the one observed would be obtained, assuming that the null hypothesis is true.
4. Compare the p-value to an acceptable significance level α. If p ≤ α, then the observed effect is called statistically significant and the null hypothesis is rejected in favor of the alternative hypothesis.
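
As a toy numeric walk-through of these four steps, here is a minimal one-sample z test in Python (the sample numbers are invented for the example; only the standard library is used):

```python
# Toy illustration of the four hypothesis-testing steps via a
# one-sample z test: H0: mu = 100 vs H1: mu != 100.
from statistics import NormalDist

x_bar, mu0, sigma, n = 103.0, 100.0, 10.0, 25   # invented sample summary
z = (x_bar - mu0) / (sigma / n ** 0.5)          # step 2: test statistic
p = 2 * (1 - NormalDist().cdf(abs(z)))          # step 3: two-sided p-value
alpha = 0.05                                    # step 4: chosen in advance
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {p <= alpha}")
```
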
Type I and II errors, other terms

- Type I error / α error / false positive: the error of rejecting a null hypothesis (i.e. the test statistic is sufficiently extreme) when it is in fact true.
- Type II error / β error / false negative: the error of not rejecting the null hypothesis when the alternative hypothesis is true.
- Power of a test / sensitivity = 1 − β, i.e. the probability of not committing a Type II error.
- Specificity = 1 − α, i.e. the probability of not making a Type I error.
- Critical region / rejection region: the set of values of the test statistic (greater than the threshold) for which the null hypothesis is rejected.
Some common pitfalls

- If the test statistic is outside the critical region (less than the threshold), the only conclusion is that 'there is not enough evidence to reject the null hypothesis'. This is NOT the same as evidence in favor of the null hypothesis. In other words, failing to find evidence that there is a difference does not constitute evidence that there is no difference.
- The p-value is NOT the probability that a finding is "merely a fluke".
- The p-value is NOT the probability that the null hypothesis is true, and 1 − (p-value) is NOT the probability that the alternative hypothesis is true.
- The significance level of the test is NOT determined by the p-value; it is decided upon before any of the data are collected.
- A statistically significant result is not always of practical significance and does not necessarily demonstrate a large effect in the population. Given a sufficiently large sample, extremely small and unremarkable differences can be found to be statistically significant.
Some more terms…

- One-sided / one-tailed test: values for which we can reject the null hypothesis are located entirely in one tail of the probability distribution.
- Two-sided / two-tailed test: values for which we can reject the null hypothesis are located in both tails of the probability distribution.
Multiple hypothesis testing

- m hypothesis tests:
  - H1 = 0 vs H1 = 1
  - H2 = 0 vs H2 = 1
  - … Hm = 0 vs Hm = 1
- We want to make simultaneous inference.
- Each test has possible Type I and Type II errors, and there are many possible ways to combine them.
The multiple testing problem

- Assume all the test statistics are null. As the number of independent applications of the hypothesis testing criterion grows, it begins to outweigh the low per-test probability of a false positive: it becomes increasingly likely that one will observe data satisfying the rejection criterion by chance alone (even if the null hypothesis is true in all cases). Thus, there is an increase in the number of 'false positives'.
- Example of flipping coins: a coin is declared biased if, in 10 flips, it lands on heads at least 9 times.
  - If the null hypothesis is that the coin is fair, the probability that a fair coin would come up heads at least 9 times out of 10 is 11/2^10 ≈ 0.0107: pretty unlikely!
  - Multiple comparison: we want to test the fairness of many coins, say 100, by this method (flip each 10 times). The probability that all 100 fair coins are identified as fair is (1 − 0.0107)^100 ≈ 0.34, so it is pretty likely (probability ≈ 0.66) that some will be identified as biased! (Note that the probability of seeing a pre-selected coin do this is still 0.0107.)
- If n independent comparisons are performed, the probability that one or more false positives will be detected is 1 − (1 − α)^n. It increases as the number of comparisons increases (in fact, it goes to 1 eventually).
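
A quick Python check of the coin arithmetic above and of how fast the chance of at least one false positive grows with the number of tests (standard library only):

```python
# Verify the coin example and the growth of the familywise error rate
# with the number of independent tests.
from math import comb

# P(>= 9 heads in 10 flips of a fair coin) = (C(10,9) + C(10,10)) / 2^10
p_single = (comb(10, 9) + comb(10, 10)) / 2**10
print(f"p-value for one coin: {p_single:.4f}")            # 0.0107

# P(at least one of n fair coins flagged as biased) = 1 - (1 - p)^n
for n in (1, 10, 100, 1000):
    p_any = 1 - (1 - p_single) ** n
    print(f"n = {n:4d}: P(>= 1 false positive) = {p_any:.3f}")
```
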
Outcomes of testing m hypotheses:

                     Called significant   Called not significant   Total
  Null true                  F                   m0 − F             m0
  Alternative true           T                   m1 − T             m1
  Total                      S                   m − S              m

- All quantities except m and S are unobservable.
- m0 = number of truly null features, m1 = number of truly alternative features.
- Regardless of whether the p-value threshold is fixed or data-dependent, the quantities F, T and S are random variables. Hence, it is common statistical practice to write the overall error in terms of an expected value, E[·].
- Specificity = (m0 − F) / m0
- Sensitivity = T / m1
How does one choose a threshold?

- Control the per-comparison Type I error rate (PCER)
  - i.e. "uncorrected testing": too many false positives.
  - Gives P(Fi = 1) ≤ α marginally for all 1 ≤ i ≤ m.
- Control the familywise Type I error rate (FWER)
  - E.g. the Bonferroni correction: one can guarantee that P(F ≥ 1) ≤ α by requiring individual test p-values ≤ α/m.
  - This follows from Boole's inequality: P(F ≥ 1) = P(∪i {Fi = 1}) ≤ Σi P(Fi = 1) ≤ m · (α/m) = α.
  - May be appropriate if very few features are expected to be truly alternative (e.g. linkage analysis).
  - However, it is typically much too conservative for a number of applications, e.g. genomewide studies of differentially expressed genes, fMRI studies, etc.
- Control the false discovery rate (FDR)
  - Guarantees that FDR = E[F/(F + T)] = E[F/S] ≤ α.
  - A sensible balance between the number of false positive features, F, and the number of true positive features, T.
  - More later…
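
To make the three strategies concrete, here is an illustrative sketch on simulated p values. The FDR controller shown is the classical Benjamini-Hochberg step-up procedure (the q-value approach of this paper comes later); the function names and toy data are my own:

```python
# Compare uncorrected (PCER), Bonferroni (FWER), and Benjamini-Hochberg
# (FDR) thresholding on a simulated mixture of null and alternative p values.
import numpy as np

def uncorrected(pvals, alpha=0.05):
    return pvals <= alpha

def bonferroni(pvals, alpha=0.05):
    # FWER control: P(F >= 1) <= alpha via Boole's inequality
    return pvals <= alpha / len(pvals)

def benjamini_hochberg(pvals, alpha=0.05):
    # FDR control: find the largest k with p_(k) <= (k/m) * alpha,
    # then reject all hypotheses with p value <= p_(k).
    m = len(pvals)
    order = np.argsort(pvals)
    below = np.nonzero(pvals[order] <= alpha * np.arange(1, m + 1) / m)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size:
        reject[order[: below[-1] + 1]] = True
    return reject

rng = np.random.default_rng(0)
# 900 null features (uniform p values) + 100 alternatives (p values near 0)
pvals = np.concatenate([rng.uniform(size=900), rng.beta(0.1, 10, size=100)])
for name, fn in [("uncorrected", uncorrected),
                 ("Bonferroni", bonferroni),
                 ("BH (FDR)", benjamini_hochberg)]:
    print(f"{name:12s}: {fn(pvals).sum()} features called significant")
```
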
Biomedical significance of the multiple testing problem

- Example 1: Detecting differentially expressed genes (Hedenfalk et al., N. Engl. J. Med., 2001)
  - Detect differential expression of genes (the features in this case) between BRCA1- and BRCA2-mutation-positive tumors.
  - Computed a modified F statistic, which was used to assign a p value to each gene.
  - A p-value cutoff of 0.001 was selected to find 51 differentially expressed (DE) genes out of 3226 (~3 false positives expected).
  - A more conservative threshold of 0.0001 yielded 9-11 DE genes.
- Example 2: Identifying exonic splicing enhancers (Fairbrother et al., Science, 2002)
  - Exonic splicing enhancers are short oligonucleotide sequences that enhance pre-mRNA splicing when present in exons.
  - They analyzed human genomic DNA to predict exonic splicing enhancers based on the statistical analysis of exon-intron and splice-site composition.
  - A p value was associated with each of the 4096 possible hexamer sequences. A cutoff of 10^-4 results in an expected value of < 1 false positive.
  - 238 significant hexamers were subsequently biologically verified.
- Example 3: Genetic dissection of transcriptional regulation (Brem et al., Science, 2002)
  - Statistically significant linkage between a gene's expression level and a marker indicates that a regulator for that gene is located in the region of the marker.
  - Each of 6215 genes was tested for linkage to at least one locus, resulting in 6215 p values.
  - A p-value cutoff of 8.5 × 10^-3 was used: 507 genes showed linkage to at least one locus, where ~53 are expected by chance.
  - A p-value cutoff of 1.6 × 10^-4 yielded 205 genes (where ~1 is expected by chance).
- Example 4: Finding binding sites of transcriptional regulators (Lee et al., Science, 2002)
  - Transcriptional regulatory proteins bind to specific promoter sequences to participate in the regulation of gene expression.
  - Binding of 106 transcription factors was analyzed across the genome. At each genomic location, a p value was calculated under the null hypothesis that no binding occurs, resulting in thousands of p values.
  - At a p-value cutoff of 0.001, they estimate 3985 interactions, of which ~6-10% are false positives.
fMRI applications

- Compare two sets of conditions and, using statistical methods, analyze the difference in 'brain activity' in particular parts of the brain.
- ~100,000 voxels in an fMRI scan.
- Low signal-to-noise ratio in the images.
- The high number of features leads to an inflated number of false positives, and great difficulty in recognizing the actual area of interest.
- The Bonferroni correction is far too conservative.
p values and q values

- The p value is a measure of significance in terms of the false positive rate, i.e. the rate at which truly null features are called significant.
- The q value is a measure in terms of the false discovery rate, i.e. the rate at which significant features are truly null.
- Note the difference between FPR and FDR.
- Hence, a p-value cutoff says little about the composition of the set of features actually called significant.
(Positive) False Discovery Rate (pFDR)

- False discovery rate: there is the possibility that S = 0, in which case F/S is undefined. Hence, define the positive false discovery rate. Three possible formulations are discussed in Benjamini and Hochberg, 1995 (their R is the same as S, and their V is the same as F):
  - E[F/S | S > 0] · Pr(S > 0), commonly referred to as the FDR
  - E[F/S | S > 0], commonly referred to as the pFDR
  - E[F] / E[S]
- For the purposes of the paper, first concentrate on the assumption S > 0 (the S = 0 case will be discussed at the end).
- In terms of specificity and sensitivity, since F = m0 · (1 − specificity) and T = m1 · sensitivity, one can write the FDR as:
  FDR = E[F / (F + T)] = E[ m0(1 − specificity) / (m0(1 − specificity) + m1 · sensitivity) ]
- The FDR is a measure of the overall accuracy of the set of features declared to be significant.
- The q value is a measure that reflects the significance that can be attached to each individual feature.
- The q value of a certain feature can be described as the expected proportion of false positives among all features as or more extreme than the observed one. (Compare the p-value definition: the probability of a null feature being as or more extreme than the observed one.)
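
A toy simulation makes the FPR/FDR distinction concrete (all counts below are invented for illustration):

```python
# Contrast the false positive rate (what a p-value cutoff controls)
# with the false discovery rate (what a q-value cutoff controls).
import numpy as np

rng = np.random.default_rng(4)
m0, m1 = 2900, 270                       # truly null / truly alternative
p_null = rng.uniform(size=m0)            # null p values are uniform
p_alt = rng.beta(0.1, 10, size=m1)       # alternative p values pile up near 0

t = 0.01                                 # p-value threshold
F = np.count_nonzero(p_null <= t)        # false positives
T = np.count_nonzero(p_alt <= t)         # true positives

print(f"FPR = F/m0    = {F / m0:.3f}")       # close to t by construction
print(f"FDR = F/(F+T) = {F / (F + T):.3f}")  # can be much larger than t
```
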
Methodology

1. The authors first calculated a p value for each of the 3170 genes of Example 1 (using a two-sample t statistic*).
2. Plotted a density histogram of the 3170 p values. (Aside: if all the genes were null, i.e. not differentially expressed, the density histogram would be flat, since null p values are uniformly distributed on [0, 1].)
3. Order the p values in increasing order of magnitude: p_(1) ≤ p_(2) ≤ … ≤ p_(m). For some threshold t, where 0 < t < 1, all the features with p values less than t are called significant.
4. Estimate FDR(t) = E[F(t)/S(t)]. Since m is very large, this can be approximated as E[F(t)]/E[S(t)] (proved later on).
5. A simple estimate of E[S(t)] is the observed S(t), i.e. the number of observed p values ≤ t.
6. The probability that a null p value is ≤ t is simply t. Hence, E[F(t)] = m0 · t.
7. However, since m0 is unknown, it has to be estimated. Define the proportion of features that are truly null: π0 = m0/m. In principle, one would need to specify the distribution of the truly alternative p values to estimate π0; however, since the null p values are uniformly distributed, we can get an estimate.
8. Estimate π0 in terms of a tuning parameter λ:
   π0_hat(λ) = #{p_i > λ} / (m(1 − λ))
   The rationale behind this estimate: p values of truly alternative features will be close to 0, while p values of null features will be uniformly distributed over [0, 1]. 'Most' of the p values near 1 will be null, so the region of the histogram with p > λ (for example λ = 0.5) contains mostly null p values and yields a conservative estimate of the overall proportion of null p values.
   An unbiased estimate of π0 would be #{null p_i > λ} / (m(1 − λ)), if we could count only the null p values; the presence of a few alternative p values above λ only makes the estimate conservative.
   Bias-variance tradeoff: as λ gets closer to 1, the bias decreases (because fewer and fewer alternative p values will be found there), but the variance of the estimate increases (because fewer points are being used to estimate π0), and vice versa. A plot of π0_hat(λ) against λ shows high variance near λ = 1, which makes it necessary to estimate π0 by fitting a cubic spline to π0_hat(λ) and evaluating it at λ = 1 (implemented in the sketch under 'The algorithm, in brief' below).
9. Use the above estimate of π0 to estimate FDR(t) as:
   FDR_hat(t) = π0_hat · m · t / #{p_i ≤ t}
10. Estimate the q value of feature i as:
    q_hat(p_(i)) = min over t ≥ p_(i) of FDR_hat(t)

- Thus, the mathematical definition of the q value is the minimum FDR that can be attained when calling that feature significant.
- The above method ensures that the estimated q values increase in the same order as the p values.
- Suppose that each feature's statistic probabilistically follows a random mixture of a null distribution and an alternative distribution. Then the pFDR can be written as Pr(feature i is truly null | feature i is significant), which is what the q value estimates.
- The false positive rate is Pr(feature i is significant | feature i is truly null), which is what the p value estimates.
- The p value can be thought of as the minimum possible false positive rate when calling the feature significant.
The algorithm, in brief
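
This slide summarized the procedure in a boxed figure; as a stand-in, here is a minimal Python sketch of steps 3-10 above, assuming independent p values. The spline-based π0 estimate follows the recipe described earlier; the function names and toy data are my own:

```python
# Minimal sketch of the q-value algorithm (steps 3-10 above).
import numpy as np
from scipy.interpolate import UnivariateSpline

def estimate_pi0(p, lambdas=np.arange(0.05, 0.96, 0.05)):
    m = p.size
    # step 8: pi0_hat(lambda) = #{p_i > lambda} / (m (1 - lambda))
    pi0_raw = np.array([np.count_nonzero(p > lam) / (m * (1 - lam))
                        for lam in lambdas])
    # smooth the high-variance estimates near lambda = 1 with a cubic spline
    spline = UnivariateSpline(lambdas, pi0_raw, k=3)
    return float(np.clip(spline(lambdas[-1]), 0.0, 1.0))

def qvalues(p):
    p = np.asarray(p, dtype=float)
    m = p.size
    pi0 = estimate_pi0(p)

    # step 9: FDR_hat(t) = pi0 * m * t / #{p_i <= t}, evaluated at each
    # observed p value (the k-th smallest p value has #{p_i <= t} = k)
    order = np.argsort(p)
    fdr = pi0 * m * p[order] / np.arange(1, m + 1)

    # step 10: q_hat(p_(i)) = min over t >= p_(i) of FDR_hat(t); a running
    # minimum from the largest p value downward also makes the q values
    # increase in the same order as the p values
    q_sorted = np.minimum.accumulate(fdr[::-1])[::-1]

    q = np.empty(m)
    q[order] = q_sorted
    return q, pi0

# toy usage: a mixture of null (uniform) and alternative p values
rng = np.random.default_rng(1)
p = np.concatenate([rng.uniform(size=2900), rng.beta(0.1, 10, size=270)])
q, pi0 = qvalues(p)
print(f"pi0_hat = {pi0:.2f}; features with q <= 0.05: {(q <= 0.05).sum()}")
```
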
Results

- The initial estimate (based on a tuning parameter of λ = 0.5) is that 33% of the examined genes were differentially expressed in the study of Example 1 (Hedenfalk et al.).
- Thresholding genes with q values ≤ 0.05 yields 160 genes significant for differential expression (~8 genes are expected to be false positives). 117 of these 160 were found to be overexpressed in BRCA1-mutation-positive tumors.
- Since all q values can be considered simultaneously, several plots can be used to calibrate the q-value cutoff that should be applied in a study (curves of the form shown in panels (b), (c) and (d) of the accompanying figure).
Further interpretation of results

- Consider a gene, say MSH2, with a p value of 5.05 × 10^-5 and a q value of 0.013. The former implies that the probability that a null (non-differentially expressed) gene would be as or more extreme than MSH2 is 5.05 × 10^-5. The latter, on the other hand, says that ~0.013 of the genes that are as or more extreme than MSH2 are false positives.
- Note: the q value is not the probability that the feature (say MSH2) is a false positive.
- Intuitively, the probability that MSH2 is a false positive is higher than that of the other genes that are more significant than MSH2; such a feature-specific probability would be like a "local FDR".
- The q value takes multiple features into account simultaneously (every feature as or more extreme will also be called significant), which is important when assigning multiple measures of statistical significance.
Analysis of the Hedenfalk et al. data

- The data consisted of 3226 genes on n1 = 7 BRCA1 arrays and n2 = 8 BRCA2 arrays. Genes with measurements several interquartile ranges away from the interquartile range of all of the data were disregarded, leaving 3170 genes.
- The expression value for array j and gene i is denoted by x_ij. The sample mean and variance for gene i over the BRCA2 arrays are given by
  x̄_i2 = (1/n2) · Σ_{j ∈ BRCA2} x_ij   and   s²_i2 = Σ_{j ∈ BRCA2} (x_ij − x̄_i2)² / (n2 − 1),
  and similarly x̄_i1 and s²_i1 over the BRCA1 arrays.
- The two-sample t statistic for gene i, allowing for different variances of the gene in the two tumor types, is given by
  t_i = (x̄_i2 − x̄_i1) / sqrt(s²_i1/n1 + s²_i2/n2)
- The null versions of t_1, …, t_3170 are calculated by a permutation method: the labels on the arrays are randomly scrambled and the t statistics recomputed for B = 100 permutations, giving null statistics t_k^(0b) for k = 1, …, 3170 and b = 1, …, B. The p value for gene i is then calculated as
  p_i = #{ |t_k^(0b)| ≥ |t_i| : k = 1, …, 3170; b = 1, …, B } / (3170 · B)
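
Here is a sketch of this permutation calculation on synthetic data. The formulas come from the slides above; the data, dimensions, and function names are made up to mirror the setup:

```python
# Two-sample t statistics with unequal variances; null statistics are
# generated by scrambling the BRCA1/BRCA2 labels B times, and p values
# are computed by pooling all null statistics across genes and permutations.
import numpy as np

def tstats(x, labels):
    """Unequal-variance two-sample t statistic per gene; x is genes x arrays."""
    g1, g2 = x[:, labels == 1], x[:, labels == 2]
    m1, m2 = g1.mean(axis=1), g2.mean(axis=1)
    v1, v2 = g1.var(axis=1, ddof=1), g2.var(axis=1, ddof=1)
    return (m2 - m1) / np.sqrt(v1 / g1.shape[1] + v2 / g2.shape[1])

def permutation_pvalues(x, labels, B=100, seed=0):
    rng = np.random.default_rng(seed)
    m = x.shape[0]
    t_obs = np.abs(tstats(x, labels))
    null_t = np.empty((B, m))
    for b in range(B):
        null_t[b] = tstats(x, rng.permutation(labels))
    # p_i = #{null |t| >= |t_i|, pooled over all genes and permutations} / (B*m)
    null_abs = np.sort(np.abs(null_t).ravel())
    exceed = null_abs.size - np.searchsorted(null_abs, t_obs, side="left")
    return exceed / (B * m)

# toy data: 3170 "genes", 7 BRCA1 arrays (label 1), 8 BRCA2 arrays (label 2)
rng = np.random.default_rng(2)
labels = np.array([1] * 7 + [2] * 8)
x = rng.normal(size=(3170, 15))
x[:100, labels == 2] += 1.5          # 100 differentially expressed genes
p = permutation_pvalues(x, labels)
print("p values <= 0.01:", (p <= 0.01).sum())
```
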
Theorem proof (that, for large m, E[F(t)/S(t)] can be approximated by E[F(t)]/E[S(t)], the approximation used in step 4)
Critical discussion

- The authors used a t statistic for the calculation of p values instead of the modified F statistic used in the Hedenfalk paper. They could additionally have used the same values as in that paper to make the comparison easier in terms of the number of genes detected as significant, etc.
- They do not show whether the additional genes found are significant for differentiating between BRCA1-positive and -negative samples, or BRCA2-positive and -negative samples.
- The assumption that the null p values are uniformly distributed is critical to the algorithm (as the authors themselves note). It would be interesting to see how they would handle a different null p-value distribution.
- The procedure controls the FDR asymptotically as m → ∞ (i.e. if all features with q ≤ α are called significant, then FDR ≤ α for large m). A recent paper by Benjamini, Krieger and Yekutieli (BKY, 2004) suggests that many cases of practical interest may not have such a large m, so the asymptotic guarantee may not be relevant there.
- The effect of dependence among the features has not been investigated enough, and it could potentially be very important: BKY (2004) find that the FDR can be almost double the nominal bound.
Parting thought…

"... surely, God loves the .06 nearly as much as the .05."
(Rosnow and Rosenthal, 1989)