ODP and SVA
European Institute of Statistical Genetics
Liege, Belgium
September 4, 2007
Greg Gibson
What’s the matter with t-tests?
1. SAM and ANOVA assume that all tests are independent, but they aren’t.
2. Some within-sample variances are underestimated, which artificially inflates test statistics; some are overestimated, which reduces power.
3. They fail to optimize the expected number of true positives (ETP) for a given FDR.
Optimal Discovery Procedure
Storey, Dai and Leek (2007) Biostatistics 8: 414-432
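• The ODP is defined as the testing procedure that maximizes the ETP (expected number of true positives) for each fixed EFP (expected number of false positives) level.
• A consequence of this optimality is that the rate of “missed discoveries” is minimized for each FDR level.
• Neyman–Pearson lemma: given a single set of observed data, the optimal single-testing procedure is based on the likelihood ratio statistic, that is, the alternative density divided by the null density at the observed data.
• The ODP is similar, but evaluates the data for a single feature at all true probability density functions: the sum of the true alternative densities divided by the sum of the true null densities.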
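The two statistics can be written out as follows. The notation is an assumption made for this note, since the slide's own equations did not survive: f denotes a null density and g an alternative density, tests 1, ..., m_0 are the true nulls, and tests m_0+1, ..., m are the true alternatives.

```latex
S_{\mathrm{NP}}(x) = \frac{g(x)}{f(x)}
\qquad\qquad
S_{\mathrm{ODP}}(x) = \frac{g_{m_0+1}(x) + g_{m_0+2}(x) + \cdots + g_m(x)}
                           {f_1(x) + f_2(x) + \cdots + f_{m_0}(x)}
```

When all true nulls share one common density, the denominator is proportional to that density, which matches the description in the Fig. 1 caption.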
ODP Principle
Fig. 1. Plots comparing the NP testing approach to the ODP testing approach through a simple example. (a) NP approach. The null (gray) and
alternative (black) probability density functions of a single test. For observed data x and y, the statistics are calculated by taking the ratio of the
alternative to the null densities at each respective point. In this NP approach, the test with data y is more significant than the test with data x.
(b) ODP approach. The common null density (gray) for true null tests and the alternative densities (black) for several true alternative tests. For
observed data x and y, the statistics are calculated by taking the ratio of the sum of alternative densities to the null density evaluated at each
respective point. In this ODP approach, the test with data x is now more significant than the test with data y because multiple alternative
densities have similar positive means even though each one is smaller than the single alternative density with negative mean.
ODP Performance: BRCA data
A comparison of the ODP approach to five leading methods for identifying differentially expressed genes (described in the text). The
number of genes found to be significant by each method over a range of estimated q-value cutoffs is shown. The methods involved in
the comparison are the proposed ODP, SAM, the traditional t-test/F-test, a shrunken t-test/F-test, a nonparametric empirical Bayes
"local FDR" method, and a model-based empirical Bayes method. A color version of the figure is given in the supplementary material
available at Biostatistics online, Figure 9. (a) Results for identifying differential expression between the BRCA1 and BRCA2 groups in
the Hedenfalk and others data. (b) Results for identifying differential expression between the BRCA1, BRCA2, and Sporadic groups in
the Hedenfalk and others data. The model-based empirical Bayes method has not been detailed for a three-sample analysis, so it is
omitted in this panel.
ODP Table 1
Values are the % increase in genes called significant by ODP, for the 2-sample and 3-sample analyses.

Thresholding method | 2-sample Min | 2-sample Median | 2-sample Max | 3-sample Min | 3-sample Median | 3-sample Max
SAM (Tusher et al, 2001) | 29 | 43 | 72 | 76 | 92 | 211
t/F-test (Dudoit et al 2002, Kerr et al, 2000) | 52 | 86 | 185 | 63 | 82 | 407
Shrunken t/F-test (Cui and others, 2005) | 34 | 52 | 77 | 61 | 69 | 154
Bayesian local FDR (Efron and others, 2001) | 58 | 87 | 117 | 76 | 92 | 211
Posterior probability (Lonnstedt & Speed 2002) | n/a | n/a | n/a | n/a | n/a | n/a
Table 1. Improvements of the ODP approach over existing thresholding methods. Shown are the minimum, median, and maximum percentage
increases in the number of genes called significant by the proposed ODP approach relative to the existing approaches among FDR levels 2%,
3%, ..., 10%. The exact same FDR methodology (Storey, 2002; Storey and Tibshirani, 2003) was applied to each gene-ranking method in
order to make the comparisons fair. The model-based Bayesian method (Lonnstedt and Speed, 2002) is not defined for a three-sample
analysis, so that case is omitted.
ODP algorithm
1. Estimate which null hypotheses are true from the distribution of P-values from Kruskal-Wallis (KW) rank tests for all genes.
2. Determine the maximum likelihood distributions for all genes according to standard methods.
3. Evaluate the ODP statistic for each gene.
4. Use bootstrap resampling to obtain null statistics.
5. Contrast observed and expected ODPs -> q values.
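Steps 2 and 3 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes normally distributed expression, centers each gene to remove its overall mean, and uses a crude mean-difference cutoff as a stand-in for the KW-based null estimate of step 1. The function names `log_sum_exp` and `estimated_log_odp` are made up for this example, and the bootstrap and q-value steps are not covered.

```python
import numpy as np

def log_sum_exp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    peak = a.max(axis=axis, keepdims=True)
    return (peak + np.log(np.exp(a - peak).sum(axis=axis, keepdims=True))).squeeze(axis)

def estimated_log_odp(X, groups, null_mask):
    """Log of a plug-in ODP statistic for every gene (row of X).

    X         : (genes, arrays) expression matrix
    groups    : (arrays,) integer group label per array
    null_mask : (genes,) True for genes provisionally treated as null
    """
    m, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)  # center each gene

    # Alternative fit (step 2): per-gene group means and ML variance.
    fitted = np.zeros_like(Xc)
    for g in np.unique(groups):
        cols = groups == g
        fitted[:, cols] = Xc[:, cols].mean(axis=1, keepdims=True)
    var_alt = ((Xc - fitted) ** 2).mean(axis=1)
    # Null fit: common mean (zero after centering) and ML variance.
    var_null = (Xc ** 2).mean(axis=1)

    # ll_alt[j, i]: log-likelihood of gene j's data under gene i's alternative fit.
    ss_alt = ((Xc[:, None, :] - fitted[None, :, :]) ** 2).sum(axis=2)
    ll_alt = -0.5 * n * np.log(2 * np.pi * var_alt)[None, :] - ss_alt / (2 * var_alt[None, :])
    # ll_null[j, i]: log-likelihood of gene j's data under gene i's null fit.
    ss_null = (Xc ** 2).sum(axis=1)[:, None]
    ll_null = -0.5 * n * np.log(2 * np.pi * var_null)[None, :] - ss_null / (2 * var_null[None, :])

    # Step 3: sum of alternative likelihoods over all genes, divided by the
    # sum of null likelihoods over the provisionally null genes (on log scale).
    return log_sum_exp(ll_alt, axis=1) - log_sum_exp(ll_null[:, null_mask], axis=1)

# Toy data (assumed sizes and effect): 200 genes, 5 + 5 arrays, first 20 genes shifted.
rng = np.random.default_rng(1)
groups = np.repeat([0, 1], 5)
X = rng.normal(size=(200, 10))
X[:20, groups == 1] += 2.0

# Crude stand-in for step 1: call the half of genes with the smallest
# absolute mean difference "null".
diff = np.abs(X[:, groups == 1].mean(axis=1) - X[:, groups == 0].mean(axis=1))
null_mask = diff < np.median(diff)

log_odp = estimated_log_odp(X, groups, null_mask)
```

Because every gene's data are scored against the fitted densities of all genes, genes that share a common pattern of differential expression reinforce one another, which is the borrowing of strength illustrated in Fig. 1(b).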
ODP Performance: simulated data
A comparison of the ODP approach to five
leading methods for identifying differentially
expressed genes (described in the text and
Figure 2) based on simulated data. The number
of genes found to be significant by each
method over a range of estimated q-value
cutoffs is shown for a single, representative
data set from each scenario. The proposed
ODP approach is in black and the other
methods are in gray. In general, the data sets
increase in complexity from panels (a) to (d).
(a) In this scenario, two groups are compared,
there is perfectly symmetric differential
expression, and the variances are simulated
from a unimodal, well-behaved distribution.
(b) Two groups are compared, there is
moderate asymmetry in the differential
expression, and the variances are simulated
from a bimodal distribution. (c) Three groups
are compared, there is slight asymmetry in
differential expression, and the variances are
simulated from a unimodal, well-behaved
distribution. (d) Three groups are compared,
there is moderate asymmetry in differential
expression, and the variances are simulated
from a bimodal distribution.
Surrogate Variable Analysis
Leek and Storey (2007) PLoS Genetics, In press
• In addition to the primary measured variables that are estimated as fixed or random effects in an analysis, there are usually also unmodeled factors that contribute to expression heterogeneity.
• For example, age, time-of-day, nutrition probably all impact an analysis without being directly studied, but they are more predictable than gene-specific noise.
• Sometimes the variable of interest may be confounded with the hidden factors (e.g. batch with population).
• In many situations, SVA can be used to improve power.
SVA Simulation
Simulated Example of Expression Heterogeneity
(A) A heatmap of a simulated microarray study consisting of 1,000 genes measured on 20 arrays. (B) Genes 1-300 in
this simulated study are differentially expressed between two hypothetical treatment groups; here the two groups are
shown as an indicator variable for each array. (C) Genes 201-500 in each simulated study are affected by an
independent factor that causes EH. This factor is distinct from, but possibly correlated with the group variable. Here
the factor is shown as a quantitative variable, but it could also be an indicator variable or some linear or nonlinear
function of the covariates.
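A data set with this layout can be generated in a few lines. This is only a sketch: the effect sizes, the noise model, and the random seed are assumptions made for illustration, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_arrays = 1000, 20

group = np.repeat([0, 1], n_arrays // 2)     # (B) treatment indicator per array
eh_factor = rng.normal(size=n_arrays)        # (C) quantitative hidden factor

expr = rng.normal(size=(n_genes, n_arrays))  # gene-specific noise
expr[:300] += 1.0 * group                    # genes 1-300: differential expression
expr[200:500] += 1.0 * eh_factor             # genes 201-500: expression heterogeneity

overlap = np.arange(200, 300)                # genes 201-300 carry both signals
```

The overlapping block of genes is where an unadjusted analysis is most at risk, because the group effect and the hidden factor are mixed in the same rows.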
SVA Table 1
The results of the significance analysis in the three real gene expression studies. The results of the genetics of gene expression study include the
number of significant cis-linkages before and after adjusting for surrogate variables. The disease class results report the number of genes
differentially expressed between BRCA1 and BRCA2 before and after adjusting for surrogate variables. For the timecourse study, the number of
genes differentially expressed with respect to age are shown for an unadjusted analysis, an analysis adjusted for tissue type, and an SVA adjusted
analysis. An SVA-adjusted analysis may result in an increase or decrease in the number of significant results depending on the direction and
degree to which the unmodeled factors (now captured by surrogate variables) were confounded with the primary variables.
SVA Performance
Impact of Expression Heterogeneity
One thousand gene expression data sets containing EH were simulated, tested, and ranked for differential expression as detailed in Simulated
Examples. (A) A boxplot of the standard deviation of the ranks of each gene for differential expression over repeated simulated studies.
Results are shown for analyses that ignore expression heterogeneity (Unadjusted), take expression heterogeneity into account by SVA
(Adjusted), and for simulated data unaffected by expression heterogeneity (Ideal). (B) For each simulated data set, a Kolmogorov-Smirnov test
was employed to assess whether the p-values of null genes followed the correct null Uniform distribution (Supplementary Text). A quantile-quantile plot of the one thousand Kolmogorov-Smirnov p-values is shown for the SVA adjusted analysis (solid line) and the unadjusted
analysis (dashed line). It can be seen that the SVA adjusted analysis provides correctly distributed null p-values, whereas the unadjusted
analysis does not due to EH. (C) A plot of expected true positives versus false discovery rate for the SVA adjusted (solid) and unadjusted
(dashed) analyses. The SVA adjusted analysis shows increased power to detect true differential expression.
SVA Procedure
1. Remove the signal due to the primary variable(s) of interest to obtain a residual expression matrix.
2. Apply a decomposition to the residual expression matrix to identify signatures of EH in terms of an orthogonal basis of singular vectors.
3. Use a statistical test to determine the singular vectors that represent significantly more variation than would be expected by chance.
4. Identify the subset of genes driving each orthogonal signature of EH.
5. For each subset of genes, build a surrogate variable based on the full EH signature of that subset in the original data.
6. Include all significant surrogate variables as covariates in subsequent regression analyses, allowing for gene-specific coefficients for each surrogate variable.
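A heavily simplified sketch of steps 1, 2 and 6 is given below. It is not the full SVA procedure: it skips the significance test of step 3 and the gene-subset construction of steps 4 and 5, and simply uses the leading right singular vector of the residual matrix as a surrogate variable. The function names and the simulated data (sizes, effect sizes, seed) are assumptions made for illustration.

```python
import numpy as np

def residual_matrix(X, design):
    """Step 1: remove the fitted primary-variable signal from each gene (row)."""
    beta, *_ = np.linalg.lstsq(design, X.T, rcond=None)  # (p, genes)
    return X - (design @ beta).T

def top_singular_vectors(R, k):
    """Step 2: leading right singular vectors of R, one value per array."""
    _, _, vt = np.linalg.svd(R, full_matrices=False)
    return vt[:k].T  # (arrays, k)

def fit_with_covariates(X, design, sv):
    """Step 6: refit every gene with the surrogate variables as extra covariates."""
    full = np.column_stack([design, sv])
    beta, *_ = np.linalg.lstsq(full, X.T, rcond=None)
    return beta  # (p + k, genes), gene-specific coefficients

# Simulated data: 1,000 genes, 20 arrays, one hidden factor on genes 201-500.
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 10)
hidden = rng.normal(size=20)
X = rng.normal(size=(1000, 20))
X[:300] += 1.0 * group
X[200:500] += 1.5 * hidden

design = np.column_stack([np.ones(20), group])  # intercept + primary variable
R = residual_matrix(X, design)
sv = top_singular_vectors(R, k=1)
beta = fit_with_covariates(X, design, sv)

# How well the surrogate variable tracks the (normally unobserved) hidden factor.
recovery = abs(np.corrcoef(sv[:, 0], hidden)[0, 1])
```

In this toy setting the hidden factor is known, so `recovery` can be inspected directly; in real data only the surrogate variable is available.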
SVA: Trans-eQTL detection
SVA Captures EH Due to Genotype
(A) A plot of significant linkage peaks (p-value < 1e-7) for expression QTL in the Brem et al. [10, 21] study by
marker location (x-axis) and expression trait location (y-axis). (B) Significant linkage peaks (p-value < 1e-7) after
adjusting for surrogate variables. Large trans-linkage peaks on Chromosomes II, III, VII, XII, XIV and XV have
been eliminated without reducing cis-linkage peaks.
SVA: Breast Cancer Study
Surrogate Variables from Human Studies
(A) A plot of the top surrogate variable estimated from the breast cancer data [22]. The BRCA1 group is relatively
homogeneous (triangles), but the BRCA2 group shows substantial heterogeneity (pluses). (B) A plot of tissue type
versus array for the Rodwell et al. [7] study (dotted line) and the top surrogate variable estimated from the
expression data when tissue was ignored (dashed line). There is strong correlation between the top surrogate
variable and the tissue type variable.
SVA: Moroccan study