Use of the Half-Normal Probability
Plot to Identify Significant Effects for
Microarray Data
C. F. Jeff Wu
University of Michigan
(joint work with G. Dyson)
1
Outline
• Current Methods
• Proposed Methodology
• Analysis Plan
• Example
• Conclusions
2
What are microarrays?
• Two major types
– Oligonucleotide gene chips
– Spotted glass arrays
• Perfect match (PM) and mismatch (MM)
probes are spotted onto a gene chip
– ~20 probes make up a probe set (or gene)
– MM probe for each gene has the middle base
set to the complement of its PM probe
– Hybridize labeled RNA corresponding to PM
probes
• Glass arrays involve the competitive
hybridization of two RNA pools to cDNA
spotted onto a glass slide
• Typically thousands of genes on a slide
3
Multiplicity Problem
• When we make more than one
comparison in a hypothesis testing
situation, the usual per-test p-value
interpretation no longer holds
• Control of the familywise error rate is necessary in
order to preserve the nominal type I error rate
• Various approaches correct the chance
of making a type I error for multiplicity,
including Tukey, Bonferroni and Holm
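As a small illustration of two of the corrections named above, the sketch below computes Bonferroni and Holm adjusted p-values. The function names and the example p-values are made up for illustration and are not taken from the talk.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply each by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def holm(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Visit the hypotheses from the smallest to the largest raw p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[i])
        # Adjusted p-values must not decrease along the ordered list.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]   # illustrative values only
print(bonferroni(raw))
print(holm(raw))
```

Holm never gives a larger adjusted p-value than Bonferroni, which is why it is usually preferred when only familywise control is needed.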
4
Microarray Analysis
Techniques
• Westfall Young step down (WY)
• Significance Analysis of Microarrays
(SAM)
• Empirical Bayes (EB)
• Bayesian (MCMC)
• Mixture Modeling
• Dimension reduction techniques
• Machine learning
5
Westfall Young (WY)
• Compute ranks $r_j$ of the original test statistics
such that
• Construct balanced permutations of the
samples, computing the same test
statistics as above, $t_1^{(b)}, \ldots, t_k^{(b)}$, for each permutation b
• Compute
and
• Repeat B times and calculate the adjusted
p-value as
• Less conservative than Bonferroni
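For orientation, here is a minimal sketch of the widely used step-down maxT form of the Westfall-Young procedure. It is an assumption that this is the variant meant on the slide. The function name and the input layout (one list of statistics per permutation) are illustrative choices.

```python
def westfall_young_maxt(observed, permuted):
    """Step-down maxT adjusted p-values (a sketch, not the talk's own code).

    observed: list of k test statistics from the real data.
    permuted: list of B lists, each holding the k statistics of one permutation.
    """
    k = len(observed)
    B = len(permuted)
    # Order hypotheses from the largest to the smallest |t| on the real data.
    order = sorted(range(k), key=lambda j: -abs(observed[j]))
    counts = [0] * k
    for stats in permuted:
        running_max = 0.0
        # Successive maxima, built from the least to the most significant gene.
        for pos in range(k - 1, -1, -1):
            j = order[pos]
            running_max = max(running_max, abs(stats[j]))
            if running_max >= abs(observed[j]):
                counts[pos] += 1
    adjusted = [0.0] * k
    previous = 0.0
    for pos in range(k):
        # Enforce monotonicity: adjusted p-values never decrease down the list.
        previous = max(previous, counts[pos] / B)
        adjusted[order[pos]] = previous
    return adjusted


# Tiny illustrative input: 3 genes, 4 permutations (numbers are made up).
observed = [3.0, 1.0, 2.0]
permuted = [[0.5, 0.2, 0.1], [0.3, 1.2, 0.4], [0.1, 1.5, 3.5], [0.2, 0.1, 0.3]]
print(westfall_young_maxt(observed, permuted))
```

Because each gene is compared with the maximum over the genes ranked at or below it, rather than with a fixed multiple of its raw p-value, the adjustment is typically less severe than Bonferroni.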
6
Significance Analysis of
Microarrays (SAM)
• Use a t-like statistic
• Use balanced permutation method from
previous slide to estimate null
distribution, assuming all effects are null
• Call genes that fall outside the Δ bars
significant
7
Half-Normal Analysis
8
Microarray Specific Problem
9
Analysis Plan
• Robust measures of location and scale
• Summary statistic
• Two half-normal plots (for upward-regulated and downward-regulated
genes)
• Segment determination
– Find $J_\alpha$, $J^{NC}$
– insignificant, borderline, significant
• Repeat the procedure, using $J^{NC}$ as base
10
Robust Measures of Location and
Scale
• Perform transformation and suitable
normalization
• Compute median and Median Absolute
Deviation (MAD) for each gene
– Reasonable estimates
– Less affected by outliers than mean and SD
– Interested in robustness rather than efficiency
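A minimal sketch of the two robust summaries for a single gene, using only the Python standard library. The `scale` argument is an added convenience (1.4826 makes the MAD comparable to a standard deviation under normality); whether the talk rescales the MAD is not stated.

```python
from statistics import median


def mad(values, scale=1.0):
    """Median absolute deviation about the median, optionally rescaled."""
    center = median(values)
    return scale * median([abs(v - center) for v in values])


# One gene with a single outlying replicate (illustrative numbers).
expression = [1, 2, 3, 4, 100]
print(median(expression), mad(expression))
```

The outlier moves neither the median nor the MAD here, whereas it would dominate the mean and the standard deviation.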
11
Summary Statistic
• Compute quasi two-sample t-statistic
using robust values from above:
• c is chosen to minimize
for the middle $100(1-2e)\%$ of the $ss_i$.
• Tusher et al. (2001) chose c to minimize
the coefficient of variation
• Efron et al. (2001) used the 90th
percentile of the gene standard error
estimates for c
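The exact form of the statistic is not reproduced in this transcript. The sketch below assumes a SAM-style form, with medians in place of means, MADs in place of standard deviations, and the constant c added to the denominator. The formula, the function names, and the data are all assumptions made for illustration.

```python
from math import sqrt
from statistics import median


def mad(values):
    """Median absolute deviation about the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


def quasi_t(group_a, group_b, c):
    """Assumed form: (median_A - median_B) / (c + robust standard error)."""
    spread = sqrt(mad(group_a) ** 2 / len(group_a)
                  + mad(group_b) ** 2 / len(group_b))
    return (median(group_a) - median(group_b)) / (c + spread)


a = [1, 2, 3, 4, 100]   # illustrative replicates, condition A
b = [0, 1, 2, 3, 4]     # illustrative replicates, condition B
print(quasi_t(a, b, c=0.0), quasi_t(a, b, c=1.0))
```

A larger c shrinks every statistic toward zero, with the strongest effect on genes whose spread estimate is small, which is the usual motivation for such a constant.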
12
Two Half-Normal Plots
• Construct two half-normal plots: one for
the p positive and one for the r negative $ss_i$.
• Run the procedure separately on each
set
• Denote the ordered p positive effects by
$abss_{(1)} \le \cdots \le abss_{(p)}$
• Plot $abss_{(i)}$ against half-normal distribution
quantiles, i.e. the points
$(\Phi^{-1}(0.5 + 0.5[i - 0.5]/p),\ abss_{(i)})$
• Goal: obtain set of noise effects
• Yield a baseline against which to test the
rest of the effects
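The plotting positions can be computed directly from the quantile formula on this slide. The sketch below returns the (quantile, ordered absolute effect) pairs. The function name is illustrative, and the plotting itself is left to whatever library is at hand.

```python
from statistics import NormalDist


def half_normal_points(effects):
    """Return (half-normal quantile, ordered |effect|) pairs for plotting."""
    ordered = sorted(abs(e) for e in effects)
    p = len(ordered)
    std_normal = NormalDist()
    return [(std_normal.inv_cdf(0.5 + 0.5 * (i - 0.5) / p), value)
            for i, value in enumerate(ordered, start=1)]


for q, value in half_normal_points([0.3, 1.2, 0.7, 2.5]):
    print(round(q, 3), value)
```

If all effects were pure noise, these points would fall close to a straight line through the origin; effects that rise above the line at the upper end are the candidates for significance.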
13
Segment Determination: $J_\alpha$
• Given $\alpha$, initialize null set as points $abss_{(1)}$
: $abss_{(k)}$
• Regress null set on 1:k half-normal
quantiles ($Q_1 : Q_k$)
• Produce predicted values $\hat{y}_h$ at the
remaining quantile values ($Q_h$: h > k)
• Compute predicted statistics
with
• Find
14
Segment Determination: $J_\alpha$ (cont)
• The initial null set of k genes becomes k
+ m ($= J_\alpha^k$) null genes
• Now re-do the segment determination
procedure, using the k + m genes as
base null set
• Continue until no new genes are added
• Do for each k less than p-1
• Store the end point $J_\alpha^k$
• Set the most frequent $J_\alpha^k$ to $J_\alpha$
15
Sample
• Let k = 200, total effects = 500
– First 200 ordered positive effects regressed on
first 200 half-normal quantiles
– Test ordered effects 201 to 500 using absolute
value of predicted statistics
– For example, effect 239 is the largest h whose
statistic is less than the t-critical value
– So $J_\alpha^{200}$ would initially be 239
• Redo the above, with k = 239 effects; so we
test effects 240 to 500
– Say effect 242 is the largest h whose statistic is
less than the t-critical value based on the new regression line
– So the new $J_\alpha^{200}$ would be 242
• Redo the above again with k = 242, test
effects 243 to 500
– No statistics are less than the t-critical value
• So $J_\alpha^{200}$ is 242
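The iteration walked through above can be sketched as follows. Several pieces are assumptions rather than the talk's definitions: the predicted statistic is taken to be the residual divided by the usual regression prediction standard error, the critical value is passed in by the caller (a normal quantile stands in for the t-critical value in the demo), and all function names and the simulated data are illustrative.

```python
import random
from collections import Counter
from math import sqrt
from statistics import NormalDist


def fit_line(x, y):
    """Least squares fit of y on x: intercept, slope, residual SD, mean x, Sxx."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, sqrt(rss / (n - 2)), x_bar, sxx


def extend_null_set(quantiles, abss, k, crit):
    """Fit on the first k points; return the largest h > k whose
    |statistic| is below crit, or k if there is none."""
    a, b, s, x_bar, sxx = fit_line(quantiles[:k], abss[:k])
    end = k
    for h in range(k, len(abss)):          # 0-based index h is effect h + 1
        se = s * sqrt(1 + 1 / k + (quantiles[h] - x_bar) ** 2 / sxx)
        if abs(abss[h] - (a + b * quantiles[h])) / se < crit:
            end = h + 1
    return end


def end_point(quantiles, abss, k, crit):
    """Repeat the pass until no new effects join the null set."""
    while True:
        new_k = extend_null_set(quantiles, abss, k, crit)
        if new_k == k:
            return k
        k = new_k


def most_frequent_end_point(quantiles, abss, crit, k_values):
    """Run end_point for each starting k and return the most common result."""
    ends = [end_point(quantiles, abss, k, crit) for k in k_values]
    return Counter(ends).most_common(1)[0][0]


# Simulated demo: 80 noise effects on the half-normal line, 20 shifted effects.
rng = random.Random(0)
p = 100
std_normal = NormalDist()
quantiles = [std_normal.inv_cdf(0.5 + 0.5 * (i - 0.5) / p) for i in range(1, p + 1)]
noise = sorted(abs(q + rng.gauss(0, 0.01)) for q in quantiles[:80])
abss = noise + [q + 5 for q in quantiles[80:]]
crit = std_normal.inv_cdf(0.975)           # normal stand-in for the t-critical value
j_end = end_point(quantiles, abss, 40, crit)
j_alpha = most_frequent_end_point(quantiles, abss, crit, range(30, 60))
print(j_end, j_alpha)
```

With a real t-critical value the cutoff would depend on the size of the current null set, so `crit` would be recomputed inside the loop; it is held fixed here to keep the sketch free of third-party dependencies.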
16
Example
$J_\alpha = 3116$
17
Find J NC
• Will test all effects after J  using same
statistics
• To adjust for multiple testing, define NC
as the number of consecutive significant
effects necessary to call all subsequent
effects significant
• Use the Bonferroni adjustment (does not
require independence):
• Instead of doing thousands of
comparisons, only need to do NC to
determine significance
• Define
• Now we have identified the change points
in the graph for segment detection
18
Example: Downward-regulated Speed Mouse Data
19
Example: Downward Regulated
Speed Mouse Data (cont)
J NC
J
20
Error Rate Estimation: FDR
• False Discovery Rate (FDR) is the
expected proportion of falsely rejected
hypotheses
• Permute the condition labels, maintaining
balance
– Example: 8 replicates in conditions A and B
– Each A’ and B’ will have 4 replicates from A
and 4 from B
– Compute the robust statistics, keeping the
same c from the actual data
• Determine the average number of effects
that fall above the positive or below the
negative boundary of the significant sets
• Divide that number by the total number of
effects called significant
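The balanced relabeling and the final ratio can be sketched as below. Sample indices 0..7 stand for condition A and 8..15 for condition B, as in the 8-versus-8 example. The function names, the boundary arguments, and the toy statistics are illustrative assumptions, and recomputing the robust statistics for each relabeling is left to the caller.

```python
from itertools import combinations


def balanced_relabelings(n_a, n_b):
    """Yield (new_a, new_b) index tuples in which each new group takes half
    of its members from A (indices 0..n_a-1) and half from B (the rest)."""
    a_idx = range(n_a)
    b_idx = range(n_a, n_a + n_b)
    for from_a in combinations(a_idx, n_a // 2):
        for from_b in combinations(b_idx, n_b // 2):
            new_a = from_a + from_b
            chosen = set(new_a)
            new_b = tuple(i for i in range(n_a + n_b) if i not in chosen)
            yield new_a, new_b


def estimate_fdr(permuted_stats, upper, lower, n_called):
    """Average count of permuted statistics beyond the significance
    boundaries, divided by the number of effects called significant."""
    if n_called == 0:
        return 0.0
    exceed = [sum(1 for s in stats if s > upper or s < lower)
              for stats in permuted_stats]
    return (sum(exceed) / len(exceed)) / n_called


print(sum(1 for _ in balanced_relabelings(8, 8)))
# Toy statistics from two relabelings (numbers are made up).
print(estimate_fdr([[3.0, -0.5, 0.1], [0.2, -2.5, 2.2]],
                   upper=2.0, lower=-2.0, n_called=4))
```

In practice a random subset of the relabelings is usually enough; enumerating all of them is shown only because it is feasible at this sample size.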
21
Speed Data: Analysis and
Comparison
• WY found 8 genes significant, with Type I
error = 0.05
22
Lemon Data: Analysis and
Comparison
• WY found 253 genes significant, with
Type I error = 0.05
23
Conclusions
• Proposed a new method for determining
differential expression in genes
• Dealt with the multiplicity problem by using
only a small subset of genes
• Can extend to other large data sets
• Allow scientists to play a role in sequential
decision making
• Incorporate a priori knowledge of experiment
with selection of c
24