Comparison of Statistical Models for Affymetrix GeneChip
Download
Report
Transcript Comparison of Statistical Models for Affymetrix GeneChip
Theoretical and experimental comparisons of
gene expression indexes for
oligonucleotide microarrays
William J. Lemon, Jeffrey J.T. Palatini,
Ralf Krahe, Fred A. Wright
Department of Biostatistics, University pf North
Carolina, Chapel Hill
Division of Human Cancer Genetics
Ohio State University
Measuring gene expression with the Affymetrix
GeneChip
polyA
Coding portion of gene X
Perfect Match (PM)
Mismatch (MM)
...
PM - 25 bases complementary to region of gene
MM - Middle base is different
•cRNA from sample mRNA is put on the chip
•intensity of binding reflects gene expression
Reproducibility of Probe Sensitivities
Li, C and Wong, WH, Proc. Natl. Acad. Sci. USA, 98:31-36, 2001.
The Li-Wong Model
Li-Wong Full (LWF)
PM ij j j i j i e
MM ij j j i e,
e ~ N (0, 2 )
Identifiability constraint
Li-Wong Reduced (LWR)
2
j J
j
yij PM ij MM ij j i ,
~ N (0, 2 ), 2 2 2
Li, C and Wong, WH, Proc. Natl. Acad. Sci. USA, 98:31-36, 2001.
The Li-Wong Model
Li-Wong Full (LWF)
ith array
jth probe pair
PM ij j j i j i e
MM ij j j i e,
e ~ N (0, 2 )
Identifiability constraint
Li-Wong Reduced (LWR)
2
j J
Total no. probe pairs
j
yij PM ij MM ij j i ,
~ N (0, 2 ), 2 2 2
Li, C and Wong, WH, Proc. Natl. Acad. Sci. USA, 98:31-36, 2001.
The Li-Wong Model
expression
Li-Wong Full (LWF)
ith array
jth probe pair
PM ij j j i j i e
MM ij j j i e,
sensitivities
e ~ N (0, 2 )
Identifiability constraint
Li-Wong Reduced (LWR)
2
j J
Total no. probe pairs
j
yij PM ij MM ij j i ,
~ N (0, 2 ), 2 2 2
Li, C and Wong, WH, Proc. Natl. Acad. Sci. USA, 98:31-36, 2001.
How to compare gene expression indexes?
•We get maximum likelihood estimates for using either
full data (LWF) or reduced data (LWR)
•The Affymetrix software computes:
Average Difference (AD) ˆ y j J .
j
Log-Average (LA)
10 log( PM j / MM j ) / J
j
•The log-average might perform particularly poorly. Note
that if terms are small and error variance is small,
( PM j / MM j ) ( j j ) /( j ) ( j j ) /( j )
•We gain insight by assuming Li-Wong model is true.
Then what are the consequences?
•For large sample sizes, the ’s and ’s will be wellestimated
Compare LW estimators directly:
2
2
(
)
j
j
j
var(ˆreduced )
j
2.0
RE( full, reduced)
2 j
J
var(ˆ full )
Comparing to AD is tricky, but with a correction
factor AD is also an unbiased estimate of :
ˆˆ J j ˆ
j
var(ˆˆ )
1
RE (reduced, AD)
1.0
ˆ
var( reduced ) 1 var( )
•This also gives insight into “perfect match only” analyses:
RE(full, PM-only)=
j
2
var(ˆPM )
j
1
2
var(ˆ full )
( j j )
j
and
1 RE 2
Furthermore, PM-only is always at least twice as efficient as
LWR
Empirical Comparisons
•We propose that an expression index is “good” if it has a high
correlation with the underlying true expression (which is usually
unknown).
•this correlation can be estimated using a specially designed
mixing experiment
•if r is the correlation coefficient between the measured index and
true expression, the “relative efficiency” of two indexes and
can be estimated as
r /(1 r )
2
2
r /(1 r )
2
2
Suppose the true underlying gene expression for a
given gene is . Consider two indices of gene
expression
ˆ 0 1 e , e ~ N (0, 2 )
ˆ 0 1 e , e ~ N (0, 2 ).
ˆˆ
(ˆ 0 ) / 1
And we have
is an unbiased estimate of
ˆˆ
var( )
2
/ 1
2
ˆˆ ) / 1
var(
RE (ˆ,ˆ )
2
ˆˆ / 2
1
var( )
2
2
Can we estimate this relative efficiency?
•Suppose we could do a regression of ˆ on .
•the ratio of explained to residual variance in the model can
be shown to be
1 var( ) /
2
2
2
r
2
1 r
and similarly for ̂ , so
r /(1 r )
2
2
r /(1 r )
2
2
RE (ˆ,ˆ )
Can we estimate r without ever
knowing true expressions ?
•Yes, with a specially designed mixing experiment
•we seek two contrasting conditions in which many
genes will be differentially expressed
Experimental Design
(6 replicates for each condition)
Human Fibroblasts
(GM 08330)
20% FBS
Cell culture
5 passages
20%
Serum starvation
Serum stimulation
0.1% FBS
48h
Harvest total RNA
0.1%
20% FBS
24h
Harvest total RNA
RNA extraction
Produce 50:50 group
Produce duplicates
each day for 3d
Add Bacterial
Control Genes
Starved
50:50
Dap, Thr
Dap, Thr,
Lys, Phe
Stimulated
Lys, Phe
Synthesize cDNA,
cRNA; fragment
Add Hybridization
Control Genes
Hybridize
Data Reduction
BioB, BioC, BioD, Cre
HuGeneFL
Gene Expression Indexes
BIN1 expression
ˆ full
Stim
50:50
Starved
True expression = average of Stim, Starved
BIN1 expression
ˆ full
Stim
1
50:50
2
Starved
3
Note that
r rˆ ,
rˆ
,X
or
r
ˆ , X
Where X=1, 2, 3 (say) for Stim, 50:50 Starved,
respectively
Overall intensity higher in Stimulated
Mean probe
intensity per
array
Stim
50:50
Starved
Coefficients of variation for assay (individual probes)
and gene expression indexes
0 .1 2 1
#
g
e
n
s
0 20 40 60 80
#
g
e
n
s
0 50 10 150 20 250
#
P
r
o
b
e
s
0 20 60 10
As s ay
LW
Sti
F
A
m
St
ffy
0 .2 9
0 .1 4 9
0 .0
0 .5
1 .0
1 .5
2 .0
00
.0
1
.5
1
.0
2
.5
.0
00
.0
1
.5
1
.0
2
.5
CV
CV
CV
Correlation matrix of 18 arrays as a colorized
image for each expression index.
LWR
LWF
Starved
50:50
Stim
LA
AD
Starved
50:50
Stim
Stim
50:50
Starved
Stim
50:50
Starved
Full Model
Affymetrix Ave Diff
Strv 2
Strv 3
Strv 1
Strv 6
Strv 5
Strv 4
Stim 2
Stim 4
50:50 1
Stim 1
Stim 6
Stim 3
Stim 5
50:50 3
50:50 5
50:50 4
50:50 2
50:50 6
Stim 2
Strv 1
Strv 3
Strv 2
Strv 6
Strv 5
Strv 4
Stim 1
Stim 6
Stim 3
Stim 5
Stim 4
50:50 5
50:50 4
50:50 3
50:50 2
50:50 1
50:50 6
Strv 3
Strv 4
Strv 6
Strv 5
Strv 2
Strv 1
Stim 2
Stim 1
Stim 4
Stim 5
Stim 6
Stim 3
50:50 5
50:50 4
50:50 2
50:50 1
50:50 6
50:50 3
Strv 1
Strv 4
Strv 2
Strv 5
Strv 3
Strv 6
50:50 3
50:50 5
50:50 4
50:50 2
50:50 1
50:50 6
Stim 4
Stim 6
Stim 5
Stim 3
Stim 1
Stim 2
Comparing Models
Cluster Analysis
Reduced Model
Affymetrix Log Ave
LA
AD
Unscaled
LWR
LWF
LA
AD
LWR
LWF
0. 0.5 1.0 1.5
Median(r2/(1-r2))
Relative Efficiency
Scaled
Correlation of duplicate measurements of 149 genes
LWF median r=.74
LWR median r=.43
AD median r=.08
LA median r=.17
Number of unexpressed genes
•Only 0.2% of the LW estimates are negative
•50:50 group has fewest negative estimates
•could this indicate very few unexpressed genes?
Stim
50:50
Starved
A conservative approach to estimating
number of unexpressed genes
•Let U denote number of unexpressed genes
•genes are ranked according to expression index
U 2 (median rank of U genes among all genes)
•This is useful if we can get a random sample of
unexpressed genes
Unexpressed population
Gene expression index
•We use the spiked-out bacterial control genes as a
sample of “unexpressed” genes
•the 4 genes are are represented 3 times each (different
portions of mRNA), for a total of 12 probe sets
•Based on this reasoning, we estimate that greater than
88% of the genes are expressed, even in the Starved
samples
Rank of expression index variance across the 6
Stimulated arrays versus rank of index mean
AD
R20 ank(vr) 40 60
R
a
n
k
(
v
r
)
0 20 40 60
LWF
Truly absent in stim
group
D ap
Thr
P he
Lys
0
2000
4000
600
0
2000
4000
6000
Rank(
Rank(mean
Very low estimated expression for
truly absent genes when using LWF
Present/absent calls
•We use the statistic
ˆ
z
SE (ˆ)
to declare genes present/absent (absolute call)
•we find the vast majority of genes on the array appear to be
present
•for the spiked in/out genes, we find vastly improved
present/absent calling using LW estimates
10. -FalseN0g.2tivRae (Sensitv0y.4) 0.6 0.8 1.0
ROC curve - spiked in/out genes
LWF-Z
LWR-Z
Untrimmed AD
LA
AD
Untrimmed LA
Absolute Call
0.0
0.2
0.4
0.6
0.8
1.0
F alse
(1
S
Variability in estimates
Reduced Model
Full Model
log(variance)
Stim
50:50
Starved
log(mean)
Conclusions
•
•
•
•
•
•
Model-based estimators are superior to simple averaging
Full model superior to reduced
this does not necessarily mean that the mismatch probes
are a good idea - but if they are present we should use
them
we have demonstrated this using both analytic
considerations and experimental data
a carefully designed experiment can be used to address
many issues
Many more genes may be expressed than previously
thought
Other issues/ future work
•Spiking genes might be used to calibrate and normalize
arrays
•relationship between variance and mean of expression
indexes may be useful in planning experiments
•our data may be useful for future work, especially in
producing indexes that are resistant to probe saturation
•all primary data, this Powerpoint presentation and a preprint
are available at http://thinker.med.ohio-state.edu