A Model for Measurement Error for Gene Expression Arrays


ECS 289A Presentation
Jimin Ding
• Problem & Motivation
• Two-component Model
• Estimation of the parameters in the above model
• Defining low- and high-level gene expression
• Comparing expression levels
• Limitations of the model and method
• Other possible solutions
• References
A Model for Measurement Error
for Gene Expression Arrays
David Rocke & Blythe Durbin
Journal of Computational Biology, Nov. 2001
Problem & Motivation
• Standard statistical inference assumes normally distributed data with constant variance
--- So hypothesis tests for a difference between control and treatment need equal variance (not depending on the mean of the data);
• Measurement error for gene expression rises proportionately to the expression level
--- So linear regression fails, and the log transformation has been tried;
• However, for genes whose expression level is low, or that are entirely unexpressed, the measurement error does not go down proportionately
--- So the log transformation fails by inflating the variance of observations near background, and the two-component model is introduced.
Example: Mice (figures from Bartosiewicz et al., 2000, and Durbin et al., 2002)
Two-Component Model

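$$y = \alpha + \mu e^{\eta} + \varepsilon$$
• $y$ is the intensity measurement
• $\mu$ is the expression level in arbitrary units
• $\alpha$ is the mean intensity of unexpressed genes
• Error terms: $\varepsilon \sim N(0, \sigma_\varepsilon^2)$ (additive) and $\eta \sim N(0, \sigma_\eta^2)$ (multiplicative)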
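To make the behaviour of the two error terms concrete, the model can be simulated directly. The following is a minimal sketch, assuming the form $y = \alpha + \mu e^{\eta} + \varepsilon$ with independent normal errors; the function name `simulate_intensity` and all parameter values are illustrative assumptions, not taken from the paper.

```python
import math
import random
import statistics

def simulate_intensity(mu, alpha, sigma_eps, sigma_eta, n, rng):
    """Draw n intensities from y = alpha + mu * exp(eta) + eps."""
    return [
        alpha
        + mu * math.exp(rng.gauss(0.0, sigma_eta))  # multiplicative error
        + rng.gauss(0.0, sigma_eps)                 # additive error
        for _ in range(n)
    ]

# Illustrative parameter values (assumptions, not estimates from real arrays).
ALPHA, SIGMA_EPS, SIGMA_ETA = 100.0, 20.0, 0.2

rng = random.Random(0)
for mu in (0.0, 50.0, 5000.0):
    y = simulate_intensity(mu, ALPHA, SIGMA_EPS, SIGMA_ETA, 20_000, rng)
    print(f"mu={mu:7.1f}  mean(y)={statistics.fmean(y):8.1f}  "
          f"sd(y)={statistics.stdev(y):7.1f}")
```

For an unexpressed gene the spread should stay close to the additive standard deviation, while for a strongly expressed gene it should grow roughly in proportion to the expression level.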
Estimation for background ($\alpha$ & $\sigma_\varepsilon$)
$$y = \alpha + \mu e^{\eta} + \varepsilon, \qquad \varepsilon \sim N(0, \sigma_\varepsilon^2), \quad \eta \sim N(0, \sigma_\eta^2)$$
• Estimation of background using negative controls
• Estimation of background with replicate measurements
• Estimation of background without replicate measurements
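Estimation of $\alpha$ & $\sigma_\varepsilon$ with replicate measurements
• Begin with a small subset of genes with low intensity (10%) and compute
$$\hat{\alpha} = \bar{x}_B, \qquad \hat{\sigma}_\varepsilon = s_B, \qquad s_B^2 = \frac{1}{n - m} \sum_{i=1}^{m} (n_i - 1)\, s_i^2$$
where $s_i^2$ is the sample variance of the $n_i$ replicates of gene $i$, $m$ is the number of genes in the subset, and $n = \sum_i n_i$
• Define a new subset consisting of genes whose intensity values are in $[\bar{x}_B - 2 s_B,\ \bar{x}_B + 2 s_B]$
• Repeat the first and second steps until the set of genes does not change.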
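The iteration above can be sketched in a few lines of Python. This is only one reading of the slide, under stated assumptions: a gene's "intensity value" is taken to be the mean of its replicates, $\bar{x}_B$ is the mean of all replicate values in the current subset, every gene has at least two replicates, and a `max_iter` guard is added because convergence is not guaranteed. The function names are hypothetical.

```python
import math
import statistics

def pooled_stats(replicates, subset):
    """Return (mean of all values, pooled within-gene SD) for genes in subset."""
    values = [y for i in subset for y in replicates[i]]
    n, m = len(values), len(subset)
    # Pooled variance: sum of (n_i - 1) * s_i^2 over genes, divided by n - m.
    ss = sum((len(replicates[i]) - 1) * statistics.variance(replicates[i])
             for i in subset)
    return statistics.fmean(values), math.sqrt(ss / (n - m))

def estimate_background(replicates, start_fraction=0.10, max_iter=100):
    """Iteratively estimate (alpha_hat, sigma_eps_hat, gene indices used).

    replicates: one list of replicate intensities per gene (length >= 2 each).
    """
    means = [statistics.fmean(r) for r in replicates]
    order = sorted(range(len(replicates)), key=lambda i: means[i])
    # Step 1: start from the lowest-intensity genes (10% by default).
    subset = set(order[:max(1, int(len(order) * start_fraction))])
    for _ in range(max_iter):  # guard added here; not part of the slide
        x_b, s_b = pooled_stats(replicates, subset)
        # Step 2: keep genes whose mean lies within x_b +/- 2 * s_b.
        new_subset = {i for i, mean in enumerate(means)
                      if x_b - 2 * s_b <= mean <= x_b + 2 * s_b}
        if not new_subset or new_subset == subset:
            break
        subset = new_subset
    alpha_hat, sigma_eps_hat = pooled_stats(replicates, subset)
    return alpha_hat, sigma_eps_hat, sorted(subset)
```

Calling `estimate_background(data)` on a list of per-gene replicate lists returns the two background estimates together with the indices of the genes that ended up in the final subset, which makes it easy to inspect how the result depends on the starting fraction.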
Estimation of the High-level RSD
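• The variance of the intensity in the two-component model: $\mathrm{Var}(y) = \mu^2 S_\eta^2 + \sigma_\varepsilon^2$, where $S_\eta^2 = e^{\sigma_\eta^2}\left(e^{\sigma_\eta^2} - 1\right)$
• At high expression levels only the multiplicative error term is noticeable, so the ratio of the standard deviation to the mean is approximately constant, i.e. RSD $\approx S_\eta$
• For each replicated gene that is at a high level, compute the mean $\hat{\mu}_i$ of $y - \hat{\alpha}$ and the standard deviation $s_i$ of $\log(y - \hat{\alpha})$
• Then use the pooled standard deviation of the $s_i$ to estimate $\sigma_\eta$.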
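The pooling step for the high-level genes might look as follows. This is a sketch under assumptions: "high level" is decided by a user-supplied threshold `high_cutoff` on the mean background-corrected intensity (a hypothetical parameter, since the slide does not say how the high-level genes are picked at this stage), and genes with any non-positive corrected value are skipped because their logarithm is undefined.

```python
import math
import statistics

def estimate_sigma_eta(replicates, alpha_hat, high_cutoff):
    """Pooled SD of log(y - alpha_hat) over replicated high-level genes."""
    ss, dof = 0.0, 0
    for reps in replicates:
        corrected = [y - alpha_hat for y in reps]
        if len(corrected) < 2 or min(corrected) <= 0:
            continue  # need replicates and a defined logarithm
        if statistics.fmean(corrected) < high_cutoff:
            continue  # not a high-level gene under the assumed cutoff
        logs = [math.log(c) for c in corrected]
        ss += (len(logs) - 1) * statistics.variance(logs)
        dof += len(logs) - 1
    if dof == 0:
        raise ValueError("no replicated high-level genes found")
    # Pooled SD: square root of the summed squares over n - m degrees of freedom.
    return math.sqrt(ss / dof)
```

The degrees of freedom accumulate to $n - m$, the same pooling used for the background estimate.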
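Define “high” and “low”
• Low expression level: $\dfrac{\sigma_\varepsilon^2}{\mu^2 S_\eta^2 + \sigma_\varepsilon^2} > 0.9$, i.e. $\mu < \sigma_\varepsilon / (3 S_\eta)$
Most of the variance is due to the additive error component. 95% CI:
$\left(\hat{\mu} - 1.96\sqrt{\mathrm{Var}(\hat{\mu})},\ \hat{\mu} + 1.96\sqrt{\mathrm{Var}(\hat{\mu})}\right)$
• High expression level: $\dfrac{\mu^2 S_\eta^2}{\mu^2 S_\eta^2 + \sigma_\varepsilon^2} > 0.9$, i.e. $\mu > 3\sigma_\varepsilon / S_\eta$
Most of the variance is due to the multiplicative error component. 95% CI: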
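The two 90% criteria can be checked numerically. The sketch below assumes $\mathrm{Var}(y) = \mu^2 S_\eta^2 + \sigma_\varepsilon^2$ with $S_\eta^2 = e^{\sigma_\eta^2}(e^{\sigma_\eta^2} - 1)$, which is the variance of $e^{\eta}$ for normal $\eta$; the function names and the "intermediate" label for genes between the two cutoffs are illustrative additions.

```python
import math

def s_eta(sigma_eta):
    """SD of exp(eta) for eta ~ N(0, sigma_eta^2)."""
    s2 = sigma_eta ** 2
    return math.sqrt(math.exp(s2) * (math.exp(s2) - 1.0))

def additive_fraction(mu, sigma_eps, sigma_eta):
    """Share of Var(y) contributed by the additive error component."""
    additive = sigma_eps ** 2
    multiplicative = (mu * s_eta(sigma_eta)) ** 2
    return additive / (additive + multiplicative)

def cutoffs(sigma_eps, sigma_eta):
    """Return (low cutoff, high cutoff) for mu under the 90% rule."""
    s = s_eta(sigma_eta)
    return sigma_eps / (3.0 * s), 3.0 * sigma_eps / s

def classify_level(mu, sigma_eps, sigma_eta):
    """Label mu as 'low', 'high', or 'intermediate' (label is an assumption)."""
    frac = additive_fraction(mu, sigma_eps, sigma_eta)
    if frac > 0.9:
        return "low"
    if frac < 0.1:  # equivalently, multiplicative share above 0.9
        return "high"
    return "intermediate"
```

At the low cutoff the additive share is exactly 0.9 and at the high cutoff it is exactly 0.1, so the high cutoff is always nine times the low one.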
Comparing Expression Levels
• Common method: a standard t-test on the ratio of expression for treatment and control (low level), or on its logarithm (high level).
• Problem: less effective when a gene is expressed at a low level in one condition and at a high level in the other.
Solution
Treat the treatment and control measurements as correlated
• Model:
• Variation:
Background:
High-level RSD:
Hypothesis testing (Comparison)
• Assume the data have been adjusted:
• Testing: (the gene has the same expression level in control and treatment)
• Then use the following approximate variance to do a standard t-test on the log ratio of the raw data:
Limitations
• No theoretical results for the above estimators (consistency and asymptotic distribution).
• The cutoff point between high and low level is fairly artificial.
• Convergence of the background estimation depends heavily on the data and on the initial selection.
Literature & Other Possible
Solutions for Measurement Error
• Chen et al. (1997): measurement error is normally distributed with constant coefficient of variation (CV), in accord with experience
• Ideker et al. (2000) introduce a multiplicative error component (normal)
• Newton et al. (2001) propose a gamma model for measurement error
• Durbin et al. (2002) suggest the transformation $g(y) = \ln\left[\, y - \alpha + \sqrt{(y - \alpha)^2 + c}\, \right]$, where $c = \sigma_\varepsilon^2 / S_\eta^2$
• Huber et al. (2002) introduce the transformation $h(x) = \operatorname{arsinh}(a + bx)$
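The last two transformations are closely related, which a short sketch makes visible. The function names are illustrative; the formulas follow the two bullets above, with $c > 0$ assumed. Because $\ln(z + \sqrt{z^2 + c}) = \operatorname{arsinh}(z / \sqrt{c}) + \ln\sqrt{c}$, the first transformation equals the second with $a = -\alpha/\sqrt{c}$ and $b = 1/\sqrt{c}$, up to an additive constant.

```python
import math

def glog(y, alpha, c):
    """g(y) = ln[y - alpha + sqrt((y - alpha)^2 + c)], with c > 0."""
    z = y - alpha
    return math.log(z + math.sqrt(z * z + c))

def arsinh_transform(x, a, b):
    """h(x) = arsinh(a + b * x)."""
    return math.asinh(a + b * x)

# Illustrative values (assumptions): alpha = 100, c = 400.
alpha, c = 100.0, 400.0
for y in (50.0, 100.0, 200.0, 10_000.0):
    shifted = arsinh_transform(y, -alpha / math.sqrt(c), 1.0 / math.sqrt(c))
    print(y, glog(y, alpha, c), shifted + 0.5 * math.log(c))
```

Both stay defined for intensities at or below background, where a plain logarithm of $y - \alpha$ is not, and both behave like a logarithm for large intensities.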
References
• Blythe Durbin, Johanna Hardin, Douglas Hawkins, and David Rocke. “A variance-stabilizing transformation for gene-expression microarray data”, Bioinformatics, ISMB, 2002.
• Chen, Y., Dougherty, E.R. and Bittner, M.L. (1997) “Ratio-based decisions and the quantitative analysis of cDNA microarray images”, J. Biomed. Opt., 2, 364–374.
• Wolfgang Huber, Anja von Heydebreck, Martin Vingron (Dec. 2002) “Analysis of microarray gene expression data”, Preprint.
• Wolfgang Huber, Anja von Heydebreck, Holger Sültmann, Annemarie Poustka, and Martin Vingron. “Variance stabilization applied to microarray data calibration and to the quantification of differential expression”, Bioinformatics, 18 Suppl. 1:S96–S104, 2002. ISMB 2002.