(2) Ratio statistics of gene expression
levels and applications to microarray
data analysis
Bioinformatics, Vol. 18, no. 9, 2002
Yidong Chen, Vishnu Kamat, Edward R. Dougherty,
Michael L. Bittner, Paul S. Meltzer, and Jeffrey M.
Trent
Outline

Introduction

Ratio Statistics

Quality Metric for Ratio Statistics

Conclusion
Introduction

Motivation
Expression-based analysis for large families of genes
has recently become possible owing to the development
of cDNA microarrays, which allow simultaneous
measurement of transcript levels for thousands of genes.
For each spot on a microarray, signals in two channels
must be extracted from their backgrounds. This requires
algorithms to extract signals arising from tagged mRNA
hybridized to arrayed cDNA locations and algorithms to
determine the significance of signal ratios.
Introduction

Results
1. Estimation of signal ratios from the two channels,
and of the significance of those ratios.
2. A refined hypothesis test in which the measured
intensities forming the ratio are assumed to be
combinations of signal and background. The new
method involves a signal-to-noise ratio; for a high
signal-to-noise ratio the new test reduces (to a close
approximation) to the original test. The effect of a
low signal-to-noise ratio on the ratio statistics is
the main theme of the paper.
3. A quality metric formulated for spots.
Ratio Statistics
Ratio Statistics assuming a
constant coefficient of variation

Consider a microarray having n genes, with red and
green fluorescent expression values labeled
$R_1, R_2, \ldots, R_n$ and $G_1, G_2, \ldots, G_n$, respectively.

Hypothesis test:
$$H_0: \mu_{R_k} = \mu_{G_k} \qquad H_1: \mu_{R_k} \neq \mu_{G_k}$$

Assumption:
$$\sigma_{R_k} = c\,\mu_{R_k}, \qquad \sigma_{G_k} = c\,\mu_{G_k}$$

Under $H_0$: $\mu_{R_k} = \mu_{G_k}$ (and hence $\sigma_{R_k} = \sigma_{G_k}$)
Ratio Statistics assuming a
constant coefficient of variation
(cont.)

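Ratio test statistic: $T_k = R_k / G_k$

Assuming $R_k$ and $G_k$ to be normally and
identically distributed, $T_k$ has the density function
$$f_{T_k}(t; c) = \frac{(1+t)\sqrt{1+t^2}}{c\,(1+t^2)^2\sqrt{2\pi}} \exp\left[-\frac{(t-1)^2}{2c^2(1+t^2)}\right],$$
and c is estimated from n observed ratios $t_1, \ldots, t_n$ by
$$\hat{c} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\frac{(t_i-1)^2}{t_i^2+1}}.$$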
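As a minimal sketch, the ratio density and the estimator of c can be evaluated directly. The function names `ratio_density` and `estimate_c` are illustrative assumptions and do not come from the paper.

```python
import math


def ratio_density(t, c):
    """Density of T = R/G under H0 for a common coefficient of variation c."""
    one_plus_t2 = 1.0 + t * t
    prefactor = (1.0 + t) * math.sqrt(one_plus_t2) / (
        c * one_plus_t2 ** 2 * math.sqrt(2.0 * math.pi)
    )
    exponent = -((t - 1.0) ** 2) / (2.0 * c * c * one_plus_t2)
    return prefactor * math.exp(exponent)


def estimate_c(ratios):
    """Estimate c as sqrt(mean((t_i - 1)^2 / (t_i^2 + 1))) over observed ratios."""
    n = len(ratios)
    total = sum((t - 1.0) ** 2 / (t * t + 1.0) for t in ratios)
    return math.sqrt(total / n)
```

Ratios close to 1 give a small estimate of c, and the density is largest near t = 1 when c is small.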
Ratio Statistics assuming a constant
coefficient of variation (cont.)


self-self experiment
Duplicate
$$T = t / t',$$
$$\log T_k = \log t_k - \log t_k' = (\log R_k - \log R_k') - (\log G_k - \log G_k'),$$
$$c \approx \sqrt{\frac{1}{n}\sum_{k=1}^{n}(\Delta\log R_k)^2} = \sigma_{\log R_k},$$
where $\Delta\log R = \log R - \log\mu_R$.
Ratio Statistics assuming a
constant coefficient of variation
(cont.)
Therefore,
$$\sigma_{\log T}^2 = \left(\sigma_{\log R}^2 + \sigma_{\log R'}^2 + \sigma_{\log G}^2 + \sigma_{\log G'}^2\right) \approx 4c^2$$

Confidence interval
1. The interval is obtained by integrating the ratio density function.
2. The C.I. is determined by the parameter c, which can be
estimated either from pre-selected housekeeping genes or
from a set of duplicate genes.
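The integration step can be sketched numerically. The code below assumes the constant-cv ratio density from the earlier slide, a uniform grid on [0, t_max], and trapezoidal integration. The function names, the grid size, and the truncation point `t_max` are illustrative assumptions, and the truncation is only reasonable for small c.

```python
import math


def ratio_density(t, c):
    """Constant-cv density of T = R/G under H0."""
    one_plus_t2 = 1.0 + t * t
    prefactor = (1.0 + t) * math.sqrt(one_plus_t2) / (
        c * one_plus_t2 ** 2 * math.sqrt(2.0 * math.pi)
    )
    return prefactor * math.exp(-((t - 1.0) ** 2) / (2.0 * c * c * one_plus_t2))


def ratio_confidence_interval(c, level=0.99, t_max=10.0, steps=100_000):
    """Equal-tailed interval obtained by integrating the density on [0, t_max]."""
    h = t_max / steps
    grid = [i * h for i in range(steps + 1)]
    dens = [ratio_density(t, c) for t in grid]
    # Cumulative trapezoidal integral of the density.
    cdf = [0.0]
    for i in range(1, steps + 1):
        cdf.append(cdf[-1] + 0.5 * h * (dens[i - 1] + dens[i]))
    total = cdf[-1]  # normalise by the mass captured on the grid
    alpha = 1.0 - level
    lo_target = 0.5 * alpha * total
    hi_target = (1.0 - 0.5 * alpha) * total
    lower = next(t for t, area in zip(grid, cdf) if area >= lo_target)
    upper = next(t for t, area in zip(grid, cdf) if area >= hi_target)
    return lower, upper
```

For c = 0.2 this should give an interval of roughly (0.44, 2.28), which is symmetric about 1 on a log scale; a smaller c gives a narrower interval.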
Ratio Statistics for low signal-to-noise ratio

The actual expression intensity measurement is of the
form
$$R_k = (SR_k + BR_k) - \mu_{BR_k},$$
where $SR_k$ is the expression intensity measurement
of gene k,
$BR_k$ is the fluorescent background level, and
$\mu_{BR_k}$ is the mean background level.
Ratio Statistics for low signal-to-noise ratio (cont.)

Null hypothesis of interest: $\mu_{SR_k} = \mu_{SG_k}$

$$\mu_{R_k} = E[R_k] = E[(SR_k + BR_k) - \mu_{BR_k}] = \mu_{SR_k}$$
$$\Rightarrow\; H_0: \mu_{SR_k} = \mu_{SG_k} \iff H_0: \mu_{R_k} = \mu_{G_k}$$
Test statistic: $T_k = R_k / G_k$
Ratio Statistics for low signal-to-noise ratio (cont.)

Major differences:
1. The assumption of a constant cv applies to $SR_k$
and $SG_k$, not to $R_k$ and $G_k$.
2. The density of $T_k$ derived for the constant-cv case
is not applicable.
SNR (signal-to-noise ratio)

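Assuming that $SR_k$ and $BR_k$ are independent,
$$\sigma_{R_k}^2 = \sigma_{SR_k}^2 + \sigma_{BR_k}^2 = (c\,\mu_{SR_k})^2 + \sigma_{BR_k}^2.$$
With
$$\mathrm{SNR}_{R_k} = \frac{E[SR_k]}{\sqrt{E[(BR_k - \mu_{BR_k})^2]}} = \frac{\mu_{SR_k}}{\sigma_{BR_k}},$$
the squared coefficient of variation of $R_k$ is
$$c_{R_k}^2 = \left(\frac{\sigma_{R_k}}{\mu_{R_k}}\right)^2 = \frac{(c\,\mu_{SR_k})^2 + \sigma_{BR_k}^2}{\mu_{SR_k}^2} = c^2 + \frac{\sigma_{BR_k}^2}{\mu_{SR_k}^2} = c^2 + \left(\frac{1}{\mathrm{SNR}_{R_k}}\right)^2.$$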
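The relation between the signal cv, the signal-to-noise ratio, and the cv of the measured intensity can be checked with a short sketch. The function name `effective_cv` is an illustrative assumption.

```python
import math


def effective_cv(c, snr):
    """cv of the measured intensity: sqrt(c^2 + (1 / SNR)^2)."""
    return math.sqrt(c * c + 1.0 / (snr * snr))
```

At high SNR the second term vanishes and the measured cv approaches c, which is why the refined test reduces to the constant-cv test for strong signals. At low SNR the background term dominates.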
[Figure: expression intensity scatter plot]
Confidence interval for the test
statistics

Assumption: $SR_k$, $SG_k$, $BR_k$, $BG_k$ are normally distributed
and independent.
$$T_k = \frac{R_k}{G_k} = \frac{(SR_k + BR_k) - \mu_{BR_k}}{(SG_k + BG_k) - \mu_{BG_k}}$$
$$T \sim \frac{N(p, \sigma_p) + N(\mu_{BR}, \sigma_{BR}) - \mu_{BR}}{N(p, \sigma_p) + N(\mu_{BG}, \sigma_{BG}) - \mu_{BG}}$$
under $H_0$, where $p = \mu_{SR_k} = \mu_{SG_k}$ ($= \mu_{R_k} = \mu_{G_k}$).
Confidence interval for the test
statistics (cont.)

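Under the assumption of constant cv for the signal
(without the background), $\sigma_p = c\,p$.

$$\sigma_B = \max\{\sigma_{BR}, \sigma_{BG}\} \quad \text{(variance parameter)}$$
$$s = p / \sigma_B \quad \text{(signal-to-noise ratio)}$$
$$\beta = \sigma_{BR} / \sigma_{BG} \quad \text{(background std ratio)}$$
$$T \sim \frac{N(s\sigma_B, c\,s\,\sigma_B) + N(0, \beta\sigma_{BG})}{N(s\sigma_B, c\,s\,\sigma_B) + N(0, \sigma_{BG})}$$
The 99% confidence interval for the ratio statistic, c = 0.2:
(a) $\sigma_{BR} = \sigma_{BG} = 100$ (or $\beta = 1$)
(b) $\beta \neq 1$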
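Because the signal-plus-background ratio has no simple closed-form density, one way to approximate its confidence interval is by Monte Carlo simulation. The sketch below draws both channels from the normal model on this slide and reads off empirical quantiles. The function name, the sample size, and the use of empirical quantiles are assumptions for illustration, not the authors' procedure.

```python
import random


def simulate_ratio_interval(s, c=0.2, sigma_br=100.0, sigma_bg=100.0,
                            level=0.99, n=20_000, seed=0):
    """Empirical equal-tailed interval of T for signal-to-noise ratio s > 0."""
    rng = random.Random(seed)
    sigma_b = max(sigma_br, sigma_bg)
    p = s * sigma_b  # common signal mean under H0
    ratios = []
    for _ in range(n):
        # Signal with constant cv plus zero-mean (background-subtracted) noise.
        red = rng.gauss(p, c * p) + rng.gauss(0.0, sigma_br)
        green = rng.gauss(p, c * p) + rng.gauss(0.0, sigma_bg)
        if green != 0.0:
            ratios.append(red / green)
    ratios.sort()
    alpha = 1.0 - level
    lo_idx = int(0.5 * alpha * len(ratios))
    hi_idx = int((1.0 - 0.5 * alpha) * len(ratios)) - 1
    return ratios[lo_idx], ratios[hi_idx]
```

With a very large s the interval should approach the constant-cv interval, and it widens as s decreases. At very low s the denominator can be close to zero or negative, so the empirical interval becomes unstable.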
Correction of background
estimation

Owing to interaction between the fluorescent
signal and background, local-background
estimation is often biased.

To estimate the bias difference, we find the
relationship between the red and green
intensities under the null hypothesis by
assuming a linear relation, G = aR+b.
Correction of background
estimation (cont.)
$$T \sim \frac{N(p, \sigma_p) + N(0, \beta\sigma_{BG})}{N(p, \sigma_p) + N(0, \sigma_{BG})}$$
Simulation
1. Generate 10,000 data points from an exponential distribution
with mean 2,000 to simulate 10,000 gene expression levels.
2. Simulate the intensity measurement for each channel
with a normal distribution whose mean is the intensity
drawn from the exponential distribution, with a constant cv of 0.2.
3. Simulate the background level with a normal distribution:
(1) no bias: background level ~ N(0, 100)
(2) some bias: background level ~ N(b, 100)
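The three simulation steps can be sketched as follows. Several choices here are assumptions: the value 100 is treated as a standard deviation, the bias b is applied to the green channel only, and the function and parameter names are illustrative.

```python
import random


def simulate_expression(n=10_000, mean_level=2000.0, cv=0.2,
                        bg_sd=100.0, green_bias=0.0, seed=0):
    """Simulate red/green intensities under H0 with optional background bias."""
    rng = random.Random(seed)
    red, green = [], []
    for _ in range(n):
        # Step 1: true expression level; expovariate takes the rate 1/mean.
        level = rng.expovariate(1.0 / mean_level)
        # Step 2: channel measurements with a constant cv around that level.
        r_signal = rng.gauss(level, cv * level)
        g_signal = rng.gauss(level, cv * level)
        # Step 3: residual background, unbiased in red, biased by b in green.
        red.append(r_signal + rng.gauss(0.0, bg_sd))
        green.append(g_signal + rng.gauss(green_bias, bg_sd))
    return red, green
```

Calling `simulate_expression(green_bias=500.0)` produces data whose scatter plot bends away from the diagonal at low intensities.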
Scatter plot of simulated expression data
(a) 10,000 data points with no bias from background estimation
(b) 10,000 data points with background estimation bias of 500
dog-leg effect
Correction of background
estimation (cont.)

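G = aR + b
We employ a chi-square fitting method that minimizes
$$\chi^2 = \sum_{k=1}^{N} \frac{(G_k - (aR_k + b))^2}{\sigma_{R_k}^2 + \sigma_{G_k}^2},$$
with b estimated by
$$b = \frac{\sum_{k=1}^{N} \left(c^2(R_k^2 + G_k^2) + \hat\sigma_{BR}^2 + \hat\sigma_{BG}^2\right)^{-1}(G_k - R_k)}{\sum_{k=1}^{N} \left(c^2(R_k^2 + G_k^2) + \hat\sigma_{BR}^2 + \hat\sigma_{BG}^2\right)^{-1}}.$$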
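The estimate of b is a weighted mean of the channel differences, with each spot weighted by the inverse of its approximate variance. The sketch below assumes slope a = 1, which is what the differences G_k - R_k in the formula suggest. The function and argument names are illustrative.

```python
def estimate_background_offset(red, green, c, sigma_br, sigma_bg):
    """Inverse-variance weighted mean of G_k - R_k (assumes slope a = 1)."""
    numerator = 0.0
    denominator = 0.0
    for r, g in zip(red, green):
        # Approximate variance of the spot: signal part plus both backgrounds.
        variance = c * c * (r * r + g * g) + sigma_br ** 2 + sigma_bg ** 2
        weight = 1.0 / variance
        numerator += weight * (g - r)
        denominator += weight
    return numerator / denominator
```

Low-intensity spots have the smallest variance and therefore dominate the estimate, which fits the fact that a background bias is most visible at low intensities.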
Quality Metric for Ratio Statistics

For a given cDNA target, the following factors
affect ratio measurement quality:
(1) Weak fluorescent intensities
(2) A smaller than normal detected target area
(3) A very high local background level
(4) A high standard deviation of target intensity
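(1) Fluorescent intensity
measurement quality

Under the null hypothesis, the signal means are
equal, so that
$$\min\{\mathrm{SNR}_R, \mathrm{SNR}_G\} = \frac{\mu_R}{\max\{\sigma_{BR}, \sigma_{BG}\}} = \frac{\mu_R}{\sigma_B}.$$
We replace $\mu_R$ and $\sigma_B$ by their null-hypothesis
estimators, $(R+G)/2$ and $\hat\sigma_B$, to obtain
$$w_I = \begin{cases} 0, & \dfrac{R+G}{2\hat\sigma_B} < 3 \\[2mm] \dfrac{R+G}{6\hat\sigma_B} - 1, & 3 \le \dfrac{R+G}{2\hat\sigma_B} \le 6 \\[2mm] 1, & \text{otherwise} \end{cases}$$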
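A minimal sketch of the intensity weight, assuming a ramp from 0 to 1 as the estimated signal-to-noise ratio goes from 3 to 6. The function name `intensity_quality` is illustrative.

```python
def intensity_quality(red, green, sigma_b_hat):
    """w_I: 0 below an estimated SNR of 3, 1 above 6, linear in between."""
    snr_hat = (red + green) / (2.0 * sigma_b_hat)
    if snr_hat < 3.0:
        return 0.0
    if snr_hat <= 6.0:
        return snr_hat / 3.0 - 1.0  # equals (R + G) / (6 * sigma_B) - 1
    return 1.0
```

For example, with a background standard deviation of 100, a spot with R = G = 450 has an estimated SNR of 4.5 and falls in the middle of the ramp.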
(2) Target area measurement
quality

Let $A_M$ be the area of the mask of the cDNA target
for a particular print-tip, and let $A_{T_k}$ be the area
of the two largest connected components of
target k.

The proportional area of each target is
$a_k = A_{T_k} / A_M$.
We define the area measurement quality
by
$$w_a = \begin{cases} 0, & a < s_{\min} = \max\{10 / A_M,\, 0.05\} \\[2mm] \dfrac{a - s_{\min}}{s_b - s_{\min}}, & s_{\min} \le a \le s_b = 0.20 \\[2mm] 1, & \text{otherwise} \end{cases}$$
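The area weight can be sketched in the same way. The code assumes the ramp rises from 0 at s_min to 1 at s_b, and the function and argument names are illustrative.

```python
def area_quality(target_area, mask_area, s_b=0.20):
    """w_a: ramp from 0 at s_min to 1 at s_b on the proportional target area."""
    a = target_area / mask_area
    s_min = max(10.0 / mask_area, 0.05)
    if a < s_min:
        return 0.0
    if a <= s_b:
        return (a - s_min) / (s_b - s_min)
    return 1.0
```

For a mask of 400 pixels, s_min is 0.05, so a target covering 50 pixels (a = 0.125) sits halfway up the ramp.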
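(3) Background flatness quality

Define background flatness
$w_b = \min\{w_{BR}, w_{BG}\}$, where
$$w_{BR} = \begin{cases} 1, & BR_k < \mu_{BR} + 4\sigma_{BR} \\[2mm] \dfrac{(\mu_{BR} + 6\sigma_{BR}) - BR_k}{3\sigma_{BR}}, & \mu_{BR} + 4\sigma_{BR} \le BR_k \le \mu_{BR} + 6\sigma_{BR} \\[2mm] 0, & BR_k > \mu_{BR} + 6\sigma_{BR} \end{cases}$$
and $w_{BG}$ is defined similarly.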
(4) Signal intensity consistency
quality
[Figure: typical target shapes, with cv = 0.48, 0.81, 0.45, 0.98, 0.31, and 0.59]
(4) Signal intensity consistency
quality (cont.)
Letting $cv_{\min,k}$ denote the minimum of
the intensity coefficients of variation for the red
and green channels,
$$w_s = \begin{cases} 0, & cv_{\min,k} > 1.1 \\[2mm] \dfrac{1.1 - cv_{\min,k}}{0.2}, & 0.9 \le cv_{\min,k} \le 1.1 \\[2mm] 1, & cv_{\min,k} < 0.9 \end{cases}$$
