PowerPoint - Isaac Newton Institute

Download Report

Transcript PowerPoint - Isaac Newton Institute

On Mixture Models in
High-Dimensional Testing
for the Detection of
Differential Gene Expression
Geoff McLachlan
Department of Mathematics & Institute for Molecular Bioscience
University of Queensland
ARC Centre of Excellence in Bioinformatics
http://www.maths.uq.edu.au/~gjm
Microarray Data represented as N x M Matrix
Sample 1 Sample 2
Expression Signature
Gene 1
Gene 2
Expression Profile
Gene N
Sample M
M columns (samples) ~
102
N rows (genes) ~
104
The Challenge for Statistical Analysis of
Microarray Data
Microarrays present new problems for statistics
because the data are very high dimensional with
very little replication.
The challenge is to extract useful information and
discover knowledge from the data, such as gene
functions, gene interactions, regulatory pathways,
metabolic pathways etc.
Detection of Differential Expression
• Supervised Classification of Tissue Samples
• Clustering of Gene Profiles
Gene 1
Gene 2
.
.
.
.
Gene N
Sample 2
.
.
.
.
Sample 1
Sample M
Sample 2
.
.
.
.
Sample 1
Gene 1
Gene 2
.
.
.
.
Gene N
Class 1
Class 2
Sample M
Hedenfalk Breast Cancer Data
• cDNA arrays, tumour samples from carriers of either the
BRCA1 or BRCA2 mutation (hereditary breast cancers)
• BRCA1/BRCA2 mutations associated with decreased
ability for DNA repair, yet pathologically distinct
• Dataset of M = 15 patients: 7 BRCA1 vs 8 BRCA2, with
N = 3,226 genes.
The problem is to find genes which are differentially
expressed between the BRCA1 and BRCA2 patients.
Hedenfalk et al. (2001) NEJM, 344, 539-547
An example of a gene from
Hedenfalk et al (2001) breast cancer data
Class 1
BRCA1: n1 = 7 tissues
gene j
Class 2
BRCA2: n2 = 8 tissues
-0.587 -0.5 -0.0707 -0.265 -0.542 -0.522 0.265
-0.7 0.377 0.0318 -0.475 -0.627 -0.56 1.39 -0.4
x1  0.3173, s12  0.1002
x2  0.1203, s22  0.5066
s 2j  0.3190
tj 
x1 j  x2 j
sj
1
n1
 n12
 0.6739
Pj  2 F13 ( | t j |)  0.512
Supervised Classification (Two Classes)
Sample 1
.......
Sample M
.......
Gene 1
Gene N
Class 1
(good prognosis)
Class 2
(poor prognosis)
Selection Bias
Bias that occurs when a subset of the
variables is selected (dimension
reduction) in some “optimal” way, and
then the predictive capability of this
subset is assessed in the usual way; i.e.
using an ordinary measure for a set of
variables.
Selection Bias
Discriminant Analysis:
McLachlan (1992 & 2004, Wiley, Chapter 12)
Regression:
Breiman (1992, JASA)
“This usage (i.e. use of residual of SS’s etc.) has
long been a quiet scandal in the statistical
community.”
Selection bias with SVM using RFE:
GUYON, WESTON, BARNHILL & VAPNIK (2002, Machine Learning)
“The success of the RFE indicates that RFE has a built in
regularization mechanism that we do not understand yet that
prevents overfitting the training data in its selection of gene
subsets.”
Ambroise, C. and McLachlan, G.J (2008).
Selection bias in gene extraction on the basis of microarray gene-expression data.
Proceedings of the National Academy of Sciences USA 99, 6562-6566.
Figure 1: Error rates of the SVM rule with RFE procedure
averaged over 50 random splits of colon tissue samples
Nature Reviews Cancer, Feb. 2005
Number of Genes
Error Rate for Top
70 Genes (without
correction for
Selection Bias as
Top 70)
Error Rate for Top
70 Genes (with
correction for
Selection Bias as
Top 70)
Error Rate for
5422 Genes
1
0.40
0.40
0.40
2
0.31
0.37
0.40
4
0.24
0.40
0.42
8
0.23
0.32
0.29
16
0.22
0.33
0.38
32
0.19
0.35
0.38
64
0.23
0.32
0.33
70
0.23
0.32
-
128
-
-
0.32
256
-
-
0.29
512
-
-
0.31
1024
-
-
0.32
2048
-
-
0.35
4096
-
-
0.37
5422
-
-
0.37
Table 2: Ten-fold Cross-validated error rates for SVM
based on subset with minimum error rate in RFE process
with and without bias correction for optimization over
subsets of varying sizes
Dataset
Error rate without bias
correction for varying
size of optimal subset
Error rate with bias
correction for varying
size of optimal subset
van ’t Veer (5422 genes)
0.29
0.37
Analysis of Breast Cancer Data
• van’t Veer et al. (2002, Nature)
• van de Vijver et al. (2002, NEJM)
FDA has approved on 7/2/07 a new genetic test called MammaPrint®, "that
determines the likelihood of breast cancer returning within five to 10 years after a
woman's initial cancer." According to the FDA, the test, a product of Agendia
(Amsterdam, the Netherlands), is the first cleared product that profiles genetic
activity.
MINDACT (Microarray for Node Negative Disease may Avoid Chemotherapy)
prospective project, run by the EORTC (European Organisation for Research and
Treatment of Cancer) and TRANSBIG, the translational research network of the
Breast International Group
• Zhu, J.X., McLachlan, G.J., Ben-Tovim, L., and Wood, I.
(2008). On selection biases with prediction rules formed
from gene expression data. Journal of Statistical
Planning and Inference 38, 374-386.
• McLachlan, G.J., Chevelu, J., and Zhu, J. (2008).
Correcting for selection bias via cross-validation in the
classification of microarray data. In Beyond Parametrics
in Interdisciplinary Research: Festschrift in Honour of
Professor Pranab K. Sen, N. Balakrishnan, E. Pena, and M.J.
Silvapulle (Eds.). Hayward, California: IMS Collections, pp. 383395.
Detection of Differential Expression
• Supervised Classification of Tissue Samples
• Clustering of Gene Profiles
Microarray Data represented as N x M Matrix
Sample 1 Sample 2
Expression Signature
Gene 1
Gene 2
Expression Profile
Gene N
Sample M
M columns (samples) ~
102
N rows (genes) ~
104
Clustering of Microarray Data
Clustering of tissues on basis of genes:
latter is a nonstandard problem in
cluster analysis (n =M << p=N)
EMMIX-GENE
Clustering of genes on basis of tissues:
genes (observations) not independent and
structure on the tissues (variables) (n=N >> p=M)
EMMIX-WIRE
Clustering of gene expression profiles
• Longitudinal (with or without replication, for
example time-course)
• Cross-sectional data
EMMIX-WIRE
EM-based MIXture analysis With Random Effects
Ng, McLachlan, Wang, Ben-Tovim Jones, and Ng (2006, Bioinformatics)
Supplementary information :
http://www.maths.uq.edu.au/~gjm/bioinf0602_supp.pdf
In the ith component of the mixture, the profile vector yj for the jth
gene follows the model
y j  Xβ i  Ub ij  Vc i  ε ij
p 1
m 1
qb 1
qc 1
p 1
b ij ~ N (0, H qb )
ci ~ N (0, ci I qc )
 ij ~ N (0, A i )
A i  diag( Wφi ), φi  ( i21 ,,  iq2 e )T
( j  1,  , n )
Example: Hedenfalk Data
n =15 tissues (7 from Class 1; 8 from Class 2) 3226 Genes
y j  Xβ i  Ub ij  Vc i  ε ij
where
 
Xβ i  X i1 ,
 i 2 
1 0 





1 0 
X
,
0
1


  


0
1


 bi1 j 
,
Ub ij  X 
 bi 2 j 
 ci1 


Vc i  I15   
c 
 i ,15 
  i21 
cov( εij )  diag( Wφi )  diag X 2 
 i 2 
• Flack, L.K. and McLachlan, G.J. (2008). Clustering methods
for gene-expression data. In Handbook of Research on Systems
Biology Applications in Medicine, A. Daskalaki (Ed.).
Hershey, Pennsylvania: Information Science Reference. To
appear.
• McLachlan, G.J., Bean, R., and Ng, S.K. (2008). Clustering of
microarray data via mixture models. In Statistical Advances in
Biomedical Sciences, A. Biswas, S. Datta, J. Fine, and M.R.
Segal (Eds.). Hoboken, New Jersey: Wiley, pp. 365-384.
• McLachlan, G.J., Bean, R., and Ng, S.K. (2008). Clustering. In
Bioinformatics, Vol. 2: Structure, Function, and Applications,
J. Keith (Ed.). Totowa, New Jersey: Humana Press, pp. 423439.
Gene 1
Gene 2
.
.
.
.
Gene N
Sample 2
.
.
.
.
Sample 1
Sample M
Sample 2
.
.
.
.
Sample 1
Gene 1
Gene 2
.
.
.
.
Gene N
Class 1
Class 2
Sample M
An example of a gene from
Hedenfalk et al (2001) breast cancer data
Class 1
BRCA1: n1 = 7 tissues
gene j
Class 2
BRCA2: n2 = 8 tissues
-0.587 -0.5 -0.0707 -0.265 -0.542 -0.522 0.265
-0.7 0.377 0.0318 -0.475 -0.627 -0.56 1.39 -0.4
x1  0.3173, s12  0.1002
x2  0.1203, s22  0.5066
s 2j  0.3190
tj 
x1 j  x2 j
sj
1
n1
 n12
 0.6739
Pj  2 F13 ( | t j |)  0.512
Using just the B permutations of the class labels for the
gene-specific statistic tj , the P-value is assessed as:
Pj 
# {b :| t
(b )
j
|| t j |}
B
where t(b)j is the null version of tj after the bth permutation
of the class labels.
If we pool over all N genes, then:
B
Pj  
b 1
#{i :| t
(b )
i
|| t j |, i  1,..., N }
NB
Multiplicity Problem
When many hypotheses are tested, the
probability of a false positive increases sharply
with the number of hypotheses.
Methods for dealing with the Multiplicity Problem
• The Bonferroni Method
controls the family wise error rate (FWER) i.e.
the probability that at least one false positive error will be
made
 Too strict for gene expression data, tries to make it
unlikely that even one false rejection of the null is made,
may lead to missed findings
• The False Discovery Rate (FDR)
emphasizes the proportion of false positives among the
identified differentially expressed genes.
 Good for gene expression data – says something about
the chosen genes
The FDR and other error rates
PREDICTED
Retain Null
Reject Null
Null
a
Non-null
c (false negative) d
TRUE
b
(false positive)
N
r
FDR ~
b
N
r
FNDR ~
c
ac
The FDR and other error rates
PREDICTED
Retain Null
Reject Null
Null
a
Non-null
c (false negative) d
TRUE
b
(false positive)
Nr
FDR ~
b
N
r
FNR =
c
cd
b
FDR  E{
}
Nr  1
where
Nr  1  max( Nr ,1)
c
FNDR  E{
}
( N  Nr )  1
Pj  2 F13 ( | t j |)
z j  1[ F13 (t j )]
Pj  2 F13 ( | t j |)
z j   1 (1  Pj )
Local FDR
Lee (2000), Efron et al (2001), Newton et al.
(2001) proposed a two-component mixture model
f ( z j )   0 f 0 ( z j )  (1   0 ) f1 ( z j )
 0 ( z j )  pr{ jth gene is null | z j }

 0 f0 ( z j )
f (z j )
 0 f0 ( z j )

(by Bayes' theorem)
 0 f 0 ( z j )  (1   0 ) f1 ( z j )
Global and Local FDR
• The global FDR is concerned with the average number of
false positives among the selected genes.
• The local FDR gives a (probabilistic) assessment of
differential expression for each gene.
An approach using mixture models gives estimates for the
local FDR as well as the global FDR and other error rates
such as FNR.
It also allows consideration whether an empirical null
distribution should be adopted in place of the theoretical null.
• McLachlan GJ, Bean RW, Ben-Tovim Jones
L, Zhu JX. (2005).Using mixture models to
detect differentially expressed genes.
Australian Journal of Experimental
Agriculture 45, 859-866.
• McLachlan GJ, Bean RW, Ben-Tovim Jones
L. (2006). A simple implentation of a
normal mixture approach to differential
gene expression in multiclass microarrays.
Bioinformatics 26, 1608-1615.
• Efron B et al (2001) Empirical Bayes analysis of a
microarray experiment. JASA 96,1151-1160.
• Efron B (2004) Large-scale simultaneous hypothesis
testing: the choice of a null hypothesis. JASA 99, 96104.
• Efron B (2004) Selection and estimation for large-Scale
simultaneous inference.
• Efron B (2005) Local false discovery rates.
• Efron B (2006) Correlation and large-scale
simultaneous significance testing.
• Efron B (2006) Size, power and false discovery rates.
• Efron B (2007) Simultaneous inference: when should
hypothesis testing problems be combined.
http://www-stat.stanford.edu/~brad/papers/
An example where local FDR is more
informative: Glonek and Solomon (2003)
F0: N(0,1), π0=0.6
F1: N(1,1), π1=0.4
Reject H0 if z≥2
Reject H0 if z≥2
τ0(2) = 0.99972
but FDR=0.17
τ0(2) = 0.251
but FDR=0.177
0.3
0.3
0.4
0.4
F0: N(0,1), π0=0.9
F1: N(1,1), π1=0.1
null
0.2
0.2
null
0.1
0.1
non-null
0.0
0.0
non-null
-4
-2
0
2
4
-4
-2
0
2
4
The Procedure
1. Obtain the z-score for each of the genes
zj  
1
1  P 
j
2. Rank the genes on the basis of the z-scores, starting
with the largest ones (the same ordering as with the Pvalues, Pj).
3. The posterior probability of non-differential
expression of gene j, is given by  0(zj).
4. Conclude gene j to be differentially expressed if
ˆ 0(zj) < c0
If we conclude that gene j is differentially
expressed if:
ˆ 0(zj)  co,
then this decision minimizes the (estimated)
Bayes risk
Risk  (1  co ) 0e01  co1e10
where e01 is the probability of a false positive and
e10 is the probability of a false negative.
Suppose 0(z) is monotonic decreasing in z.
Then
ˆ0 ( z j )  c0 for z j  z 0
1  F 0( z 0)
ˆ
FDR  ˆ 0
1  Fˆ ( z 0)
E 0( z) | z  z 0
Suppose 0(z) is monotonic decreasing in z. Then
ˆ0 ( z j )  c0
for
z j  z0
1  F 0( z 0)
ˆ
FDR  ˆ 0
1  Fˆ ( z 0)
where
F 0( z 0)   ( z 0)


ˆ
z


0
1

Fˆ ( z 0)  ˆ 0( z 0)  ˆ1
 ˆ1 
For a desired control level a, say a = 0.05, define

z0  arg min FDˆ R( z)  a
z
If
1  F 0( z )
ˆ 0
1  Fˆ ( z )

(1)
is monotonic in z, then using (1)
to control the FDR [with ˆ 0  1 and Fˆ ( z ) taken
to be the empirical distribution function] is
equivalent to using the Benjamini-Hochberg
procedure based on the P-values corresponding
to the statistic zj.
LOCAL FDR
(POSTERIOR) PROB. Of GENE j BEING NOT
DIFFERENTIALLY EXPRESSED
 0 f0 ( z j )
0(z j ) 
 0 f 0 ( z j )  (1   0 ) f1 ( z j )
N(0,1)
 0 f00 (z
( zjj))
 0 (z j ) 
 0 f 0 ( z j )  (1   0 ) f1 ( z j )
In order to proceed with estimation of π0 (can easily
estimate f(zj) from z1,…,zN) we need to make the
problem identifiable.
Now f0(zj) is N(0,1) and we have to assume something
about f1(zj).
N(0,1)
 0 f00 (z
( zjj))
 0 (z j ) 
 0 f00(z
( zjj))  (1   0 ) ff1 ((zz j))
1
N(0,1)
j
N(μ1,σ12)
Z-values, null case
Z-values, +2
Z-values, +1
Z-values, +3
EMMIX-FDR
A program has been written in R which
interfaces with EMMIX to implement the
algorithm described in McLachlan et al.
(2006).
We fit a mixture of two normal components
to the P-values of the test statistics
calculated from the gene expression data.
Fit
π0N(0,1) + (1- π0)N(μ1,σ12)
via maximum likelihood.
For given π0, MLEs of μ1, σ12 are determined: try
various π0.
When we equate the sample mean and variance
of the mixture to their population counterparts, we obtain:
z  ˆ 0 ˆ 0  ˆ1ˆ1
s  ˆ 0ˆ  ˆ1ˆ  ˆ 0ˆ1 ( ˆ 0  ˆ1 )
2
z
2
0
2
1
2
When we are working with the theoretical null,
we can easily estimate the mean and variance
of the non-null component with the following
formulae.
ˆ12  {s z2  ˆ 0  ˆ 0 (1  ˆ 0 ) ˆ12 } /(1  ˆ 0 )
ˆ1  z /(1  ˆ 0 )
Estimated FDR
N
FDˆ R  ˆ0 ( z j ) I[ 0,c0 ] (ˆ0 ( z j )) /N r
j 1
where
N
N r   I[ 0,c0 ] (ˆ0 ( z j ))
j 1
Similarly, the false positive rate is given by
N
N
j 1
j 1
FPˆ R  ˆ0 ( z j ) I[ 0,c0 ] (ˆ0 ( z j )) / ˆ0 ( z j )
and the false non-discovery rate and false negative
rate by:
N
FNDˆ R  ˆ1 ( z j ) I ( c0 , ) (ˆ0 ( z j )) /( N  N r )
j 1
N
N
j 1
j 1
FNˆ R  ˆ1 ( z j ) I ( c0 , ) (ˆ0 ( z j )) / ˆ1 ( z j )
-4
-2
0
2
4
0
50
100
150
60
Null
20
40
Non–Null
(DE genes)
0
Frequency
80
100
Fitting two component
mixture
model
to Hedenfalk data
Histogram of z-scores
for 3226 Hedenfalk
genes
-4
-2
0
z
2
4
Ranking and Selecting the Genes
Gene j
Gene 1
.
.
.
Gene 143
.
.
.
.
Gene 200
.
.
.
Gene N
Pj
zj
Local FDR
ˆ ( z )
0
j
0.0108
.
.
.
0.0998
0.1004
0.1010
.
.
0.1252
FDR
= Sum/143
= 0.06
co = 0.1
Proportion of
False Negatives
= 1 – Sum1/ 57
= 0.89
co
Nr
FDˆ R
FNˆ DR
FNˆ R
FPˆ R
0.1
143
0.06
0.32
0.88
0.004
0.2
338
0.11
0.28
0.73
0.02
0.3
539
0.16
0.25
0.60
0.04
0.4
742
0.21
0.22
0.48
0.08
0.5
971
0.27
0.18
0.37
0.12
Estimated FDR and other error rates for various levels
of threshold co applied to the posterior probability of nondifferential
expression for the breast cancer data (Nr=number of selected genes)
Comparison of identified DE genes
Our method (143)
Hedenfalk (175)
24
6
12
29
101
39
8
Storey and Tibshirani (160)
Storey and Tibshirani (2003) PNAS, 100, 9440-9445
Uniquely Identified Genes:
Differentially Expressed between BRCA1 and BRCA2
Gene
UBE2B, DDB2
(UBE2V1)
RAB9, RHOC
ITGB5, ITGA3
PRKCBP1
HDAC3, MIF
KIF5B, spindle body
pole protein
CTCL1
TNAFIP1
HARS, HSD17B7
GO Term
DNA repair
(cell cycle)
small GTPase signal transduction
integrin mediated signalling pathway
regulation of transcription
negative regulation of apoptosis
cytoskeleton organisation
vesicle mediated transport
cation transport
metabolism
Estimates of π0 for Hedenfalk data
• 0.52 (Broet, 2004)
• 0.64 (Gottardo, 2006)
• 0.61 (Ploner et al, 2006)
• 0.47 (Storey, 2002)
Using a theoretical null, we estimated π0 to be
0.65.
Theoretical and empirical nulls
Efron (2004) suggested the use of two kinds
of null component: the theoretical and the
empirical null. In the theoretical case the
null component has mean 0 and variance 1
and the empirical null has unrestricted mean
and variance.
From Efron (2006)
Theoretical null may not hold for 4 reasons
1. Failed assumptions
• Maybe non-normality distorts student’s t-distribution
• Can use permutation methods
2. Correlation across arrays
• Student-t null density assumes independence
across arrays
• Permutation methods cannot help
3. Unobserved covariates (age, weight, stage)
• Tend to widen null density of the zj’s
• Permutation methods cannot help
4. Correlation across genes
ˆ0 ( z j )   0 f 0 ( z j ) / fˆ ( z )
Estimation of f(z) does not require independence
of zj’s
Suppose (1), (2), or (3) is applicable but (4) is not
(assume genes independent).
null Zj may not be ~ N(0,1)
i.e. theoretical null may not hold
Thus: use empirical null
 0 f00 (z
( z jj)
 0 (z j ) 
 0 f00 (z
( z jj )  (1   0 ) f11((z
z j j))
N(μ0,σ02)
N(μ1,σ12)
μ0, σ02 are now estimated from the data.
Call N(μ0, σ02) the empirical null distribution.
Problem now is to fit
 0 N ( 0 ,  )  (1   0 ) N ( 1 ,  )
2
0
2
1
1. Specify an initial value of π0 (try theoretical
null estimate and other estimates as before)
2. Rank zj’s and put Nπ0 smallest in null
component and remainder in non-null component
3. Work out means/variances as if they are the true
groups
Can check for need of empirical null in place
of theoretical null by comparing
twice the increase in the log likelihood
when fitting μ0, σ02 instead of
fixing μ0=0 and σ02=1.
From Efron (2006)
Now suppose the zj’s are correlated (4th reason).
Even if theoretical null N(0,1) is correct for an
individual zj of a null gene, the zj’s for the null genes
may not behave as N(0,1) variates in the ensemble
of z1,…,zN.
If they don’t, then the Benjamini-Hochberg
procedure will break down using P-values based on
theoretical null.
Fit
 0 N ( 0 ,  )  (1   0 ) N ( 1 ,  )
2
0
2
1
Still using maximum likelihood, although the function we
are maximizing is no longer the true likelihood due to correlation
across the genes.
100
80
60
0
20
40
Frequency
-4
-2
0
2
4
z-scores
Breast cancer data: plot of fitted two-component
normal mixture model with theoretical N(0,1) null and non-null
components (weighted respectively by the estimated proportion of null and
non-null genes) imposed on histogram of z-scores.
100
80
60
0
20
40
Frequency
-4
-2
0
2
4
z-scores
Hedenfalk breast cancer data:
plot of fitted two-component normal mixture model with empirical null
and non-null components (weighted respectively by the estimated proportion
of null and non-null genes) imposed on histogram of z-scores.
co
Nr
FDˆ R
FNˆ DR
FNˆ R
FPˆ R
0.1
143
0.06
0.32
0.88
0.004
0.2
338
0.11
0.28
0.73
0.02
0.3
539
0.16
0.25
0.60
0.04
0.4
742
0.21
0.22
0.48
0.08
0.5
971
0.27
0.18
0.37
0.12
Table 1. Theoretical Null
co
Nr
FDˆ R
FNˆ DR
FNˆ R
FPˆ R
0.1
62
0.07
0.23
0.93
0.00
0.2
212
0.13
0.20
0.77
0.01
0.3
343
0.17
0.18
0.64
0.02
0.4
504
0.23
0.15
0.51
0.05
0.5
644
0.28
0.13
0.41
0.07
Table 2. Empirical Null
co
Nr
FDˆ R
FNˆ DR
FNˆ R
FPˆ R
0.1
143
0.06
0.32
0.88
0.004
62
0.07
0.23
0.93
0.00
338
0.11
0.28
0.73
0.02
212
0.13
0.20
0.77
0.01
0.2
Table 3. Theoretical versus Empirical Null
Allison Mice Simulation
Allison et al. (2002) generated data for 10 mice over 3000
genes. The data are generated in six groups of 500 with a
value ρ of 0, 0.4, or 0.8 in the off-diagonal elements of the
500 x 500 covariance matrix used to generate each group.
For a random 20% of the genes, a value d of 0, 4, or 8 is
added to the gene expression levels of the last five mice.
Ben-Tovim Jones, L., Bean, R.W., McLachlan, G.J., and Zhu, J.X. (2006). Mixture
models for detecting differentially expressed genes in microarrays. International
Journal of Neural Systems 16, 353-362.
Theoretical null, ρ=0.8, d=4
Empirical null, ρ=0.8, d=4
Theoretical null, ρ=0.8, d=8
Empirical null, ρ=0.8, d=8
Clustering of gene expression profiles
• Longitudinal (with or without replication, for
example time-course)
• Cross-sectional data
EMMIX-WIRE
EM-based MIXture analysis With Random Effects
Ng, McLachlan, Wang, Ben-Tovim Jones, and Ng (2006, Bioinformatics)
Supplementary information :
http://www.maths.uq.edu.au/~gjm/bioinf0602_supp.pdf
Example: Hedenfalk Data
n =15 tissues (7 from Class 1; 8 from Class 2) 3226 Genes
y j  Xβ i  Ub ij  Vc i  ε ij
where
 
Xβ i  X i1 ,
 i 2 
1 0 





1 0 
X
,
0
1


  


0
1


 bi1 j 
,
Ub ij  X 
 bi 2 j 
 ci1 


Vc i  I15   
c 
 i ,15 
  i21 
cov( εij )  diag( Wφi )  diag X 2 
 i 2 
 i ( y j , c;  )  pr{Z ij  1 | y j , c}

 i f ( y j | zij  1, ci ; i )

N(i,Bi, with

f
(
y
|
z

1
,
c
;

)
h
j
hj
h
h
h 1
g
i  Xi  Vci
Bi  Ai   biUU
T
co
Nr
FDˆ R
FNˆ DR
FNˆ R
FPˆ R
0.1
62
0.07
0.23
0.93
0.00
257
0.06
0.32
0.79
0.01
212
0.13
0.20
0.77
0.01
480
0.10
0.27
0.63
0.02
0.2
Table 4. Empirical versus Clustering Approach
Summary
• Mixture model based approach to finding
DE genes is both convenient and effective
• Gives measure of local as well as global
FDR; also gives other error rates
• Provides an empirical null for use when
theoretical null may be misleading