Analyzing Factorially designed microarray experiments

Scholtens, D. et al.
Journal of Multivariate Analysis, to appear
Presented by
M. Carme Ruíz de Villa &
Alex Sánchez
Introduction
Complexity of genomic data



The functioning of cells is a complex
and highly structured process
Tools are being developed that allow us
to explore it in a multitude of ways
Many of these tools rely on the results
of microarray expression experiments
Genes interact …
[Figure: gene interaction diagram. Labels: protein kinase, protein phosphatase, transcription factor, protein, P, inactive, active, DNA, Gene 1 to Gene 5]
Treatments are applied in living dynamic cells
mRNA abundance is affected by transcription factors,
protein complexes, methylation, etc…
The holy grail


The holy grail of functional genomics is
the reconstruction of genetic networks
(Wagner 2001)
(We claim that) Factorial experiments



are simple to perform and
can help reach this goal if
proper design and analysis are carried out
Factorially designed experiments for
microarrays

We can obtain expression data on the balanced
application of the factors, under the four conditions


Many studies are meant to pinpoint the
perturbation of genetic networks by
combinations of factors
For practical reasons, genes of interest are often
selected from multiple pairwise
fold-change values, without exploiting
replicates or modeling to
assess statistical significance

Biologically interpretable and
statistically reasonable models are
necessary to


make the most of the experiment and
make questions of interest answerable
The experiment
Targets

A target of a factor is a gene whose
expression ([mRNA]) is altered by the
presence of the factor


a primary target is a target that is directly
affected by the factor
a secondary target is a target whose
expression is altered only via the effects
of some other gene (can be traced back
to one or more primary targets)
Experimental questions


An experiment is performed on cells from an
estrogen-receptor-positive human breast cancer
cell line (MCF-7).
Questions of interest


Which genes are targets of estrogen?
Can we differentiate between primary and
secondary targets?
Experimental design



MCF-7 cells: ER+ breast cancer cell line
Two biologically independent replicates of each treatment
condition in a 2x2 factorial experiment (8 samples).
Factor 1: estrogen (ES)

Upon binding to ES, ER acts as a transcription factor for
certain genes

Factor 2: cycloheximide (CX)

Universal translation inhibitor, i.e., mRNA can be transcribed,
but it is not translated into protein
mRNA abundance was measured using Affymetrix
HGU95Av2 microarrays
Answering the questions …


We identify as targets all genes whose
expression of mRNA is affected by the
application of ES
A target can be either primary or secondary


primary if ES directly affects expression of mRNA
secondary if mRNA production is affected by some
other gene (can be traced back to a primary
target)
Different scenarios


The presence of ES and/or CX can
affect different targets in different ways
Several simplified scenarios considering
some possibilities are shown below
Scenario 1
Scenario 3
Statistical models
The linear model

Assume the following linear model for
the observed expression value (possibly
on transformed data):
$y_{ig} = \mu_g + \beta_{ES,g} x_{1i} + \beta_{CX,g} x_{2i} + \beta_{ES:CX,g} x_{1i} x_{2i} + \epsilon_{ig}$


i indexes chips and g indexes genes
x1 indicates the presence of ES and x2
indicates the presence of CX
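A minimal sketch of fitting this model to one gene, assuming two replicates per condition (consistent with the 8 samples) and made-up log-scale values. Because the model has one parameter per condition, the least-squares estimates reduce to differences of condition means, so no matrix algebra is needed. The names `DESIGN` and `fit_gene` are hypothetical.

```python
from statistics import mean

# One (x1, x2) pair per chip: x1 = 1 if ES was applied, x2 = 1 if CX was applied.
# Two replicates per condition is an assumption consistent with the 8 samples.
DESIGN = [(0, 0), (0, 0), (1, 0), (1, 0), (0, 1), (0, 1), (1, 1), (1, 1)]


def fit_gene(y, design=DESIGN):
    """Least-squares estimates for one gene in the 2x2 model with interaction."""
    cells = {}
    for value, key in zip(y, design):
        cells.setdefault(key, []).append(value)
    m = {key: mean(values) for key, values in cells.items()}
    return {
        "mu": m[(0, 0)],
        "ES": m[(1, 0)] - m[(0, 0)],
        "CX": m[(0, 1)] - m[(0, 0)],
        "ES:CX": m[(1, 1)] - m[(1, 0)] - m[(0, 1)] + m[(0, 0)],
    }


# Hypothetical log-scale expression values for a single gene, in DESIGN order.
y = [7.0, 7.2, 9.0, 9.2, 7.1, 6.9, 9.1, 9.3]
estimates = fit_gene(y)
print(estimates)
```

In practice the same fit would be repeated for every gene on the array.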
The meaning of the model
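$y_{ig} = \mu_g + \beta_{ES,g} x_{1i} + \beta_{CX,g} x_{2i} + \beta_{ES:CX,g} x_{1i} x_{2i} + \epsilon_{ig}$
None: $y_{ig} = \mu_g + \epsilon_{ig}$
CX only: $y_{ig} = \mu_g + \beta_{CX,g} + \epsilon_{ig}$
ES only: $y_{ig} = \mu_g + \beta_{ES,g} + \epsilon_{ig}$
ES and CX: $y_{ig} = \mu_g + \beta_{CX,g} + \beta_{ES,g} + \beta_{ES:CX,g} + \epsilon_{ig}$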
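Taking expectations of the four cases and subtracting, assuming the errors have mean zero, shows what ES does without and with CX. The conditional-expectation notation is introduced here only for illustration:

```latex
\begin{aligned}
E[y_{ig} \mid \text{ES only}] - E[y_{ig} \mid \text{None}] &= \beta_{ES,g} \\
E[y_{ig} \mid \text{ES and CX}] - E[y_{ig} \mid \text{CX only}] &= \beta_{ES,g} + \beta_{ES:CX,g}
\end{aligned}
```

The second line is the effect of ES when translation is blocked, which is why the sum of the ES main effect and the interaction matters for separating primary from secondary targets.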
Inference

Assuming normality (approximately, after
log-transformation), linear model
theory can be applied to



Obtain unbiased and efficient estimates of
$\beta_{ES}$, $\beta_{CX}$ and $\beta_{ES:CX}$.
Obtain measures of precision for estimates
Perform hypothesis testing
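As a sketch of the "measures of precision" step, the block below computes the pooled residual variance and the standard errors of the three effect estimates for one gene. It assumes a common error variance across conditions and uses made-up values; the function names and condition labels are hypothetical.

```python
from math import sqrt
from statistics import mean


def pooled_variance(cells):
    """Within-condition sum of squares divided by (n - number of conditions)."""
    rss = sum((v - mean(vals)) ** 2 for vals in cells.values() for v in vals)
    n = sum(len(vals) for vals in cells.values())
    return rss / (n - len(cells))


def standard_errors(cells):
    """Standard errors of the effect estimates, each a contrast of cell means."""
    s2 = pooled_variance(cells)
    inv = {name: 1.0 / len(vals) for name, vals in cells.items()}
    return {
        "ES": sqrt(s2 * (inv["ES"] + inv["none"])),
        "CX": sqrt(s2 * (inv["CX"] + inv["none"])),
        # The interaction involves all four condition means.
        "ES:CX": sqrt(s2 * sum(inv.values())),
    }


# Hypothetical log-scale values for one gene, two replicates per condition.
cells = {"none": [7.0, 7.2], "ES": [9.0, 9.2], "CX": [7.1, 6.9], "ES+CX": [9.1, 9.3]}
se = standard_errors(cells)
print(se)
```

With only two replicates per condition the variance estimate rests on 4 degrees of freedom, so the standard errors are themselves quite noisy.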
Parameter interpretation

$\beta_{ES}$ is interpreted as the effect of ES


genes for which $\beta_{ES}$ is different from zero are
potential targets
not all targets will have $\beta_{ES}$ different from
zero

$\beta_{CX}$ is interpreted as the effect due to CX


if $\beta_{CX}$ is different from zero, production of
mRNA is translationally regulated

$\beta_{ES:CX}$ is interpreted as “what is left” after
considering each main effect separately
Parameter values for scenario 1
Parameter          mRNA A   mRNA B
$\beta_{CX}$       = 0      = 0
$\beta_{ES}$       > 0      > 0
$\beta_{ES:CX}$    = 0      < 0
Parameter values for scenario 3
Parameter          mRNA A   mRNA B
$\beta_{CX}$       < 0      > 0
$\beta_{ES}$       < 0      > 0
$\beta_{ES:CX}$    < 0      < 0
ES target identification


A gene is identified as an ES target if
$\beta_{ES} \neq 0$ or $\beta_{ES:CX} \neq 0$, that is, if the hypothesis
$H_0$: $\beta_{ES} = \beta_{ES:CX} = 0$ is rejected
If a gene is an ES target, then it is



A primary ES target if $\beta_{ES} + \beta_{ES:CX} \neq 0$, or
A secondary ES target if $\beta_{ES} + \beta_{ES:CX} = 0$
This is decided by rejecting or failing to reject
the hypothesis $H_0$: $\beta_{ES} + \beta_{ES:CX} = 0$
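The slides do not state the test statistics, so the following is only a sketch using the standard linear-model F test for the joint hypothesis and a t test for the sum of the ES effect and the interaction. It assumes two replicates per condition (4 residual degrees of freedom) and uses closed-form p-values that hold only in that case; all function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean


def rss(groups):
    """Residual sum of squares when each group is fitted by its own mean."""
    return sum((v - mean(g)) ** 2 for g in groups for v in g)


def es_target_f_test(none, es, cx, both):
    """F test of H0: beta_ES = beta_ES:CX = 0 (inputs are lists of replicates)."""
    full = rss([none, es, cx, both])        # one mean per condition
    reduced = rss([none + es, cx + both])   # under H0, ES changes nothing
    df2 = len(none) + len(es) + len(cx) + len(both) - 4
    f = ((reduced - full) / 2) / (full / df2)
    p = (1 + 2 * f / df2) ** (-df2 / 2)     # exact tail area for 2 numerator df
    return f, p


def t4_two_sided_p(t):
    """Two-sided p-value for Student's t with exactly 4 degrees of freedom."""
    u = t * t / (1 + t * t / 4)
    cdf = 0.5 + 0.375 * (abs(t) / sqrt(1 + t * t / 4)) * (1 - u / 12)
    return 2 * (1 - cdf)


def primary_target_t_test(none, es, cx, both):
    """t test of H0: beta_ES + beta_ES:CX = 0, i.e. mean(both) = mean(cx)."""
    df = len(none) + len(es) + len(cx) + len(both) - 4
    if df != 4:
        raise ValueError("closed-form p-value here assumes 4 residual df")
    s2 = rss([none, es, cx, both]) / df
    t = (mean(both) - mean(cx)) / sqrt(s2 * (1 / len(both) + 1 / len(cx)))
    return t, t4_two_sided_p(t)


# Hypothetical log-scale values for one gene.
data = dict(none=[7.0, 7.2], es=[9.0, 9.2], cx=[7.1, 6.9], both=[9.1, 9.3])
f_stat, p_joint = es_target_f_test(**data)
t_stat, p_primary = primary_target_t_test(**data)
print(f_stat, p_joint, t_stat, p_primary)
```

For this made-up gene both hypotheses are rejected at 0.01, so it would be called a primary ES target: ES still raises its mRNA level when CX blocks translation.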
Multiple testing (1)


The test of $H_0$: $\beta_{ES} = \beta_{ES:CX} = 0$ is
performed individually on thousands of
genes, so a multiple testing adjustment is
required.
Control of the false discovery rate (FDR)
seems more appropriate for microarray
data than other procedures.
Multiple Testing (2)
             # not rejected   # rejected     totals
# true H     U                V (False +)    m0
# false H    T (False -)      S              m1
totals       m-R              R              m

* Per-comparison = E(V)/m
* Family-wise = p(V ≥ 1)
* Per-family = E(V)
* False discovery rate = E(V/R)
Multiple testing (3)



The method applied controls the FDR, so that it is
guaranteed not to exceed
a given threshold.
The method is conservative and tends to give
longer lists of genes
A rejected hypothesis indicates an ES
target. We can interpret the FDR as the expected
proportion of falsely identified ES targets
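The slides do not name the FDR-controlling method. As an illustration only, here is a sketch of the Benjamini-Hochberg step-up procedure, a common choice; the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.01):
    """Return one True/False per hypothesis: True means rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value falls under its threshold rank * q / m
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k = rank
    rejected = [False] * m
    for i in order[:k]:  # step-up: reject every hypothesis up to rank k
        rejected[i] = True
    return rejected


# Hypothetical p-values for four genes.
print(benjamini_hochberg([0.03, 0.035, 0.036, 0.9], q=0.05))
print(benjamini_hochberg([0.04, 0.01, 0.03, 0.5], q=0.05))
```

Note the step-up behaviour: a p-value above its own threshold can still be rejected when a larger p-value further down the sorted list meets its threshold.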
Outlier detection



Usually complicated in factorial experiments
The residuals from the fit of the linear model
must satisfy a number of constraints and
hence are not suitable for outlier detection
However, outlier detection is important since
the presence of outliers will inflate the
estimated variance and hence decrease our
ability to detect significant effects
Outliers
Outlier Detection (1)



The replicate structure of the experimental
design is used to locate single outliers in the
data set.
The algorithm is based on differences
between the replicate expression values that
are larger than expected
Assuming normality, a test statistic which
follows an F distribution is derived
Outlier Detection (2)


This method only identifies pairs with large
differences, not the single outlier itself.
Once pairs are identified, single outliers are
identified if one of the tagged replicates falls
outside the range:
$(\mathrm{med}_e - 4\,\mathrm{mad}_e,\ \mathrm{med}_e + 4\,\mathrm{mad}_e)$
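A minimal sketch of the range check, reading "med" as the median and "mad" as the median absolute deviation. The slides do not say which set of values these are computed over, so the `reference` argument is an assumption, as are the function names and the use of the unscaled MAD.

```python
from statistics import median


def mad(values):
    """Unscaled median absolute deviation from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


def outside_range(value, reference, k=4.0):
    """True if value lies outside (median - k*MAD, median + k*MAD) of reference."""
    center = median(reference)
    spread = mad(reference)
    return not (center - k * spread < value < center + k * spread)


# Hypothetical reference values; median is 10 and MAD is 1, so the range is (6, 14).
reference = [10, 11, 9, 10, 12, 10, 11, 9]
print(outside_range(30, reference), outside_range(11, reference))
```

Applied to the two replicates of a tagged pair, the one flagged by `outside_range` would be treated as the single outlier.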
Gene selection algorithm (1)
1. Average the replicate observations and exclude any
   genes with a maximum average less than 100 (using
   the PM-only model for gene expression in dChip).
   Remove all Affymetrix control sequences.
2. Apply any necessary transformations to satisfy
   normality, then test for single outliers. If outliers are
   identified, remove them from the data set.
3. Fit the linear model
   $y_{ig} = \mu_g + \beta_{ES,g} x_{1i} + \beta_{CX,g} x_{2i} + \beta_{ES:CX,g} x_{1i} x_{2i} + \epsilon_{ig}$
Gene selection algorithm (2)
4. Test $H_{0Est}$: $\beta_{ES} = \beta_{ES:CX} = 0$ for each gene.
5. Reject $H_{0Est}$ for the genes with the lowest
   resulting p-values using an FDR of 0.01. Call
   these genes ES targets.
6. For the ES targets, test $H_{0pt}$: $\beta_{ES} + \beta_{ES:CX} = 0$.
   1. Call genes with p-values < 0.01 for the test of
      $H_{0pt}$ primary ES targets.
   2. Call the remaining ES target genes secondary ES
      targets.
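The bookkeeping at the two ends of the algorithm, the intensity filter on averaged replicates and the final primary/secondary labelling, can be sketched as below. The function names, condition labels and values are hypothetical, and the FDR decision is taken as a given input rather than recomputed.

```python
from statistics import mean


def passes_intensity_filter(cells, threshold=100.0):
    """Keep a gene unless its largest condition average is below the threshold."""
    return max(mean(values) for values in cells.values()) >= threshold


def label_gene(is_es_target, p_primary, alpha=0.01):
    """Label a gene from the FDR decision and the p-value of the primary-target test."""
    if not is_es_target:
        return "not an ES target"
    return "primary ES target" if p_primary < alpha else "secondary ES target"


# Hypothetical untransformed expression values, two replicates per condition.
low = {"none": [40, 50], "ES": [60, 70], "CX": [45, 55], "ES+CX": [65, 75]}
high = {"none": [80, 90], "ES": [150, 170], "CX": [85, 95], "ES+CX": [90, 100]}
print(passes_intensity_filter(low), passes_intensity_filter(high))
print(label_gene(True, 0.002), label_gene(True, 0.4), label_gene(False, 0.002))
```

A gene is labelled secondary when the test fails to reject, which is weaker evidence than a rejection; with few replicates some primary targets may end up in the secondary list.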
Results (1) Primary targets
$\beta_{ES} \neq 0$ or $\beta_{ES:CX} \neq 0$
$\beta_{ES} + \beta_{ES:CX} \neq 0$
Results (2) Secondary targets
$\beta_{ES} \neq 0$ or $\beta_{ES:CX} \neq 0$
$\beta_{ES} + \beta_{ES:CX} = 0$
Conclusions


For gene selection using data from factorially
designed microarray studies, linear models
offer a natural paradigm for analysis, so long as
careful consideration is given to the
interpretation of the model parameters.
The use of CX in this experiment is one
example of a treatment that allows for the
identification of primary and secondary ES
targets.
Conclusions (2)


For experiments with more treatments
of interest, fractional factorial designs
may be applicable.
The candidate genes that are selected
using linear models would serve as
good candidates for network
reconstruction algorithms.
Acknowledgments

Special thanks to Denise Scholtens and
Robert Gentleman, Biostatistics,
Harvard U. for making their materials
available
Disclaimer



The goal of this presentation is to discuss the
contents of the paper indicated in the title
Copyrighted images have been taken from
the corresponding journals or from slide
shows found on the internet, with the sole aim of
facilitating the discussion
All credit for them belongs to the
authors of the papers or the slide shows, and
we wish to thank them for making them
available