Transcript design
Statistical Principles of
Experimental Design
Chris Holmes
Thanks to Dov Stekel
Sources of Variation
• Systematic:
• Dye effects, print tips
• Chance, or random variation
• Population variability
• Unforeseen effects
• E.g. unnoticed change in experimental
conditions
Overview
• Fundamental concepts: confidence and
power
• How many replicates?
• Blocking and randomization
• Arrangement of samples and arrays
Microarray
• We are interested in detecting
differentially expressed genes
• When stating that a gene is (or is not)
differentially expressed, two
complementary things can go wrong
1. You say it is differentially
expressed and it’s not
2. You say it’s not differentially
expressed and it is!
Type I and Type II Errors
                    Our Call
Truth               H0               H1
H0 – No effect      Correct          Type I error
H1 – Effect         Type II error    Correct
Confidence
• The confidence is the probability of not getting
a false positive (FP) result.
• FP: It’s not diff. expressed and you say it is
• It is the probability of accepting the null
hypothesis when the null hypothesis is true.
• A false positive result is known as a Type I
Error.
• We control for Type I errors explicitly by
selecting an appropriate confidence level
• In microarray experiments, we must modify
the confidence level to account for multiplicity
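To see why multiplicity matters, here is a minimal sketch, assuming the genes are tested independently and that none is truly differentially expressed. The function names are illustrative, and the Bonferroni-style division is one common adjustment, not necessarily the one intended by the slides.

```python
def expected_false_positives(n_genes, alpha):
    """Expected number of Type I errors if every null hypothesis is true."""
    return n_genes * alpha


def bonferroni_threshold(n_genes, family_alpha):
    """Per-gene threshold that keeps the family-wise error rate at or below family_alpha."""
    return family_alpha / n_genes


if __name__ == "__main__":
    # 10,000 genes tested at the usual 5% level: hundreds of false positives expected.
    print(expected_false_positives(10_000, 0.05))
    # Per-gene threshold needed to hold the family-wise rate at 5%.
    print(bonferroni_threshold(10_000, 0.05))
```

With 10,000 genes, a per-gene threshold of 0.0001 corresponds to about one expected false positive on the whole array.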
Power
• The power is the probability of not getting a
false negative (FN) result.
• FN: It is diff expressed and you miss it
• It is the probability of rejecting the null
hypothesis when the null hypothesis is false.
• A false negative result is known as a Type II
Error.
• We control the power implicitly via the
confidence level and the experimental design.
Power Analysis
• Calculating the Power of a study is a
vital part of experimental design
• An overpowered study is wasteful of
resources
• An underpowered study will be unable to
reveal interesting results
Power Analysis
• The power of an analysis depends on
the following factors:
• True (unknown) difference in mean we
are trying to detect
• (Unknown) Standard deviation of the
population
• Chosen significance threshold
• Type of test
• Number of replicates
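The dependence of power on these factors can be sketched with a normal approximation. This is a simplified stand-in for a t-test power calculation: it treats the standard deviation as known, so it is optimistic for small numbers of replicates. The function name and the example values are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n, delta, sd, sig_level=0.05):
    """Approximate power of a two-sided one-sample z-test.

    n: number of replicates, delta: true difference in mean,
    sd: standard deviation, sig_level: significance threshold.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - sig_level / 2)
    shift = delta * sqrt(n) / sd
    # Probability that the test statistic falls beyond either critical value.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Power rises with the number of replicates for a fixed effect and sd.
    for n in (5, 10, 20, 40):
        print(n, round(approx_power(n, delta=0.5, sd=1.0), 3))
```

Larger differences, smaller standard deviations, looser thresholds and more replicates all increase the value returned.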
Experimental Variability
• Individuals
• Sample preparations
• Dyes
• Print runs
• Pins
• Arrays
• Hybridizations
• Laboratories
• Researchers
• Imaging
• Software
Experimental Variability
• In order to perform a power analysis we must
measure and quantify the levels of
experimental variability in our system:
• Calibration experiments
• Pilot experiments
• We perform the analysis relative to the largest
source of variability.
• The best way to get maximum power from a
statistical analysis is to minimise the level of
experimental error and noise
Power Analysis Assumptions
• We assume that the data is approximately log
normally distributed
• This corresponds to standard deviation of the
errors of the raw data being proportional to the
signal intensity
• This is equivalent to a constant standard
deviation in the logged data
• The standard deviation divided by the mean is
called the coefficient of variation
Log Normally Distributed Data
Fold Ratios and Mean Differences
• If the data is log normally distributed:
• The difference in mean in the logged data is
equal to the log of the fold ratio of the raw
data.
• The standard deviation (s) of the logged data
relates to the coefficient of variation (v) of the
raw data via the formula:
s = sqrt(ln(v^2 + 1)) / ln(2)
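The two relations on this slide can be written as small helper functions. This is a minimal sketch assuming the logged data are on the log2 scale, as the division by ln(2) implies; the function names are illustrative.

```python
from math import log, log2, sqrt


def log2_sd_from_cv(cv):
    """Standard deviation of log2 data, given the coefficient of variation of log-normal raw data."""
    return sqrt(log(cv ** 2 + 1)) / log(2)


def log2_mean_difference(fold_ratio):
    """Difference in mean of the log2 data that corresponds to a fold ratio in the raw data."""
    return log2(fold_ratio)


if __name__ == "__main__":
    print(log2_sd_from_cv(0.4))        # cv given as a fraction, i.e. 40%
    print(log2_mean_difference(2.0))   # a 2-fold change
```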
Power Analysis
• We will use the power.t.test() function in
R to calculate the power of one- and two-sample tests
• power.t.test(n, delta, sd,
sig.level, power, type,
alternative)
• The function is called with exactly one of the
first five arguments omitted and calculates
the omitted value
Power Analysis Example:
Doxorubicin Chemotherapy
• We are interested in the treatment of breast
cancer patients with doxorubicin
chemotherapy
• We want to perform a microarray experiment
to determine genes that are up- or down-regulated
as a result of the chemotherapy
• We would like to know:
• How to design the experiment?
• How many patients we need?
Paired vs Unpaired Design
• In a paired design, we take samples from
each patient before and after treatment, and
for each gene, look at the difference in
expression before and after treatment
• In an unpaired design, we have two groups of
patients, one group treated, the other group
untreated. We look at the difference in gene
expression between the two groups
• Which is a better experiment?
Power Analysis Assumptions
• Suppose we know from a pilot study and
evaluation of our technology that the
coefficient of variation is 40%
• Let's say that we want to detect genes that are
2-fold regulated
• We are testing 10,000 genes so we will use a
significance threshold of 0.0001 to compensate
for multiplicity
• How many patients do we need for a power of
80%, 90% and 99%?
Paired Experiment
• The standard deviation of the underlying
normal distribution equivalent to 40%
variability is 0.39 in natural-log units, which
is 0.56 in log2 units
• The difference in means is log2(2) = 1
• The number of patients we need is:
Power     80%   90%   99%
Number    14    16    20
Unpaired Experiment
• The standard deviation and difference in
means is the same.
• The number of patients we need is:
Power              80%   90%   99%
Group Size         18    21    28
Number             36    42    56
1-Sample Number    14    16    20
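The sample sizes above come from a t-test power calculation. As a rough cross-check, the sketch below uses a normal approximation with a small-sample correction term. This approximation is an assumption of the sketch, not the method used for the slides, and exact t-test results may differ by a patient or so. It takes delta = 1 and the log2-scale standard deviation that the s formula gives for v = 0.4 (about 0.556).

```python
from statistics import NormalDist


def approx_n(delta, sd, sig_level, power, two_sample=False):
    """Approximate replicates needed for a two-sided t-test.

    Returns n for a one-sample (paired) test, or n per group for a
    two-sample (unpaired) test. The value is not rounded.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - sig_level / 2)
    z_beta = z.inv_cdf(power)
    d = delta / sd  # standardised effect size
    if two_sample:
        return 2 * (z_alpha + z_beta) ** 2 / d ** 2 + z_alpha ** 2 / 4
    return (z_alpha + z_beta) ** 2 / d ** 2 + z_alpha ** 2 / 2


if __name__ == "__main__":
    for power in (0.80, 0.90, 0.99):
        paired = approx_n(1.0, 0.556, 0.0001, power)
        per_group = approx_n(1.0, 0.556, 0.0001, power, two_sample=True)
        print(power, round(paired, 1), round(per_group, 1))
```

Rounding the printed values up to whole patients should land close to the tabulated numbers, and the unpaired design always needs more patients in total.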
Paired vs Unpaired
• In this example, we need more than
twice the patients in the unpaired
experiment to obtain the same power as
the paired experiment
• Paired experimental design is more
powerful than unpaired experimental
design because the differences between
individuals are factored out in the
analysis
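A small simulation can illustrate how pairing factors out differences between individuals. This is a sketch only: the between-patient and within-patient standard deviations (sd_between, sd_within) are invented for illustration and do not come from the slides.

```python
import random
from statistics import stdev


def simulate_designs(n_patients, effect=1.0, sd_between=1.0, sd_within=0.3, seed=0):
    """Return (sd of paired differences, sd of differences between unrelated patients)."""
    rng = random.Random(seed)
    before, after = [], []
    for _ in range(n_patients):
        baseline = rng.gauss(0.0, sd_between)  # patient-specific expression level
        before.append(baseline + rng.gauss(0.0, sd_within))
        after.append(baseline + effect + rng.gauss(0.0, sd_within))
    # Paired: the same patient before and after, so the baseline cancels.
    paired = [a - b for a, b in zip(after, before)]
    # Unpaired: treated value compared with a different patient's untreated value.
    unpaired = [after[i] - before[(i + 1) % n_patients] for i in range(n_patients)]
    return stdev(paired), stdev(unpaired)


if __name__ == "__main__":
    paired_sd, unpaired_sd = simulate_designs(2000)
    print(round(paired_sd, 2), round(unpaired_sd, 2))
```

The paired differences vary much less than the unpaired ones, so the same treatment effect is easier to detect.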
Blocking, Randomization and
Blinding
• Arrangement of experimental design that
minimises problems from extraneous
sources of variability
• Use blocking to avoid confounding
• Use randomization and blinding to avoid
bias
Toxicity Example
• We are interested in characterising the
toxic effect of Benzo(a)pyrene (BP) on
rats
• 8 Rats are to be treated with BP and 8
rats with a control compound
• Each array will be hybridized against a
reference sample
• 16 Arrays in the experiment
Experimental Design
• Suppose there are two batches of 8
slides from two different print runs (1 and
2)
• Hybridisation will be done by two
researchers, Alison and Brian.
• What is the best way to arrange the
experiment?
Design 1
• Alison prepares all 8 BP samples and
hybridises them to the arrays of print run
1
• Brian prepares all 8 control samples and
hybridises them to the arrays of print run
2
Design 2
• Alison chooses 8 rats and treats 4 with BP
and 4 with control substance.
• She prepares and hybridises 2 BP samples to
arrays from print run 1 and 2 BP samples to
arrays from print run 2
• She prepares and hybridises 2 control
samples to arrays from print run 1 and 2
control samples to arrays from print run 2
• Brian does the same with the other 8 rats
Design 2 (samples per cell)
            Alison                        Brian
            Print Run 1    Print Run 2    Print Run 1    Print Run 2
Control     2              2              2              2
Treated     2              2              2              2
Design 3
• 8 rats are randomly assigned to Alison, along
with 4 BP preparations and 4 control
preparations. She is not told which
preparations are which.
• She prepares and hybridises samples to
randomly pre-arranged arrays so that 2 BP
samples and 2 control samples are hybridised
to 4 arrays from each of print runs 1 and 2.
• Brian does the same with the other 8 rats
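A layout like Design 3 could be generated programmatically. The sketch below is one possible way to do it, assuming 16 rats, two researchers and two print runs. The function name, the dictionary fields and the blind_code labels are hypothetical.

```python
import random


def randomised_block_design(rat_ids, researchers=("Alison", "Brian"),
                            print_runs=(1, 2), seed=None):
    """Randomly allocate rats to researchers, then to balanced treatment/print-run cells."""
    rng = random.Random(seed)
    rats = list(rat_ids)
    cells_per_block = 2 * len(print_runs)  # two treatments x print runs
    if len(rats) % (len(researchers) * cells_per_block):
        raise ValueError("number of rats does not allow a balanced design")
    rng.shuffle(rats)  # random allocation of rats to researchers
    per_researcher = len(rats) // len(researchers)
    per_cell = per_researcher // cells_per_block
    design = []
    for index, researcher in enumerate(researchers):
        block = rats[index * per_researcher:(index + 1) * per_researcher]
        # Balanced set of (treatment, print run) cells, in random order within the block.
        cells = [(t, p) for t in ("BP", "control") for p in print_runs
                 for _ in range(per_cell)]
        rng.shuffle(cells)
        for rat, (treatment, print_run) in zip(block, cells):
            design.append({"rat": rat, "researcher": researcher,
                           "treatment": treatment, "print_run": print_run})
    # Blinding: researchers would see only the code, not the treatment.
    codes = [f"S{i:02d}" for i in range(1, len(design) + 1)]
    rng.shuffle(codes)
    for row, code in zip(design, codes):
        row["blind_code"] = code
    return design


if __name__ == "__main__":
    for row in randomised_block_design(range(1, 17), seed=1):
        print(row)
```

The key linking blind_code to treatment would be held by someone other than the two researchers.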
What is wrong with design 1?
• Treatment, researcher and print run are
confounded variables
• We cannot tell whether differences between
the two groups of rats result from treatment,
researcher or print run
• Use blocking in designs 2 and 3 to
deconfound the variability of interest
(treatment) from the extraneous variabilities
(researcher and print run)
• Designs 2 and 3 are also balanced which
increases power of analyses
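Confounding of this kind can be checked mechanically from a design table. The sketch below flags pairs of factors whose levels map one-to-one onto each other, which is the situation in Design 1. The data layout and the function name are assumptions made for illustration.

```python
from itertools import combinations


def confounded_pairs(design, factors):
    """Return pairs of factors whose levels determine each other one-to-one."""
    pairs = []
    for a, b in combinations(factors, 2):
        combos = {(row[a], row[b]) for row in design}
        levels_a = {row[a] for row in design}
        levels_b = {row[b] for row in design}
        # One-to-one mapping: as many observed combinations as levels of each factor.
        if len(combos) == len(levels_a) == len(levels_b):
            pairs.append((a, b))
    return pairs


factors = ("treatment", "researcher", "print_run")

# Design 1: Alison does all BP samples on print run 1, Brian all controls on print run 2.
design_1 = ([{"treatment": "BP", "researcher": "Alison", "print_run": 1}] * 8
            + [{"treatment": "control", "researcher": "Brian", "print_run": 2}] * 8)

# Design 2: every researcher/treatment/print-run combination occurs twice.
design_2 = [{"treatment": t, "researcher": r, "print_run": p}
            for r in ("Alison", "Brian") for t in ("BP", "control")
            for p in (1, 2) for _ in range(2)]

if __name__ == "__main__":
    print(confounded_pairs(design_1, factors))
    print(confounded_pairs(design_2, factors))
```

For Design 1 every pair of factors is flagged, while the blocked design yields an empty list.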
What is wrong with design 2?
• Alison's choice of rats may be biased
• For example, she may choose the
healthiest rats, so confounding potential
treatment effects with researcher
variability
• Use randomization and blinding in
design 3 to avoid bias
Arrangement of Samples and Arrays
• Is it better to use Affymetrix arrays or a
two-colour array system?
• If using a two-colour array system, is it
better to use a reference sample?
• If using a two-colour array system, what
is the best arrangement of samples on
the slides?
Several Factors
• Available technology
• Cost
• Statistical considerations
• We consider the problem from the perspective
of three different experiments
Example 1:
Hepatocellular Carcinomas
• 20 Samples are taken from disease and
healthy tissue from patients suffering
from hepatocellular carcinomas and
hybridised to microarrays. We would
like to identify genes that are up- or
down-regulated in hepatocellular
carcinomas relative to healthy tissue.
Design 1.1
[Figure: Array 1: Healthy 1 vs Reference; Array 2: Disease 1 vs Reference. x 20]
Design 1.2
[Figure: GeneChip 1: Healthy 1; GeneChip 2: Disease 1. x 20]
Design 1.3
[Figure: Array 1: Healthy 1 vs Disease 1. x 20]
Design 1.4
[Figure: Array 1: Healthy 1 vs Disease 1, x 10; Array 11: Healthy 11 vs Disease 11, x 10]
Design 1.5
[Figure: Array 1: Healthy 1 vs Disease 1; Array 2: Healthy 1 vs Disease 1. x 20]
Which is the best design?
• Simple experiment - five different
designs!
• Design 1.1 is bad because it increases
variability.
• Design 1.3 is bad because it
confounds colour with disease state.
• Designs 1.4 and 1.5 are best.
Design 1.1
[Figure: Array 1: Healthy vs Reference; Array 2: Disease vs Reference]
• Coefficient of
Variability is 30%
• Design increases
variability to 43%
Design 1.5
[Figure: Array 1: Healthy vs Disease; Array 2: Healthy vs Disease]
• Coefficient of
Variability: 30%
• Experimental
design reduces
variability to 21%
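The slides do not show how the 43% and 21% figures were obtained. One calculation that reproduces them, assuming log-normal data, doubles the log-scale variance when two independent reference comparisons are combined (Design 1.1) and halves it when two replicate measurements are averaged (Design 1.5). The function name is illustrative.

```python
from math import exp, log, sqrt


def cv_after_scaling_log_variance(cv, variance_factor):
    """Coefficient of variation after multiplying the log-scale variance by variance_factor."""
    log_variance = log(cv ** 2 + 1)  # variance of the natural-log data
    return sqrt(exp(variance_factor * log_variance) - 1)


if __name__ == "__main__":
    print(round(cv_after_scaling_log_variance(0.30, 2.0), 2))  # 0.43
    print(round(cv_after_scaling_log_variance(0.30, 0.5), 2))  # 0.21
```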
Example 2:
B-Cell Lymphomas
• Samples are taken from 60 patients
suffering from B-cell lymphomas and
hybridised to microarrays. The aim of
the experiment is to identify clinically
relevant subgroups of patients using a
cluster analysis, and then to build a
classification model to differentiate
between the subgroups.
Design 2.1
[Figure: Array 1: Patient 1 vs Patient 2. x 30]
Design 2.2
[Figure: Array 1: Patient 1 vs Reference. x 60]
Design 2.3
[Figure: GeneChip 1: Patient 1. x 60]
Which design is best?
• Design 2.1 is bad because it is difficult
to compare patients on equal footing.
• Designs 2.2 and 2.3 are good.
• Probably most appropriate use of
Affymetrix technology.
Example 3:
Yeast Time Series
• Budding yeast can reproduce sexually
by producing haploid cells through a
process called sporulation. Yeast was
placed in a sporulating medium,
samples taken at 7 timepoints from the
start of sporulation. We are interested in
identifying genes that show similar
profiles in the timecourse.
Design 3.1
[Figure: six arrays. Array k: Time 0 vs Time k, for k = 1 to 6]
Design 3.2
[Figure: seven arrays. Arrays 1 to 6: Time k-1 vs Time k; Array 7: Time 6 vs Time 0]
Design 3.3
[Figure: seven GeneChips. GeneChip k: Time k-1, for k = 1 to 7]
Which is the best design?
• Design 3.3 requires careful
normalisation because timepoint is
confounded with array.
• Design 3.2 is a loop design. It is a good
design, but harder to analyse.
• Design 3.1 may be the best design.
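The array pairings of the reference design (3.1) and the loop design (3.2) can be listed with a few lines of code. This is a sketch in which each array is represented as a pair of timepoint indices. It ignores dye assignment, and the function names are illustrative.

```python
def reference_design(n_timepoints):
    """Each later timepoint shares an array with time 0 (as in Design 3.1)."""
    return [(0, t) for t in range(1, n_timepoints)]


def loop_design(n_timepoints):
    """Each timepoint shares an array with the next one, closing the loop (as in Design 3.2)."""
    return [(t, (t + 1) % n_timepoints) for t in range(n_timepoints)]


def times_hybridised(design, timepoint):
    """Number of arrays on which a timepoint appears."""
    return sum(timepoint in pair for pair in design)


if __name__ == "__main__":
    print(reference_design(7))
    print(loop_design(7))
```

In the loop design every timepoint is measured on two arrays, whereas in the reference design time 0 is measured on every array and each other timepoint only once.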
Conclusions
• Number of replicates:
• Calculate using power analyses
• Extraneous variability:
• Block to avoid confounding variables
• Randomisation to avoid bias
• Blocked experiments require ANOVA
analyses
Conclusions
• Two sample experiments
• Reference samples increase variability.
• Hybridise both samples to same array.
• Multiple patient comparisons
• Reference samples or Affymetrix
technology enable comparisons.
• Time series analysis
• Reference samples are useful.
Practical
• Use reference sample to estimate coefficient
of variability
• Power analysis for population inference test
• Power and false positive analysis for
differentially expressed genes in a single
patient