Using Statistical Design and Analysis to Detect

Statistical Design and Analysis
of Microarray Experiments
Peng Liu
6/15/2010
1
Microarray Technology
• Microarray technology measures the expression levels (abundance of mRNA transcripts) of thousands of genes simultaneously.
• Two types of platforms:
    - Affymetrix (single-color)
    - Two-color microarray
2
Wild-type vs. Myostatin Knockout Mice
Belgian Blue cattle have a mutation in the myostatin gene.
Design of Affymetrix experiment: one sample → one chip
Designing 2-color microarray (3 layers)
From Churchill, 2002, Nature Genetics
4
Example I: Sawers et al, 2007, BMC Bioinformatics
[Figure: bundle sheath strands and mesophyll protoplasts; labels M, B, V]
5
Example I: Sawers et al, 2007, BMC Bioinformatics
• The establishment of C4 photosynthesis in maize is associated with differential accumulation of gene transcripts and proteins between bundle sheath and mesophyll photosynthetic cell types.
• Goal: to detect genes that are differentially expressed between bundle sheath (B) and mesophyll (M) cells.
6
Example I: Sawers et al, 2007, BMC Bioinformatics
• A simple method: isolate the cells and perform a microarray experiment to compare gene expression between the two cell types (treatments).
7
Example I: Sawers et al, 2007, BMC Bioinformatics
• A complication: the procedures for extracting mRNA from the two cell types are different. The one used to extract mRNA from M cells introduces stress.
• Solution: add two more treatment groups, namely samples containing both M and B cells that go through mRNA extraction with and without stress.
• Four treatment groups: B, M, Stress and Total
8
Direct comparison vs indirect comparison
• Direct: comparison within slide
• Indirect: comparison between slides
• Suppose we want to compare gene expression levels between treatment 1 and treatment 2.
[Figure: Direct Comparison: treatments 1 and 2 paired on the same slides. Indirect Comparison: treatments 1 and 2 each paired with a reference R.]
9
Comments about 2-color Microarray Designs
• A unique and powerful feature of the 2-color microarray is that two samples can be compared directly on the same slide.
• Because the samples are paired, the variation due to slide can be accounted for.
• When possible, it is more efficient to use direct comparison.
• However, it is sometimes not practical to make direct comparisons of all possible pairs.
10
Efficiency of comparison
• The efficiency of a comparison between 2 samples is determined by the length and the number of paths connecting them.
[Figure: Direct Comparison (Dye-swap): treatments 1 and 2 paired on two slides. Indirect Comparison: treatments 1 and 2 connected only through a reference R.]
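As a rough illustration of why path length matters, the sketch below assumes that every slide yields one log-ratio with the same variance sigma2 and that slides are independent. These assumptions, and the function names, are introduced here for illustration and are not taken from the slides. With two slides, averaging a dye-swap pair estimates the 1 vs 2 difference with variance sigma2/2, while going through a reference R subtracts two log-ratios and gives 2*sigma2.

```python
def var_direct(sigma2, n_slides=2):
    """Variance of the mean of n_slides independent direct log-ratios (1 vs 2)."""
    return sigma2 / n_slides


def var_indirect(sigma2, n_slides_per_arm=1):
    """Variance of (1 vs R) minus (2 vs R), each arm averaged over its own slides."""
    return 2 * sigma2 / n_slides_per_arm


sigma2 = 1.0                                            # assumed per-slide variance
direct = var_direct(sigma2, n_slides=2)                 # dye-swap pair, 2 slides
indirect = var_indirect(sigma2, n_slides_per_arm=1)     # reference design, 2 slides
relative_efficiency = indirect / direct
print(direct, indirect, relative_efficiency)
```

Under these assumptions the same two slides give a four times smaller variance when used for a direct comparison.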
11
Reference vs Loop design
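[Figure: Reference Design: samples 1, 2 and 3 each paired with a reference R. Loop Design: samples 1, 2 and 3 paired with one another in a loop.]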
12
Designing experiment for example I
[Figure: design for the four treatment groups B, Total, Stress and M, with 6 biological replicates]
13
Performing the experiment (Nature Cell Biol. 2001 3:8)
14
After the bench work…
2-color microarray image
Affymetrix Gene Chip image
15
The data table looks like

Header
Begin Raw Data

Column headers: Row, Column, Gene ID, Field, Meta Row, Meta Column, Flag, and the mean/median signal and background intensities.

No.  Gene ID     Position fields  Flag  Background  Signal
1    MZ00040724  1, 1, 1, A       0     533         1645.5
2    MZ00040730  1, 1, 1, A       2     469         613
3    MZ00040748  1, 1, 1, A       0     462         741.5
4    MZ00040754  1, 1, 1, A       0     473         909
5    MZ00040772  1, 1, 1, A       0     471.5       964
6    MZ00040778  1, 1, 1, A       2     469         574
7    MZ00040796  1, 1, 1, A       2     487         579
8    MZ00040802  1, 1, 1, A       3     614         38051
9    MZ00013020  1, 1, 1, A       3     516.5       4539
10   MZ00013026  1, 1, 1, A       3     491.5       597.5
11   MZ00013044  1, 1, 1, A       3     521.5       16210
16
Pre-normalization analysis
• Image processing: obtain the intensity measurement of the signal
• Background correction: remove local background that might be due to nonspecific binding and obtain the target sample intensity
• Filtration: remove unreliable spots and reduce the dimension of the data
• Transformation: convert the data into a format that makes data analysis valid or easier
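The background correction, filtration and transformation steps can be sketched in a few lines of Python. The intensities are taken from the data table and read here as (background, signal) pairs, which is an interpretation of the flattened table. The threshold MIN_NET_SIGNAL and the function name preprocess are hypothetical choices made only for this illustration.

```python
import math

# (gene_id, background, signal) for a few spots of the data table
spots = [
    ("MZ00040724", 533.0, 1645.5),
    ("MZ00040730", 469.0, 613.0),
    ("MZ00040796", 487.0, 579.0),
    ("MZ00040802", 614.0, 38051.0),
]

MIN_NET_SIGNAL = 100.0  # hypothetical cut-off for an unreliable (weak) spot


def preprocess(spots, min_net=MIN_NET_SIGNAL):
    """Background-correct, filter and log2-transform spot intensities."""
    kept = {}
    for gene_id, background, signal in spots:
        net = signal - background          # background correction
        if net < min_net:                  # filtration: drop weak spots
            continue
        kept[gene_id] = math.log2(net)     # transformation to the log2 scale
    return kept


log_intensity = preprocess(spots)
for gene_id, value in log_intensity.items():
    print(gene_id, round(value, 3))
```

With this threshold the spot MZ00040796 (net signal 92) is filtered out and the other three are kept.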
17
Normalization
• Normalization describes the process of removing (or minimizing) non-biological variation in measured signal intensity levels so that biological differences in gene expression can be appropriately detected.
• Aim: remove sources of systematic variation
• Example of non-biological variation: dye difference for 2-color microarray
18
Figure from Dudoit et al, 2002, Statistica Sinica
Self-self experiment
19
Normalization: M vs. A Plot (45° rotation)
M = Log Red - Log Green
A = (Log Green + Log Red)/2
20
[Figure: LOWESS fit on the M vs. A plot; vertical axis Log Red - Log Green, horizontal axis (Log Green + Log Red)/2]
21
After normalization
[Figure: normalized M plotted against A]
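The M vs. A normalization can be sketched as follows: compute M and A for every spot, fit a smooth trend of M on A, and subtract the trend. The smoother below is a simplified tricube-weighted local linear fit, which is the core of LOWESS without its robustness iterations, so it is a stand-in and not the exact procedure used for the figures. All function names and the toy intensities are assumptions for illustration.

```python
import math


def ma_values(red, green):
    """Return (M, A) lists from raw red and green intensities (log base 2)."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    return m, a


def local_linear_fit(x, y, frac=0.5):
    """Tricube-weighted local linear fit of y on x, evaluated at each x."""
    n = len(x)
    k = min(n, max(2, math.ceil(frac * n)))    # neighbours used per local fit
    fitted = []
    for x0 in x:
        dist = sorted(abs(xi - x0) for xi in x)
        h = dist[k - 1] or 1e-12               # bandwidth: k-th nearest distance
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def normalize(red, green, frac=0.5):
    """Subtract the fitted intensity-dependent trend from M."""
    m, a = ma_values(red, green)
    trend = local_linear_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)], a


# Toy self-self style data: red is uniformly twice as bright as green (pure dye bias).
green = [100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0]
red = [2.0 * g for g in green]
normalized_m, a_vals = normalize(red, green)
print([round(v, 6) for v in normalized_m])
```

In this toy case the dye bias is constant, so the normalized M values are all essentially zero, which is what a self-self experiment should show after normalization.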
22
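Statistical Inference
• Data notation for normalized signal intensities (NSI): Yijk for each gene (g)
    - i: treatment index
    - j: dye index
    - k: slide index
• Example: Y114 is the observation for treatment 1 with dye 1 on slide 4; Y224 is the observation for treatment 2 with dye 2 on slide 4.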
23
Fitting linear models to microarray data
• After normalization, we have one observation (normalized signal intensity) for each gene on each channel (a combination of dye and array).
• Together, the data form an array with one row per gene and one column per channel or chip.
• We will fit a statistical model for each gene separately.
24
Mean expressions for 4 treatment groups

Treatment                        Mean
M (M cell with stress)           μ + v2 + δ
B (B cell without stress)        μ + v1
TO (both cells without stress)   μ + c*v2 + (1-c)*v1
ST (both cells with stress)      μ + c*v2 + (1-c)*v1 + δ

• Here δ stands for the stress effect; the slide's own symbol for it did not survive extraction.
• Note that c is the proportion of M cells in the total leaf sample with both cells.
• We are interested in testing H0: v1 = v2, that is, whether a given gene is differentially expressed between M and B cells.
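A small numerical sketch may help show why the two extra treatment groups make v1 - v2 estimable. The name delta for the stress effect and all parameter values below are made up for illustration. The contrast (B - M) + (ST - TO) cancels both the overall mean and the stress effect and leaves v1 - v2.

```python
def treatment_means(mu, v1, v2, delta, c):
    """Expected expression of one gene in each of the four treatment groups."""
    mixed = c * v2 + (1 - c) * v1      # M and B cells mixed in proportion c
    return {
        "M": mu + v2 + delta,          # M cells, extraction with stress
        "B": mu + v1,                  # B cells, extraction without stress
        "TO": mu + mixed,              # both cell types, without stress
        "ST": mu + mixed + delta,      # both cell types, with stress
    }


# Hypothetical parameter values for one gene
means = treatment_means(mu=8.0, v1=1.0, v2=2.0, delta=0.5, c=0.25)

stress_effect = means["ST"] - means["TO"]             # recovers delta
v1_minus_v2 = (means["B"] - means["M"]) + stress_effect
print(means, stress_effect, v1_minus_v2)
```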
25
Fixed effects
• The parameters on the previous slide (v1, v2, and the stress effect) specify fixed effects.
• Fixed effects are used to specify the mean of the response variable.
• A factor is fixed if the levels of the factor were selected by the investigator with the purpose of comparing the effects of the levels to one another.
• The fixed effects included in the model depend on the experimental design.
26
Random effects
• There are some random effects that are unknown:
    - slide effects
    - other effects introduced in the experiment (such as biological replicate effects)
    - residual random effects that include any sources of variation unaccounted for by other terms
[Figure: design diagram for the treatment groups B, Total, Stress and M]
27
Random effects
• Random factors are used to specify the correlation structure among the response variable observations.
    - e.g., observations on the same slide are more correlated than observations from different slides.
• The random effects included in the model also depend on the experimental design.
• A model that has both fixed and random effects is called a mixed model.
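One common way to write such a mixed model for a single gene is shown below. It is an assumed illustrative form using the Yijk notation, not necessarily the exact model fitted in the lecture: the treatment mean and a dye effect d_j are fixed, while the slide effect s_k and the residual e_ijk are random.

```latex
Y_{ijk} = \mu_i + d_j + s_k + e_{ijk},
\qquad s_k \sim N(0, \sigma_s^2),
\qquad e_{ijk} \sim N(0, \sigma_e^2)

% Two observations on the same slide k share s_k, so
\operatorname{Cov}(Y_{ijk}, Y_{i'j'k}) = \sigma_s^2,
\qquad
\operatorname{Corr}(Y_{ijk}, Y_{i'j'k}) = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_e^2}
```

Under this form, observations from different slides are uncorrelated, which matches the statement that observations on the same slide are more correlated than observations from different slides.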
28
Detecting differentially expressed genes
• Construct statistical tests for the parameters of interest, e.g., what is the difference in gene expression (v1 - v2)?
    - v1 - v2 ≠ 0 means differential expression.
• Model the random effects and perform tests or construct confidence intervals.
• Perform tests for each gene and obtain a p-value.
    - An empirical Bayes test that borrows information across genes is often used because of its higher power.
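The idea of borrowing information across genes can be sketched with a moderated t-statistic, in which a gene's own variance is shrunk toward a prior variance. This is a simplified illustration that assumes a direct-comparison design where each slide gives one log-ratio for the gene; it is not the mixed-model analysis of Example I. In practice the prior variance and prior degrees of freedom are estimated from all genes; here prior_var and prior_df are fixed, hypothetical numbers.

```python
import math
from statistics import mean, variance


def ordinary_t(log_ratios):
    """One-sample t-statistic for H0: mean log-ratio = 0."""
    n = len(log_ratios)
    return mean(log_ratios) / math.sqrt(variance(log_ratios) / n)


def moderated_t(log_ratios, prior_var, prior_df):
    """t-statistic using a variance shrunk toward prior_var."""
    n = len(log_ratios)
    df = n - 1
    shrunk_var = (prior_df * prior_var + df * variance(log_ratios)) / (prior_df + df)
    return mean(log_ratios) / math.sqrt(shrunk_var / n)


# A gene whose four replicate log-ratios happen to agree very closely
log_ratios = [0.9, 1.1, 1.0, 1.0]
t_plain = ordinary_t(log_ratios)
t_moderated = moderated_t(log_ratios, prior_var=0.1, prior_df=4)
print(round(t_plain, 2), round(t_moderated, 2))
```

The gene's tiny sample variance inflates the ordinary statistic; shrinking toward the prior gives a more conservative value for this gene.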
29
Results from testing
A set ID  Gene ID     v1-v2      p-value for (v1-v2)  q-value
1         MZ00040724  -4.69E-01  0.33691808           0.4012188
3         MZ00040748  1.01E-01   0.61046054           0.5306277
8         MZ00040802  -4.10E-01  0.18009214           0.2881755
9         MZ00013020  -4.96E-01  0.12907116           0.2438822
11        MZ00013044  -2.77E-01  0.26988092           0.3566803
12        MZ00013050  -7.81E-02  0.77596069           0.5895432
16        MZ00013098  -7.50E-02  0.73097085           0.5752585
18        MZ00000486  -5.16E-01  0.005203899          0.04976865
21        MZ00000528  3.69E-01   0.25837106           0.3488733
22        MZ00000534  4.98E-01   0.041544897          0.1337469
33        MZ00032020  1.98E-01   0.52396675           0.4961501
35        MZ00032044  -6.73E-01  0.000939694          0.02472483
37        MZ00032068  -5.98E-01  0.016160615          0.0844817
38        MZ00032074  -4.17E-01  0.27593771           0.3610925
40        MZ00032098  -1.88E-01  0.28042709           0.3641593
46        MZ00008134  2.11E-01   0.77894787           0.5905477
48        MZ00008158  8.70E-02   0.79905176           0.5954345
50        MZ00024806  1.01E-01   0.73992828           0.5788615
…         …           …          …                    …
30
2536 p-values are below 0.05.
If no genes were differentially expressed, we would expect around 0.05*40000 = 2000 p-values to be less than 0.05 by chance.
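The arithmetic behind this comparison is short enough to write out. The variable names are chosen for this sketch, and the last quantity is only a crude upper bound that assumes every gene is truly non-DE.

```python
n_genes = 40000          # number of tests, from the slide
alpha = 0.05             # p-value cut-off
n_significant = 2536     # p-values observed below the cut-off

# Expected number of p-values below alpha if every null hypothesis were true
expected_by_chance = alpha * n_genes

# Crude upper bound on the fraction of the 2536 calls that could be false positives
naive_false_fraction = expected_by_chance / n_significant

print(round(expected_by_chance), round(naive_false_fraction, 3))
```

So most of the 2536 genes with p < 0.05 could be false positives, which is why an error rate suited to multiple testing is needed.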
31
Possible Errors in Testing ONE gene
Hypothesis           Accept Null     Reject Null (sig)
True Null (non-DE)   correct         Type I Error
False Null (DE)      Type II Error   correct (Power)

• Type I Error: false positives
• Type II Error: false negatives (1 - power)
• Power: true positives
32
Error Rate in Multiple Testing
Outcomes when testing m genes (Benjamini and Hochberg, 1995)

Hypothesis           Accept Null   Reject Null   Total
True Null (Non-DE)   U             V             m0
False Null (DE)      T             S             m1
Total                W             R             m

Family-wise error rate: FWER = Pr(V > 0)
False discovery rate: FDR = E(V/R | R > 0) * Pr(R > 0)
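A minimal sketch of one way to control the FDR is the Benjamini-Hochberg step-up adjustment, written here in plain Python. The function name bh_adjust and the example p-values are made up for illustration. The q-values reported for Example I are sometimes smaller than the corresponding p-values, which a plain Benjamini-Hochberg adjustment never produces, so they were presumably computed with a different estimator.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices by increasing p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                         # largest p-value first
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min                        # enforce monotonicity
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
adj = bh_adjust(pvals)
significant = [i for i, q in enumerate(adj) if q <= 0.05]
print([round(q, 4) for q in adj], significant)
```

Declaring a gene significant when its adjusted value is at most 0.05 controls the FDR at 5% under the usual assumptions of the procedure.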
33
Clustering
• Grouping genes into different "clusters" based on their expression profiles
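As one possible illustration of grouping genes by expression profile, here is a tiny k-means sketch in plain Python. The slides do not say which clustering method was used, so k-means, the naive initialization and the toy profiles are all assumptions made for this example.

```python
def kmeans(profiles, k, n_iter=20):
    """Very small k-means: returns (labels, centers) for a list of profiles."""
    centers = [list(p) for p in profiles[:k]]     # naive init: first k profiles
    labels = [0] * len(profiles)
    for _ in range(n_iter):
        # Assignment step: each gene goes to the nearest center (squared distance)
        for idx, p in enumerate(profiles):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            labels[idx] = dists.index(min(dists))
        # Update step: each center becomes the mean profile of its members
        for j in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy expression profiles of four genes across three conditions
profiles = [
    [1.0, 2.0, 3.0],   # rising
    [3.0, 2.0, 1.0],   # falling
    [1.1, 2.1, 2.9],   # rising
    [2.9, 1.9, 1.2],   # falling
]
labels, centers = kmeans(profiles, k=2)
print(labels)
```

The two rising genes end up in one cluster and the two falling genes in the other.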
35
Other analyses
• Relating gene expression to biological functional categories → Gene Enrichment Test
• Connecting microarray data with other kinds of data, such as survival data
• More …
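A gene enrichment test is often carried out as a one-sided hypergeometric test: given how many genes belong to a functional category, how surprising is the number of them found among the selected (e.g., differentially expressed) genes? The sketch below is one such formulation; the function name and the toy counts are made up for illustration.

```python
from math import comb


def enrichment_pvalue(n_total, n_category, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn at random from n_total,
    of which n_category belong to the functional category."""
    upper = min(n_category, n_selected)
    total = comb(n_total, n_selected)
    tail = sum(
        comb(n_category, x) * comb(n_total - n_category, n_selected - x)
        for x in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy example: 10 genes, 4 in the category, 3 selected, 2 of those in the category
p_enrich = enrichment_pvalue(n_total=10, n_category=4, n_selected=3, n_overlap=2)
print(round(p_enrich, 4))
```

A small p-value suggests the category is over-represented among the selected genes.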
36
Assigned References
• Nettleton, D. (2006) A discussion of statistical methods for design and analysis of microarray experiments for plant scientists. The Plant Cell, 18, 2112–2121.
37