Affymetrix and Two-Color Arrays


Gene Expression Arrays
EPP 245
Statistical Analysis of
Laboratory Data
1
Basic Design of Expression Arrays
• For each gene that is a target for the array,
we have a known DNA sequence.
• mRNA is reverse transcribed to DNA, and
if a complementary sequence is on the
chip, the DNA will be more likely to stick
• The DNA is labeled with a dye that will
fluoresce and generate a signal that is
monotonic in the amount in the sample
November 9, 2006
EPP 245 Statistical Analysis of
Laboratory Data
2
Probe Sequence
[Figure: double-stranded DNA sequence with intron and exon regions marked]
TAAATCGATACGCATTAGTTCGACCTATCGAAGACCCAACACGGATTCGATACGTTAATATGACTACCTGCGCAACCCTAACGTCCATGTATCTAATACG
ATTTAGCTATGCGTAATCAAGCTGGATAGCTTCTGGGTTGTGCCTAAGCTATGCAATTATACTGATGGACGCGTTGGGATTGCAGGTACATAGATTATGC
• cDNA arrays use variable-length probes derived from
expressed sequence tags (ESTs)
– Spotted and almost always used with two-color methods
– Can be used in species with an unsequenced genome
• Long oligo arrays use 60-70mers
– Agilent two-color arrays
– Spotted arrays from UC Davis or elsewhere
– Usually use computationally derived probes but can use probes
from sequenced ESTs
• Affymetrix GeneChips use multiple 25-mers
– For each gene, one or more sets of 8-20 distinct
probes
– May overlap
– May cover more than one exon
• Affymetrix chips also use mismatch (MM) probes,
which have the same sequence as the perfect match
(PM) probes except for the middle base, which is
changed to inhibit binding.
• These are supposed to act as controls, but an MM
probe often binds to another mRNA species instead,
so many analysts do not use them
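As a small illustration of how an MM probe relates to its PM partner, the sketch below changes the middle base of a 25-mer. Replacing that base with its complement is the commonly described convention and is treated here as an assumption; the function name and the choice of probe (the first 25 bases of the example sequence shown earlier) are hypothetical.

```python
# Complement of each DNA base
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def mismatch_probe(pm_probe):
    """Return an MM partner of a PM probe by complementing its middle base."""
    if len(pm_probe) % 2 == 0:
        raise ValueError("probe length must be odd to have a single middle base")
    mid = len(pm_probe) // 2          # index 12 for a 25-mer (the 13th base)
    return pm_probe[:mid] + COMPLEMENT[pm_probe[mid]] + pm_probe[mid + 1:]


# Hypothetical 25-mer: the first 25 bases of the example sequence
pm = "TAAATCGATACGCATTAGTTCGACC"
mm = mismatch_probe(pm)
differences = [i for i in range(len(pm)) if pm[i] != mm[i]]
print(pm)
print(mm)
print("positions that differ:", differences)
```

The two probes differ at a single position, which is why an MM probe can still pick up signal from other transcripts.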
Probe Design
• A good probe sequence should match the
chosen gene or exon from a gene and
should not match any other gene in the
genome.
• Melting temperature depends on the GC
content and should be similar on all
probes on an array since the hybridization
must be conducted at a single
temperature.
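The GC-content criterion can be sketched as a simple window filter. This is only an illustration: the 40-60% range is a hypothetical threshold, the target is a fragment of the example sequence shown earlier, and the sketch does not check the other requirement, that a probe match no other gene in the genome.

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def candidate_probes(target, length=25, gc_min=0.4, gc_max=0.6):
    """Yield (start, probe, gc) for windows whose GC fraction is in range."""
    for start in range(len(target) - length + 1):
        probe = target[start:start + length]
        gc = gc_fraction(probe)
        if gc_min <= gc <= gc_max:
            yield start, probe, gc


# Fragment of the example sequence; thresholds above are hypothetical
target = "TAAATCGATACGCATTAGTTCGACCTATCGAAGACCCAACACGGATTCGATACG"
kept = list(candidate_probes(target))
for start, probe, gc in kept:
    print(start, probe, round(gc, 2))
```

Keeping GC content in a narrow band is a crude way of keeping melting temperatures similar across probes.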
• The affinity of a given piece of DNA for the
probe sequence can depend on many
things, including secondary and tertiary
structure as well as GC content.
• This means that the relationship between
the concentration of the RNA species in
the original sample and the brightness of
the spot on the array can be very different
for different probes for the same gene.
• Thus only comparisons of intensity within
the same probe across arrays make
sense.
Affymetrix GeneChips
• For each probe set, there are 8-20 perfect
match (PM) probes which may overlap or
not and which target the same gene
• There are also mismatch (MM) probes
which are supposed to serve as a control,
but do so rather badly
• Most of us ignore the MM probes
Expression Indices
• A key issue with Affymetrix chips is how to
summarize the multiple data values on a
chip for each probe set (aka gene).
• There have been a large number of
suggested methods.
• Generally, the worst ones are those from
Affymetrix, by a long way; "worse" here means
less able to detect real differences
Usable Methods
• Li and Wong’s dChip and follow-on work
is demonstrably better than MAS 4.0 and
MAS 5.0, but not as good as RMA and
GLA
• ArrayAssist can use dChip, RMA, gcRMA,
and others.
• The GLA method (Durbin, Rocke, Zhou)
can be imported into ArrayAssist.
Steps in Expression Index
Construction
• Background correction is the process of
adjusting the signals so that the zero point
is similar on all parts of all arrays.
• We like to manage this so that zero signal
after background correction corresponds
approximately to zero amount of the
mRNA species that is the target of the
probe set.
• Data transformation is the process of
changing the scale of the data so that it is
more comparable from high to low.
• Common transformations are the
logarithm and generalized logarithm
• Normalization is the process of adjusting
for systematic differences from one array
to another.
• Normalization may be done before or after
transformation, and before or after probe
set summarization.
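To make the generalized logarithm concrete, here is a minimal sketch. The parameterization ln((x + sqrt(x^2 + lambda)) / 2) is one common form and is an assumption here, as is the value of lambda, which in practice would be estimated from the data.

```python
import math


def glog(x, lam):
    """Generalized log: ln((x + sqrt(x^2 + lam)) / 2), defined for any real x."""
    return math.log((x + math.sqrt(x * x + lam)) / 2.0)


lam = 100.0  # hypothetical value; in practice estimated from the data
for x in (-5.0, 0.0, 10.0, 1000.0, 100000.0):
    ln_x = math.log(x) if x > 0 else float("nan")
    print(f"x={x:>9}  glog={glog(x, lam):8.4f}  ln={ln_x:8.4f}")
```

Unlike the ordinary logarithm, the glog is defined at zero and at negative background-corrected values, and it approaches ln(x) as x grows.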
• One may use only the perfect match (PM)
probes, or may subtract or otherwise use
the mismatch (MM) probes
• There are many ways to summarize 20
PM probes and 20 MM probes on 10
arrays (total of 400 numbers) into 10
expression index numbers
The RMA Method
• Background correction that does not make
0 signal correspond to 0 amount
• Quantile normalization makes the overall
distribution of intensity values across
probes the same on each array
• Log2 transform
• Median polish summary of PM probes
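The quantile normalization step can be sketched in a few lines. This is a simplified illustration on toy data, not the RMA implementation: ties are broken arbitrarily, and the function name and data are hypothetical.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length arrays (one list per array)."""
    n = len(arrays[0])
    # Reference distribution: mean of the k-th smallest value across arrays
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])   # indices, smallest first
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]                   # replace value by reference quantile
        normalized.append(out)
    return normalized


# Toy data: three arrays, four probes each
arrays = [[5.0, 2.0, 3.0, 4.0],
          [4.0, 1.0, 4.5, 2.0],
          [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(arrays)
for row in normalized:
    print([round(v, 3) for v in row])
```

After this step every array contains exactly the same set of values; only the assignment of values to probes differs between arrays.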
Analysis by means
• Remove row means
• Remove column means
• Rows and columns have mean 0
• Influence of an outlier spreads

Original data (row means at right, column means below):
 4.00   6.00   5.00 |  5.00
 8.00   9.00   7.00 |  8.00
12.00  24.00  12.00 | 16.00
 8.00  13.00   8.00

After removing row means (column means below):
-1.00   1.00   0.00
 0.00   1.00  -1.00
-4.00   8.00  -4.00
-1.67   3.33  -1.67

After removing column means (all row and column means are 0.00):
 0.67  -2.33   1.67
 1.67  -2.33   0.67
-2.33   4.67  -2.33
Median Polish
• Remove row medians
• Remove column medians
• Rows and columns may not have median 0
• Outliers contained
• May have to be iterated

Original data (row medians at right, column medians below):
 4.00   6.00   5.00 |  5.00
 8.00   9.00   7.00 |  8.00
12.00  24.00  12.00 | 12.00
 8.00   9.00   7.00

After removing row medians (row medians at right, column medians below):
-1.00   1.00   0.00 |  0.00
 0.00   1.00  -1.00 |  0.00
 0.00  12.00   0.00 |  0.00
 0.00   1.00   0.00

After removing column medians (row medians at right, column medians below):
-1.00   0.00   0.00 |  0.00
 0.00   0.00  -1.00 |  0.00
 0.00  11.00   0.00 |  0.00
 0.00   0.00   0.00
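Both sweeps can be reproduced on the 3 x 3 example with a short sketch. The helper names are hypothetical, and only the residuals are returned; a full median polish summary would also keep track of the row and column effects.

```python
from statistics import mean, median


def sweep(matrix, center):
    """Subtract row centers, then column centers, once; return the residuals."""
    rows = [[v - center(r) for v in r] for r in matrix]
    col_centers = [center(col) for col in zip(*rows)]
    return [[v - c for v, c in zip(r, col_centers)] for r in rows]


def median_polish(matrix, max_iter=10):
    """Repeat the median sweep until every row and column median is 0."""
    resid = [list(r) for r in matrix]
    for _ in range(max_iter):
        resid = sweep(resid, median)
        rows_done = all(median(r) == 0 for r in resid)
        cols_done = all(median(c) == 0 for c in zip(*resid))
        if rows_done and cols_done:
            break
    return resid


data = [[4, 6, 5],
        [8, 9, 7],
        [12, 24, 12]]

by_means = sweep(data, mean)        # the outlier (24) leaks into every cell
by_medians = median_polish(data)    # the outlier stays in a single cell
for row in by_means:
    print([round(v, 2) for v in row])
for row in by_medians:
    print(row)
```

With means, the outlying value 24 changes every residual; with medians, it is confined to one large residual.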
Example Probe Set
• Using the Affy HG U133 Plus 2.0
GeneChip with 54675 probe sets, from
604258 PM probes.
• Four chips derived from human skin exposed to
ionizing radiation (IR) at 0, 1, 10, and 100 cGy
• Probe set number 10067/54675 has Affy
ID 200618_at
• Gene is LASP1, LIM and SH3 protein 1,
LIM protein subfamily, Src homology, actin
binding.
Mean Summarization
Probe             0       1      10     100    Mean
200618_at1      360     216     158     198   233.0
200618_at2      313     402     106     103   231.0
200618_at3      130     182      79      91   120.5
200618_at4      351     370     195     136   263.0
200618_at5      164     130      98     107   124.8
200618_at6      223     219     164     196   200.5
200618_at7      437     529     195     158   329.8
200618_at8      509     554     274     128   366.3
200618_at9      522     720     285     198   431.3
200618_at10     668     715     247     260   472.5
200618_at11     306     286     144     159   223.8
Mean          362.1   393.0   176.8   157.6
Mean Summarization of the Logs
(values are base-10 logs of the PM intensities)
Probe             0       1      10     100    Mean
200618_at1     2.56    2.33    2.20    2.30    2.35
200618_at2     2.50    2.60    2.03    2.01    2.28
200618_at3     2.11    2.26    1.90    1.96    2.06
200618_at4     2.55    2.57    2.29    2.13    2.38
200618_at5     2.21    2.11    1.99    2.03    2.09
200618_at6     2.35    2.34    2.21    2.29    2.30
200618_at7     2.64    2.72    2.29    2.20    2.46
200618_at8     2.71    2.74    2.44    2.11    2.50
200618_at9     2.72    2.86    2.45    2.30    2.58
200618_at10    2.82    2.85    2.39    2.41    2.62
200618_at11    2.49    2.46    2.16    2.20    2.33
Mean           2.51    2.53    2.21    2.18
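The two summaries can be computed side by side. The sketch below uses only the first three probes of 200618_at from the tables, so its per-array summaries will not match the 11-probe column means; the per-probe means can be checked against the tables.

```python
import math

doses_cgy = [0, 1, 10, 100]
# PM intensities for the first three probes of 200618_at
pm = {
    "200618_at1": [360, 216, 158, 198],
    "200618_at2": [313, 402, 106, 103],
    "200618_at3": [130, 182, 79, 91],
}

# Per-probe means across the four arrays (last column of the tables)
probe_mean = {name: sum(v) / len(v) for name, v in pm.items()}
probe_log_mean = {name: sum(math.log10(x) for x in v) / len(v) for name, v in pm.items()}

# Per-array summaries: average over probes within each array
columns = list(zip(*pm.values()))
array_mean = [sum(col) / len(col) for col in columns]
array_log_mean = [sum(math.log10(x) for x in col) / len(col) for col in columns]

for dose, m, ml in zip(doses_cgy, array_mean, array_log_mean):
    print(f"{dose:>4} cGy  mean={m:7.1f}  mean of log10={ml:5.2f}")
```

Averaging on the log scale keeps a single bright probe from dominating the summary.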
The GLA Method
• The Glog Average (GLA) method is simpler than
the RMA method, though it can require
estimation of a parameter
• Background correction is intended to make a
measured value of zero correspond to a zero
quantity in the sample
• Transformation uses the glog, which behaves
like ln for large values
• Normalization via lowess
• Summary is a simple average of PM probes
Probe Sets not Genes
• It is unavoidable to refer to a probe set as
measuring a “gene”, but nevertheless it can be
deceptive
• The annotation of a probe set may be based on
homology with a gene of possibly known
function in a different organism
• Only relatively few probe sets correspond to
genes with known function and known structure
in the organism being studied
Two-Color Arrays
• Two-color arrays are designed to account
for variability in slides and spots by using
two samples on each slide, each labeled
with a different dye.
• If a spot is too large, for example, both
signals will be too big, and the difference
or ratio will eliminate that source of
variability
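A short numeric sketch shows why the ratio removes a spot-size effect. The signal values and the multiplicative spot factor are hypothetical; the assumption is that spot size scales both channels by the same factor.

```python
import math


def log2_ratio(red, green):
    """Log2 of the red/green intensity ratio for one spot."""
    return math.log2(red / green)


# Hypothetical signals for one gene in the two samples on a slide
red_true, green_true = 800.0, 400.0

ratios = []
for spot_factor in (0.5, 1.0, 2.0):       # small, nominal, and large spot
    red = red_true * spot_factor           # spot size scales both channels
    green = green_true * spot_factor
    ratios.append(log2_ratio(red, green))
    print(f"spot factor {spot_factor}: red={red}, green={green}, log2 ratio={ratios[-1]}")
```

The individual intensities change fourfold across the three spots, but the log ratio is the same for all of them.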
Dyes
• The most common dye pair is Cy3 (green)
and Cy5 (red), which absorb maximally at
approximately 550 nm and 649 nm and emit
near 570 nm and 670 nm respectively (red
light ~ 700 nm, green light ~ 550 nm)
• The dyes are excited with lasers at 532
nm (Cy3 green) and 635 nm (Cy5 red)
• The emissions are read via filters using a
CCD device
File Format
• A slide scanned with Axon GenePix
produces a file with extension .gpr that
contains the results:
http://www.axon.com/gn_GenePix_File_Formats.html
• This contains 29 rows of headers followed
by 43 columns of data (in our example
files)
• For full analysis one may also need a .gal
file that describes the layout of the arrays
"Block"
"Column"
"Row"
"Name"
"ID"
"X"
"Y"
"Dia."
"F635 Median"
"F635 Mean"
"F635 SD"
"B635 Median"
"B635 Mean"
"B635 SD"
"% > B635+1SD"
"% > B635+2SD"
"F635 % Sat."
"F532 Median"
"F532 Mean"
"F532 SD"
"B532 Median"
"B532 Mean"
"B532 SD"
"% > B532+1SD"
"% > B532+2SD"
"F532 % Sat."
"Ratio of Medians (635/532)"
"Ratio of Means (635/532)"
"Median of Ratios (635/532)"
"Mean of Ratios (635/532)"
"Ratios SD (635/532)"
"Rgn Ratio (635/532)"
"Rgn R² (635/532)"
"F Pixels"
"B Pixels"
"Sum of Medians"
"Sum of Means"
"Log Ratio (635/532)"
"F635 Median - B635"
"F532 Median - B532"
"F635 Mean - B635"
"F532 Mean - B532"
"Flags"
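A .gpr file is tab-delimited text, so the data block can be read with a generic reader. The sketch below does not rely on a fixed header length; it assumes only that the column-name row is the first row whose first field is "Block", as in the column list above. The sample content is a tiny synthetic stand-in, not a real file.

```python
import csv
import io


def read_gpr(handle):
    """Return the data rows of a GenePix .gpr file as a list of dicts."""
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        if row and row[0] == "Block":      # column-name row ends the header
            columns = row
            break
    else:
        raise ValueError("no column-name row starting with 'Block' found")
    return [dict(zip(columns, row)) for row in reader if row]


# Synthetic, heavily shortened example (real files here have 43 columns)
sample = (
    "ATF\t1.0\n"
    "2\t5\n"
    "\"Type=GenePix Results\"\n"
    "\"Scanner=hypothetical\"\n"
    "\"Block\"\t\"Column\"\t\"Row\"\t\"F635 Median\"\t\"B635 Median\"\n"
    "1\t1\t1\t48\t34\n"
)
spots = read_gpr(io.StringIO(sample))
# Background-corrected red-channel median for each spot
corrected = [int(s["F635 Median"]) - int(s["B635 Median"]) for s in spots]
print(corrected)
```

For a real file, the same function could be called on `open(path, newline="")`, where `path` is a placeholder for the file location.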
Analysis Choices
• Mean or median foreground intensity
• Background corrected or not
• Log transform (base 2, e, or 10) or glog
transform
• Log is compatible only with no background
correction
• Glog is best with background correction
DDR1
Block      1
Column     1
Row        1
Name/ID    NM_006182, discoidin domain receptor family, member
X          2575
Y          2565
Dia.       85

Quantity        635 (red)   532 (green)
F Median           48          109
F Mean             54          113
F SD               23           26
B Median           34           35
B Mean             36           36
B SD               11            7
% > B+1SD          52          100
% > B+2SD          36          100
F % Sat.            0            0
Issues with Two-Color Arrays
• Chips have different overall intensities, so
normalization across chips is needed.
• The overall intensity on the red channel
may be greater or less than on the green
channel, so normalization across dyes is
needed.
• The red/green difference can be
different at different intensity levels
Array normalization
• Array normalization is meant to increase
the precision of comparisons by adjusting
for variations that cover entire arrays
• Without normalization, the analysis would
be valid, but possibly less sensitive
• However, a poor normalization method will
be worse than none at all.
Possible normalization methods
• We can equalize the mean or median
intensity by adding or multiplying a
correction term
• We can use different normalizations at
different intensity levels (intensity-based
normalization) for example by lowess or
quantiles
• We can normalize for other things such as
print tips
Example for Normalization
               Group 1             Group 2
          Array 1  Array 2    Array 3  Array 4
Gene 1       1100      900        425      550
Gene 2        110       95         85      110
Gene 3         80       65         55       80
. list Array Group Gene Expression

     +---------------------------------+
     | Array   Group   Gene   Expres~n |
     |---------------------------------|
  1. |     1       1      1       1100 |
  2. |     2       1      1        900 |
  3. |     3       2      1        425 |
  4. |     4       2      1        550 |
  5. |     1       1      2        110 |
     |---------------------------------|
  6. |     2       1      2         95 |
  7. |     3       2      2         85 |
  8. |     4       2      2        110 |
  9. |     1       1      3         80 |
 10. |     2       1      3         65 |
     |---------------------------------|
 11. |     3       2      3         55 |
 12. |     4       2      3         80 |
     +---------------------------------+
. sort Gene
. by Gene: anova Expression Group

-> Gene = 1
         Number of obs =       4     R-squared     =  0.9042
         Root MSE      = 117.925     Adj R-squared =  0.8564

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |   262656.25     1   262656.25      18.89     0.0491
             |
       Group |   262656.25     1   262656.25      18.89     0.0491
             |
    Residual |     27812.5     2    13906.25
  -----------+----------------------------------------------------
       Total |   290468.75     3  96822.9167

-> Gene = 2
         Number of obs =       4     R-squared     =  0.0556
         Root MSE      = 14.5774     Adj R-squared = -0.4167

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |          25     1          25       0.12     0.7643
             |
       Group |          25     1          25       0.12     0.7643
             |
    Residual |         425     2       212.5
  -----------+----------------------------------------------------
       Total |         450     3         150

-> Gene = 3
         Number of obs =       4     R-squared     =  0.0556
         Root MSE      = 14.5774     Adj R-squared = -0.4167

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |          25     1          25       0.12     0.7643
             |
       Group |          25     1          25       0.12     0.7643
             |
    Residual |         425     2       212.5
  -----------+----------------------------------------------------
       Total |         450     3         150
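The F statistics and p-values in these tables can be checked by hand. The sketch below is in Python rather than Stata and is restricted to this design (two groups, two residual degrees of freedom), where the upper tail of the F(1, 2) distribution has the closed form 1 - sqrt(F / (F + 2)); the function name is hypothetical.

```python
import math


def one_way_f(group_a, group_b):
    """F statistic and p-value for a two-group one-way ANOVA with 2 residual df."""
    groups = (group_a, group_b)
    values = list(group_a) + list(group_b)
    grand = sum(values) / len(values)
    ss_model = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_resid = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_resid = len(values) - len(groups)
    if df_resid != 2:
        raise ValueError("the closed-form p-value below needs exactly 2 residual df")
    f = ss_model / (ss_resid / df_resid)
    p = 1.0 - math.sqrt(f / (f + 2.0))    # upper tail of F(1, 2)
    return f, p


# Example data: gene -> (Group 1 arrays, Group 2 arrays)
expression = {1: ([1100, 900], [425, 550]),
              2: ([110, 95], [85, 110]),
              3: ([80, 65], [55, 80])}
results = {gene: one_way_f(a, b) for gene, (a, b) in expression.items()}
for gene, (f, p) in results.items():
    print(f"Gene {gene}: F = {f:.2f}, Prob > F = {p:.4f}")
```

For larger designs the p-value would come from a general F distribution routine instead of this special case.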
Additive Normalization by Means
               Group 1             Group 2
          Array 1  Array 2    Array 3  Array 4
Gene 1        975      851        541      608
Gene 2        -15       46        201      168
Gene 3        -45       16        171      138
. mean Expression
. ereturn list

scalars:
               e(df_r) =  11
             e(N_over) =  1
                  e(N) =  12
               e(k_eq) =  1
            e(k_eform) =  0

macros:
                e(cmd) : "mean"
              e(title) : "Mean estimation"
          e(estat_cmd) : "estat_vce_only"
            e(varlist) : "Expression"
            e(predict) : "_no_predict"
         e(properties) : "b V"

matrices:
                  e(b) :  1 x 1
                  e(V) :  1 x 1
                 e(_N) :  1 x 1
              e(error) :  1 x 1

functions:
             e(sample)

. matrix ExpMeanMat = e(b)
. matlist ExpMeanMat

             | Express~n
-------------+----------
          y1 |  304.5833

. scalar ExpMean = ExpMeanMat[1,1]
. display ExpMean
304.58333

. anova Expression Array
. predict ArrayMean
. generate NormExp1=Expression-ArrayMean+ExpMean
. list Array Group Gene Expression ArrayMean NormExp1

     +--------------------------------------------------------+
     | Array   Group   Gene   Expres~n   ArrayM~n    NormExp1 |
     |--------------------------------------------------------|
  1. |     1       1      1       1100        430    974.5833 |
  2. |     2       1      1        900   353.3333      851.25 |
  3. |     3       2      1        425   188.3333      541.25 |
  4. |     4       2      1        550   246.6667    607.9167 |
  5. |     1       1      2        110        430   -15.41667 |
     |--------------------------------------------------------|
  6. |     2       1      2         95   353.3333    46.24999 |
  7. |     3       2      2         85   188.3333      201.25 |
  8. |     4       2      2        110   246.6667    167.9167 |
  9. |     1       1      3         80        430   -45.41667 |
 10. |     2       1      3         65   353.3333    16.24999 |
     |--------------------------------------------------------|
 11. |     3       2      3         55   188.3333      171.25 |
 12. |     4       2      3         80   246.6667    137.9167 |
     +--------------------------------------------------------+
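The additive normalization can be reproduced outside Stata. The following Python sketch recomputes the ArrayMean and NormExp1 columns from the example data; the variable names are hypothetical.

```python
# Rows are genes 1-3, columns are arrays 1-4 (the example data)
expression = [[1100, 900, 425, 550],
              [110, 95, 85, 110],
              [80, 65, 55, 80]]

n_genes = len(expression)
n_arrays = len(expression[0])
array_means = [sum(row[j] for row in expression) / n_genes for j in range(n_arrays)]
# Arrays have equal size, so the mean of array means is the overall mean
grand_mean = sum(array_means) / n_arrays

# Subtract each array's mean, then add back the grand mean
norm_additive = [[row[j] - array_means[j] + grand_mean for j in range(n_arrays)]
                 for row in expression]

for gene, row in enumerate(norm_additive, start=1):
    print(f"Gene {gene}:", [round(v, 4) for v in row])
```

After this step every array has the same mean, but low-expressed genes can end up with negative values, as in the table.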
. by Gene: anova NormExp1 Group

-> Gene = 1
         Number of obs =       4     R-squared     =  0.9209
         Root MSE      = 70.0991     Adj R-squared =  0.8814

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |  114469.431     1  114469.431      23.30     0.0403
             |
       Group |  114469.431     1  114469.431      23.30     0.0403
             |
    Residual |  9827.77662     2  4913.88831
  -----------+----------------------------------------------------
       Total |  124297.207     3  41432.4024

-> Gene = 2
         Number of obs =       4     R-squared     =  0.9209
         Root MSE      = 35.0496     Adj R-squared =  0.8814

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |  28617.3614     1  28617.3614      23.30     0.0403
             |
       Group |  28617.3614     1  28617.3614      23.30     0.0403
             |
    Residual |   2456.9441     2  1228.47205
  -----------+----------------------------------------------------
       Total |  31074.3055     3  10358.1018

-> Gene = 3
         Number of obs =       4     R-squared     =  0.9209
         Root MSE      = 35.0496     Adj R-squared =  0.8814

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |  28617.3612     1  28617.3612      23.30     0.0403
             |
       Group |  28617.3612     1  28617.3612      23.30     0.0403
             |
    Residual |  2456.94427     2  1228.47214
  -----------+----------------------------------------------------
       Total |  31074.3055     3  10358.1018
Multiplicative Normalization by Means
               Group 1             Group 2
          Array 1  Array 2    Array 3  Array 4
Gene 1        779      776        687      679
Gene 2         78       82        137      136
Gene 3         57       56         89       99
. generate NormExp2 = Expression*ExpMean/ArrayMean
. list Array Group Gene Expression ArrayMean NormExp2

     +-------------------------------------------------------+
     | Array   Group   Gene   Expres~n   ArrayM~n   NormExp2 |
     |-------------------------------------------------------|
  1. |     1       1      1       1100        430   779.1667 |
  2. |     2       1      1        900   353.3333   775.8254 |
  3. |     3       2      1        425   188.3333   687.3341 |
  4. |     4       2      1        550   246.6667   679.1385 |
  5. |     1       1      2        110        430   77.91666 |
     |-------------------------------------------------------|
  6. |     2       1      2         95   353.3333   81.89268 |
  7. |     3       2      2         85   188.3333   137.4668 |
  8. |     4       2      2        110   246.6667   135.8277 |
  9. |     1       1      3         80        430   56.66667 |
 10. |     2       1      3         65   353.3333   56.03184 |
     |-------------------------------------------------------|
 11. |     3       2      3         55   188.3333   88.94912 |
 12. |     4       2      3         80   246.6667   98.78378 |
     +-------------------------------------------------------+
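The multiplicative version differs from the additive one only in how the array mean is applied. This Python sketch recomputes the NormExp2 column from the example data; the variable names are hypothetical.

```python
# Rows are genes 1-3, columns are arrays 1-4 (the example data)
expression = [[1100, 900, 425, 550],
              [110, 95, 85, 110],
              [80, 65, 55, 80]]

n_genes = len(expression)
n_arrays = len(expression[0])
array_means = [sum(row[j] for row in expression) / n_genes for j in range(n_arrays)]
grand_mean = sum(array_means) / n_arrays

# Scale each array so that its mean equals the grand mean
norm_mult = [[row[j] * grand_mean / array_means[j] for j in range(n_arrays)]
             for row in expression]

for gene, row in enumerate(norm_mult, start=1):
    print(f"Gene {gene}:", [round(v, 4) for v in row])
```

Scaling keeps all values positive and preserves ratios within an array, which additive shifting does not.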
. by Gene: anova NormExp2 Group

-> Gene = 1
      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |  8884.90342     1  8884.90342     453.70     0.0022

-> Gene = 2
      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |  3219.72043     1  3219.72043     696.33     0.0014

-> Gene = 3
      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |  1407.54019     1  1407.54019      57.97     0.0168
Multiplicative Normalization by Medians
               Group 1             Group 2
          Array 1  Array 2    Array 3  Array 4
Gene 1       1025      971        512      512
Gene 2        102      102        102      102
Gene 3         75       70         66       75
. sort Array
. table Array, contents(p50 Expression)

------------------------
    Array | med(Expres~n)
----------+-------------
        1 |          110
        2 |           95
        3 |           85
        4 |          110
------------------------

. input ArrayMed

      ArrayMed
  1.  110
  2.  110
  3.  110
  4.  95
  5.  95
  6.  95
  7.  85
  8.  85
  9.  85
 10.  110
 11.  110
 12.  110
. summarize Expression, detail

                         Expression
-------------------------------------------------------------
      Percentiles      Smallest
 1%           55             55
 5%           55             65
10%           65             80       Obs                  12
25%           80             80       Sum of Wgt.          12

50%        102.5                      Mean           304.5833
                        Largest       Std. Dev.      363.1144
75%        487.5            425
90%          900            550       Variance       131852.1
95%         1100            900       Skewness       1.277954
99%         1100           1100       Kurtosis       3.132949

. generate NormExp3 = Expression*102.5/ArrayMed
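The median-based scaling can be checked the same way. This Python sketch recomputes the array medians, the overall median of 102.5, and the NormExp3 values from the example data; the variable names are hypothetical.

```python
from statistics import median

# Rows are genes 1-3, columns are arrays 1-4 (the example data)
expression = [[1100, 900, 425, 550],
              [110, 95, 85, 110],
              [80, 65, 55, 80]]

n_arrays = len(expression[0])
array_medians = [median([row[j] for row in expression]) for j in range(n_arrays)]
overall_median = median([v for row in expression for v in row])

# Scale each array so that its median equals the overall median
norm_median = [[row[j] * overall_median / array_medians[j] for j in range(n_arrays)]
               for row in expression]

print(array_medians, overall_median)
for gene, row in enumerate(norm_median, start=1):
    print(f"Gene {gene}:", [round(v, 2) for v in row])
```

Because gene 2 is the median gene on every array, it is forced to the same value everywhere, which is why its ANOVA has a model sum of squares of zero.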
-> Gene = 1
      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |  235735.794     1  235735.794     324.00     0.0031

-> Gene = 2
      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |           0     1           0

-> Gene = 3
      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |   3.6253006     1   3.6253006       0.17     0.7228
Intensity-based normalization
• Normalize by means, medians, etc., but do
so only in groups of genes with similar
expression levels.
• lowess is a procedure that produces a
running estimate of the middle, like a
robustified mean
• If we subtract the lowess fit of each array
and add the average of the lowess fits, we
get the lowess normalization
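The lowess normalization described above can be sketched as follows. Several choices here are assumptions rather than the method used in the course: each array is smoothed against the per-gene average across arrays, the smoother is a plain tricube-weighted local linear fit without the robustness iterations of a full lowess, and the span of 0.67 and all names are hypothetical.

```python
import math


def lowess(x, y, frac=0.67):
    """Locally weighted linear fit of y on x (tricube weights, no robustness step)."""
    n = len(x)
    k = max(3, math.ceil(frac * n))             # points in each local window
    fitted = []
    for x0 in x:
        dist = [abs(xi - x0) for xi in x]
        h = sorted(dist)[min(k, n) - 1] or 1.0   # window half-width
        w = [(1 - min(d / h, 1.0) ** 3) ** 3 for d in dist]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def lowess_normalize(arrays):
    """Subtract each array's lowess curve and add back the average curve."""
    n_genes = len(arrays[0])
    gene_mean = [sum(a[g] for a in arrays) / len(arrays) for g in range(n_genes)]
    fits = [lowess(gene_mean, a) for a in arrays]
    avg_fit = [sum(f[g] for f in fits) / len(fits) for g in range(n_genes)]
    return [[a[g] - f[g] + avg_fit[g] for g in range(n_genes)]
            for a, f in zip(arrays, fits)]


# Toy log-scale data: array 2 is array 1 shifted up by 2 at every intensity
array1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
array2 = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
norm1, norm2 = lowess_normalize([array1, array2])
print([round(v, 6) for v in norm1])
print([round(v, 6) for v in norm2])
```

In this toy case the systematic shift between the two arrays is removed entirely; with real data the correction varies smoothly with intensity.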
Fitting a model to genes
• We can fit a model to the data of each
gene after the whole arrays have been
background corrected, transformed, and
normalized
• Each gene is then tested for whether there is
differential expression
Multiplicity Adjustments
• If we test thousands of genes and pick all
the ones which are significant at the 5%
level, we will get hundreds of false
positives.
• Multiplicity adjustments winnow this down
so that the number of false positives is
smaller
Types of Multiplicity Adjustments
• The Bonferroni correction aims to detect
no significant genes at all if there are truly
none, and guarantees that the chance that
any will be detected is less than .05 under
these conditions
• Generally, this is too conservative
• Less conservative versions include
methods due to Holm, Hochberg, and
Benjamini and Hochberg (FDR)
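As an illustration, three of these adjustments can be written as adjusted p-values, so that a gene is called significant when its adjusted value is below the chosen level. The p-values and the gene count of 10,000 are hypothetical, and Hochberg's step-up method is omitted to keep the sketch short.

```python
def bonferroni(p):
    """Multiply every p-value by the number of tests (capped at 1)."""
    m = len(p)
    return [min(1.0, pi * m) for pi in p]


def holm(p):
    """Holm step-down adjusted p-values."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i])          # smallest first
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, min(1.0, (m - rank) * p[i]))
        adjusted[i] = running
    return adjusted


def benjamini_hochberg(p):
    """Benjamini-Hochberg (FDR) adjusted p-values."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i], reverse=True)   # largest first
    adjusted = [0.0] * m
    running = 1.0
    for k, i in enumerate(order):
        rank = m - k                                      # 1-based rank of p[i]
        running = min(running, p[i] * m / rank)
        adjusted[i] = running
    return adjusted


# With no true differences, 5% of tests are expected to be false positives
expected_false_positives = 10000 * 0.05                   # hypothetical 10,000 genes

p_values = [0.01, 0.04, 0.03, 0.005]                      # hypothetical p-values
print(expected_false_positives)
print(bonferroni(p_values))
print(holm(p_values))
print(benjamini_hochberg(p_values))
```

On the same p-values, the Bonferroni values are the largest and the FDR values the smallest, which reflects the ordering from most to least conservative.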