Lecture 1/13/05


Genome-wide analysis
• We calculated a t-test for 30,000 genes at once
• How do we handle and present the data and the results?
• Normalization of the data as a means of removing biases and reducing experimental variability
• Two basic questions in the normalization process:
  • Are we attenuating the signal?
  • Are we compromising the independence of our measurements?
• Outliers – part of quality control
• We can exclude an observation if we can identify a physical reason for doing so (e.g. a scratch on the slide)
• Such physical problems are usually "flagged" in the process of quantifying fluorescence intensities
• The question of excluding a whole array from the analysis is particularly tricky – we will discuss it further later
1-11-2005
1
Randomization Issue
The Problem:
Identify genes whose expression in a target organ (Lung) of a model organism (Rat)
is affected by an environmental toxicant (W)
Population:
All model organisms of this type (Rats)
Sample:
12 randomly selected rats from the population of all rats. ("Randomly" means that all rats in the population have an equal chance of being selected.)
Randomization:
Randomly select 6 rats to be treated by the toxicant. "Randomly" is the key word here: it is what allows us to ascribe observed changes to the treatment alone.
Prepare samples and extract RNA from all 12 rats
Randomly assign labeled RNA to different microarrays
Process microarrays in a random order
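The randomization steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the rat identifiers, the helper name `randomize_experiment`, and the seed are made up for the example and are not part of the lecture.

```python
import random

def randomize_experiment(rat_ids, n_treated=6, seed=None):
    """Randomly pick treated rats, then randomize array assignment
    and the order in which arrays are processed."""
    rng = random.Random(seed)
    # Every rat has the same chance of ending up in the treated group (W).
    treated = set(rng.sample(rat_ids, n_treated))
    groups = {rat: ("W" if rat in treated else "C") for rat in rat_ids}
    # Randomly assign each sample to one microarray.
    arrays = list(range(1, len(rat_ids) + 1))
    rng.shuffle(arrays)
    array_of = dict(zip(rat_ids, arrays))
    # Process the microarrays in a random order.
    order = list(range(1, len(rat_ids) + 1))
    rng.shuffle(order)
    return groups, array_of, order

rats = [f"rat{i}" for i in range(1, 13)]  # hypothetical identifiers
groups, array_of, order = randomize_experiment(rats, seed=2005)
```

Fixing the seed only makes the example reproducible; any seed gives a valid randomization.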
Single Channel Microarrays – Each Sample Assigned to a Different Microarray
• 12 microarrays, 12 samples (C1,...,C6,W1,...,W6)
• Randomly assign samples to different microarrays
• In terms of a single gene, 12 different "spots"
[Figure: twelve microarrays, one per sample (W1–W6, C1–C6). Labels on each array: Scanning the “Green Channel” ($X_G$); Scanning the “Red Channel” ($X_R$); $\log(X_R) - \log(X_G)$]
•Proceed with a two-sample t-test as we did so far
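For a single gene, the two-sample t-test on the 12 spots can be sketched as below. The log-intensity values are hypothetical, and the pooled-variance form is an assumption chosen to match the $t_{2n-2}$ reference distribution used later in the lecture.

```python
import math
from statistics import mean, variance

def two_sample_t(x1, x2):
    """Pooled-variance two-sample t-statistic and its degrees of freedom."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (mean(x2) - mean(x1)) / se, n1 + n2 - 2

# Hypothetical log-intensities for one gene (illustrative values only).
control = [8.1, 7.9, 8.4, 8.0, 8.2, 7.8]
treated = [9.0, 8.7, 9.3, 8.8, 9.1, 8.9]
t_stat, df = two_sample_t(control, treated)
```

In a genome-wide analysis the same function would be applied gene by gene.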
Two-Channel Microarrays – One C and One W Sample Assigned to Each Microarray
• 6 microarrays, 12 samples (C1,...,C6,W1,...,W6)
• Randomly select pairs and assign them to different microarrays
• In terms of a single gene, 6 different "spots"
[Figure: six two-channel microarrays carrying the sample pairs W3/C5, W6/C1, W2/C2, W5/C6, W4/C4, and W1/C3. Labels on each array: Scanning the “Green Channel” ($X_G$); Scanning the “Red Channel” ($X_R$); $\log(X_R) - \log(X_G)$]
•Individual samples are no longer "free" to be assigned to any microarray –
restriction on the randomization process
•Measurements are "blocked" within a microarray (terminology)
•We could still randomly assign samples and not have treatment and the control on
each microarray, but this would be unreasonable (arguments to come)
•Need to use a paired t-test
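The restricted randomization described above, where each microarray receives exactly one C and one W sample, might look like this. It is a sketch only; the helper name `pair_samples` and the seed are assumptions for illustration.

```python
import random

def pair_samples(controls, treated, seed=None):
    """Randomly pair one treated and one control sample per microarray."""
    rng = random.Random(seed)
    c, w = list(controls), list(treated)
    # Shuffling each group separately keeps one C and one W on every array.
    rng.shuffle(c)
    rng.shuffle(w)
    return {array: pair for array, pair in enumerate(zip(w, c), start=1)}

controls = [f"C{i}" for i in range(1, 7)]
treated = [f"W{i}" for i in range(1, 7)]
design = pair_samples(controls, treated, seed=1)
```

Each array is a block: the two measurements on it share array-specific effects, which is why the analysis works with within-array differences.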
Paired t-test
• For a specific gene, $r_i = x_{iw} - x_{ic}$ = $i$th difference, $i = 1, \dots, 6$
• Statistical model of observed data: $r_i \sim N(\mu, \sigma^2)$
• Differential expression: $\mu \neq 0$
• Estimating parameters: $\hat{\mu} = \bar{r} = \frac{1}{n}\sum_{i=1}^{n} r_i$, $\quad s^2 = \frac{\sum_{i=1}^{n}(r_i - \bar{r})^2}{n-1}$
• Calculating t-statistic: $t^* = \dfrac{\bar{r}}{s\sqrt{1/n}}$
• "Null Distribution" is the t-distribution with $n-1$ degrees of freedom
[Figure: density of the null t-distribution; x-axis: t-statistics, from -4 to 4]
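The paired t-statistic defined on this slide can be computed directly from the within-array differences. The sketch below assumes hypothetical log-intensities for one gene; only the formulas come from the slide.

```python
import math
from statistics import mean, stdev

def paired_t(xw, xc):
    """Paired t-statistic r_bar / (s * sqrt(1/n)) and its n - 1 degrees of freedom."""
    r = [w - c for w, c in zip(xw, xc)]  # r_i = x_iw - x_ic
    n = len(r)
    r_bar = mean(r)   # estimate of mu
    s = stdev(r)      # sqrt of sum (r_i - r_bar)^2 / (n - 1)
    return r_bar / (s * math.sqrt(1 / n)), n - 1

# Hypothetical values for one gene on 6 two-channel arrays.
xw = [9.0, 8.7, 9.3, 8.8, 9.1, 8.9]
xc = [8.1, 7.9, 8.4, 8.0, 8.2, 7.8]
t_star, df = paired_t(xw, xc)
```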
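Two-sample t-test vs paired t-test
                            Two-sample t-test                                       Paired t-test
t-statistic                 $t^* = \dfrac{\bar{x}_2 - \bar{x}_1}{s_p\sqrt{2/n}}$    $t^* = \dfrac{\bar{r}}{s\sqrt{1/n}}$
• Reference Distribution    $t_{2n-2}$                                              $t_{n-1}$
• Denominator               1.51                                                    0.04
• p-value                   0.870                                                   0.002
[Figure: plot with axes C and W, each running from 6 to 11]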
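To see how the two denominators can differ so much, the sketch below computes both for one made-up gene in which array-to-array variation is large but the within-array difference is stable. The data values are hypothetical and are not the ones behind the slide.

```python
import math
from statistics import stdev, variance

def two_sample_denominator(c, w):
    """s_p * sqrt(2/n) for two groups of equal size n."""
    n = len(c)
    sp2 = (variance(c) + variance(w)) / 2  # pooled variance when group sizes are equal
    return math.sqrt(sp2 * 2 / n)

def paired_denominator(c, w):
    """s * sqrt(1/n), where s is the standard deviation of the differences."""
    r = [wi - ci for ci, wi in zip(c, w)]
    return stdev(r) * math.sqrt(1 / len(r))

# Hypothetical log-intensities: values vary a lot between arrays,
# but W - C is nearly constant within each array.
c = [6.0, 7.1, 8.0, 9.2, 10.1, 11.0]
w = [6.3, 7.3, 8.3, 9.4, 10.4, 11.2]
den_two_sample = two_sample_denominator(c, w)
den_paired = paired_denominator(c, w)
```

With data like these, the between-array variation inflates the pooled standard deviation but cancels out of the differences, so the paired denominator is much smaller.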
Two-sample t-test vs paired t-test
[Figure: two panels titled "Standard Deviations", each comparing Raw and Paired TTest; y-axis: Standard Deviation]
Two-sample t-test vs paired t-test
[Figure: two panels titled "P-values", each comparing Raw and Paired TTest; y-axis: -log10(p-value)]
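Two-sample t-test vs paired t-test
[Figure: scatter plot titled "Two-sample vs Paired t-test"; x-axis: T statistic (0 to 14); y-axis: Paired t-statistic (0 to 40). An annotation relates the two denominators, $s_p\sqrt{2/n}$ and $s\sqrt{1/n}$, and the statistics $t^*$ and $t^*_{paired}$]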
Two-sample t-test vs paired t-test
[Figure: scatter plot titled "Two-sample vs Paired t-test"; x-axis: Two-sample t-test p-value (0.0 to 1.0); y-axis: Paired t-test p-value (0.0 to 1.0)]
• Small advantage for the two-sample t-test purely due to degrees of freedom ($2n-2$ vs $n-1$)
• Bigger possible advantage for the paired t-test due to the smaller denominator (standard error)
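The degrees-of-freedom effect can be isolated by evaluating the same t-statistic against both reference distributions. The sketch below uses only the standard library and approximates the p-value by numerical integration of the t density; the value of `t_star` is hypothetical, and Simpson's rule is just one convenient way to do the integral.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 - 2 * (area under the density on [0, |t|]).
    The area is approximated with Simpson's rule (steps must be even)."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)

n = 6
t_star = 3.0  # hypothetical statistic, identical for both tests
p_two_sample = two_sided_p(t_star, 2 * n - 2)  # reference t with 2n-2 df
p_paired = two_sided_p(t_star, n - 1)          # reference t with n-1 df
```

For an identical statistic, the reference distribution with more degrees of freedom has lighter tails and therefore gives the smaller p-value.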
When is t-test "better" than paired t-test
                  Two-sample t    Paired t
Denominator       0.56            0.64
p-value           0.0008          0.0097
[Figure: plot with axes C (8.8 to 9.6) and W (6.0 to 8.5)]
• Q: Can we use the two-sample t-test in this case, since it gives us a smaller p-value?
• A: NO! Randomization and non-independence issues remain
Multiple Factor Experiments - Incomplete Block Design
[Diagram: an array with two dye channels, Cy 3 and Cy 5; sample labels: Control, Treatment, Treatment 1, Treatment 2]
Multiple Factor Experiments - Incomplete Block Design
• No color effect
• Homogeneous variance
• Optimal
• Homogeneous color effect
• Homogeneous variance
• No color effect
• Homogeneous variance
• Sub-Optimal
• Homogeneous variance
Multiple Factor Experiments - Incomplete Block Design
• Homogeneous Variance
[Diagram: two designs, each connecting the four conditions C, T1, T2, and T1 & T2]
limma
... is a package for the analysis of microarray data, especially the use of linear models for analyzing designed experiments and the assessment of differential expression.
• Specially constructed data objects to represent various aspects of microarray data
• Specially constructed "object methods" for importing, normalizing, displaying and analyzing microarray data
• Unique in the implementation of the empirical Bayes procedure for identifying differentially expressed genes by "borrowing" information from different genes (everything so far has been gene by gene)
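The idea of "borrowing" information across genes can be illustrated with a moderated t-statistic, in which a gene's own variance is shrunk toward a prior variance shared by all genes. The Python sketch below is not limma's interface (limma is an R package). The function name is made up, and the prior degrees of freedom `d0` and prior variance `s0_2` are passed in as assumed inputs, whereas an empirical Bayes procedure would estimate them from the variances of all genes.

```python
import math

def moderated_t(r_bar, s2, n, d0, s0_2):
    """Moderated paired t-statistic for one gene.

    r_bar : mean of the n within-array differences
    s2    : gene-specific sample variance of the differences (n - 1 df)
    d0    : prior degrees of freedom (assumed input)
    s0_2  : prior variance shared across genes (assumed input)
    """
    d = n - 1
    # Shrink the gene-specific variance toward the shared prior variance.
    s2_tilde = (d0 * s0_2 + d * s2) / (d0 + d)
    t_mod = r_bar / math.sqrt(s2_tilde / n)
    return t_mod, d0 + d  # statistic and its degrees of freedom

# Hypothetical gene with a very small variance estimate.
t_mod, df_mod = moderated_t(r_bar=0.25, s2=0.003, n=6, d0=4, s0_2=0.02)
t_plain, df_plain = moderated_t(r_bar=0.25, s2=0.003, n=6, d0=0, s0_2=0.02)
```

With `d0 = 0` the function reduces to the ordinary paired t-statistic. With `d0 > 0`, a gene whose variance happens to be tiny no longer produces an extreme statistic on that basis alone.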