Transcript: Slide 1
Data Analysis for Gene Chip Data
Part I: One-gene-at-a-time methods
Min-Te Chao
2002/10/28
1
Outline
• Simple description of gene chip data
• Earlier works
• Multiple t-test and SAM
• Lee’s ANOVA
• Wong’s factor models
• Efron’s empirical Bayes
2
Remarks
• Most of this work is statistical analysis, not
really of the machine learning type
• The training samples are very small, not to
mention the test samples
• Medical research needs scientific rigor
wherever possible
3
Arthritis and Rheumatism
• Guidelines for the submission and reviews
of reports involving microarray technology
v.46, no. 4, 859-861
4
Reproducibility
• Should document the accuracy and
precision of data, including run-to-run
variability of each gene
• No arbitrary setting of thresholds (e.g., 2-fold)
• Careful evaluation of false discovery rate
5
Statistical Analysis
• Statistical analysis is absolutely necessary
to support claims of an increase or
decrease of gene expression
• Such rigor requires multiple experiments
and analysis using standard statistical
instruments.
6
Sample Heterogeneity
• … Strongly recommends that investigators
focus studies on homogeneous cell
populations until other methodological and
data analysis problems can be resolved.
7
Independent Confirmation
• It is important that the findings be
confirmed using an independent method,
preferably with separate samples rather
than retesting of the original mRNA.
8
Microarray
• Other terms:
DNA array
DNA chips
biochips
Gene chips
9
• The underlying principle is the same for all
microarrays, no matter how they are made
• Gene function is the key element
researchers want to extract from the
sequence
• DNA array is one of the most important
tools
(Nature, v.416, April 2002 885-891)
10
2 types of microarray
• cDNA
• Oligonucleotides
• DIY type
11
• A microarray
allows researchers to determine which
genes are being expressed in a given cell
type at a particular time and under
particular conditions
(gene expression)
12
Basic data form
• On each array, there are p “spots” (p>1000,
sometimes 20000). Each spot has k
probes (k=20 or so). There are usually 2k
measurements (expressions) per spot,
and the k differences, or the difference of
logs, are used.
• Sometimes only a summary statistic
(e.g., median or mean) is given per spot
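
As a rough illustration of the per-spot reduction described above, the following Python sketch turns 2k measurements into k differences (or differences of logs) and then into summary statistics. It assumes the 2k measurements arrive as k pairs; the names `pm`, `mm` and `spot_summary` are hypothetical, and the base-2 logarithm is an arbitrary choice.

```python
import math
import statistics


def spot_summary(pm, mm, use_logs=True):
    """Reduce k paired probe measurements of one spot to k differences
    and their mean and median.

    pm, mm: the two measurements of each probe pair (assumed layout).
    use_logs: if True, use the difference of base-2 logs.
    """
    if use_logs:
        diffs = [math.log2(p) - math.log2(m) for p, m in zip(pm, mm)]
    else:
        diffs = [p - m for p, m in zip(pm, mm)]
    return {
        "differences": diffs,
        "mean": statistics.fmean(diffs),
        "median": statistics.median(diffs),
    }


# Toy spot with k = 3 probe pairs (made-up intensities).
summary = spot_summary([8, 16, 32], [4, 4, 4])
print(summary["mean"], summary["median"])
```

With real chips k would be around 20, and the positive intensities required by the log version are not guaranteed, so some preprocessing would be needed first.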
13
• Each spot corresponds to a “gene”
• For each study, we can arrange the chips
so that the i-th spot represents the i-th
gene. (genes close in index may not be
close physically at all)
• This means that when we read the i-th
spot of all chips in one study, we know we
get different measurements of the same i-th gene
14
• The data for one chip can be written as
Y; X_1, X_2, …, X_p
just like one observation in a regression setup.
In practice, however, n (the number of chips
used) is small compared with p.
Y is the response: cell type, experimental
condition, survival time, …
15
• For a spot with 20 probes, see Efron et al.
(2001, JASA, p.1153).
16
Earlier works
• Cluster analysis
• Fold methods
• Multiple t with Bonferroni correction
17
Multiple t with Bonferroni correction
• It is too conservative
• Family-wise error rate (FWER):
among G tests, the probability of at least
one false rejection. Without correction it
goes to 1 at an exponential rate in G
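
To see how quickly the uncorrected family-wise error rate grows, here is a small Python sketch. It assumes G independent tests, each run at level alpha, with every null hypothesis true; the function name is made up for illustration.

```python
def familywise_error_rate(alpha, num_tests):
    """P(at least one false rejection) among `num_tests` independent
    tests at level `alpha`, when all null hypotheses are true."""
    return 1.0 - (1.0 - alpha) ** num_tests


for g in (1, 10, 100, 1000):
    print(g, round(familywise_error_rate(0.05, g), 4))
```

Since 1 - (1 - alpha)^G = 1 - exp(G log(1 - alpha)), the gap to 1 shrinks exponentially in G, which is why some correction is unavoidable when thousands of genes are tested.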
18
Sidak’s single-step adjusted p-value
p’=1-(1-p)^G
Bonferroni’s single-step adjusted p-value
p’=min{Gp,1}
All are very conservative
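
The two single-step adjustments above can be written directly in Python. This is only a sketch of the formulas on this slide; the function names are hypothetical.

```python
def sidak_adjust(p, num_tests):
    """Sidak single-step adjusted p-value: 1 - (1 - p)^G."""
    return 1.0 - (1.0 - p) ** num_tests


def bonferroni_adjust(p, num_tests):
    """Bonferroni single-step adjusted p-value: min(G * p, 1)."""
    return min(num_tests * p, 1.0)


# A raw p-value of 0.001 among G = 100 tests.
print(sidak_adjust(0.001, 100), bonferroni_adjust(0.001, 100))
```

The Sidak value is never larger than the Bonferroni value, but for small p the two are nearly identical, so both are conservative in the same way.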
19
FDR: false discovery rate
• Roughly: among all rejected hypotheses, what
proportion are rejected wrongly?
(Benjamini and Hochberg 1995 JRSSB,
289-300) “Sequential p-method”
20
Sequential p-method
• Using the observed data, it estimates the
rejection region so that
FDR ≤ alpha
• Order all p-values, from small to large, and
obtain a k so that the first k hypotheses (those
with the k smallest p-values) are rejected.
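
A minimal Python sketch of the step-up rule usually associated with Benjamini and Hochberg: the i-th smallest p-value is compared with i*alpha/G, and k is the largest i that passes. The function name and the example p-values are illustrative and do not come from the slides.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of the hypotheses rejected by the step-up rule.

    k is the largest rank i (1-based, p-values sorted ascending) with
    p_(i) <= i * alpha / G; the k smallest p-values are rejected.
    """
    num_tests = len(p_values)
    order = sorted(range(num_tests), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / num_tests:
            k = rank  # keep the largest rank that passes
    return sorted(order[:k])


p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p, alpha=0.05))
```

Note that k is the largest passing rank, so a hypothesis can be rejected even if its own p-value sits above its threshold, as long as a later one passes.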
21
• Since FDR is a different (less stringent)
error to control, it will increase the “power”
• For modifications, see Storey (2002,
JRSSB, 479-498)
• These are criteria specifically designed to
handle risk assessment when G is large
22
Role of permutation
• For tests (multiple or not), it is important to
use a null distribution
• It is generated by a well-designed
permutation (of the columns of the data
matrix) –column refers to observations, not
genes.
23
One simple example
• Let us say we look at the first gene, with
n_1 arrays for treatment and n_2 arrays
for control
• We use a t-statistic, t_1, say. What is the
p-value corresponding to this observed
t_1?
24
• Permute the n=n_1+n_2 columns of
the data matrix. Look at the first row
(corresponds to the first gene)
• Treat the first n_1 numbers as a fake
“treatment”, the last n_2 numbers as a
fake “control” , compute a t-value, say we
get s_1
25
• Permute again and do the same thing and
we get s_2, ….
• Do it B times and get s_1, s_2, …., s_B
• Treat these s’s as a (bootstrap) sample for
the null distribution of the t_1 statistic
• The p-value of the earlier t_1 is found from
the ecdf of the s_j, j=1,2,…,B
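
The permutation steps above can be sketched in Python for a single gene. Several details are assumptions rather than part of the slides: the Welch form of the t-statistic, a two-sided p-value, the (count + 1)/(B + 1) convention that avoids a p-value of exactly zero, and all names and data values.

```python
import math
import random
import statistics


def t_statistic(treatment, control):
    """Two-sample t-statistic (Welch form, an assumption)."""
    n1, n2 = len(treatment), len(control)
    m1, m2 = statistics.fmean(treatment), statistics.fmean(control)
    v1, v2 = statistics.variance(treatment), statistics.variance(control)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)


def permutation_p_value(treatment, control, B=2000, seed=0):
    """Return (t_1, p-value), with the null distribution of t_1
    approximated by B random relabelings of the n_1 + n_2 values."""
    rng = random.Random(seed)
    n1 = len(treatment)
    pooled = list(treatment) + list(control)
    t_obs = t_statistic(treatment, control)
    exceed = 0
    for _ in range(B):
        rng.shuffle(pooled)  # permute the columns
        s_j = t_statistic(pooled[:n1], pooled[n1:])  # fake treatment/control
        if abs(s_j) >= abs(t_obs):
            exceed += 1
    return t_obs, (exceed + 1) / (B + 1)


# Made-up expression values for one gene: n_1 = 6, n_2 = 6.
treatment = [5.1, 4.8, 5.6, 5.3, 4.9, 5.4]
control = [3.9, 4.2, 3.7, 4.1, 4.4, 3.8]
t_1, p_value = permutation_p_value(treatment, control)
print(t_1, p_value)
```

For a whole chip, the same column permutation would be applied to every gene at once, so that dependence between genes is preserved in the null distribution.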
26
• Permutation plays a major role --- finding a
reference measure of variation in various
situations
• For a well designed experiment with
microarray, DOE techniques will play an
important role in determining how to do
proper permutations.
27
SAM: significance analysis of
microarrays
• A standard method of microarray analysis,
taught many times in Stanford short
courses on data mining
• Modified multiple t-tests
• Using the permutation of certain data
columns to evaluate variation of data in
each gene
28
• The original paper is hard to read:
(Tusher, Tibshirani and Chu, PNAS 2001,
v.98, no.9, 5116-5121)
• But the SAM manual is a lot easier to read
for statisticians (free software for
academic use)
29
• D(i) = {X_treatment(i) - X_control(i)} / {s(i) + s_0},
i = 1, 2, …, G
• Ordered values: D(1) < D(2) < … < D(G)
• s_0, used in SAM, is a carefully determined
constant > 0.
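
A small Python sketch of the statistic D(i) for one gene. It assumes that X_treatment and X_control are group means and that s(i) is the pooled standard error of their difference. SAM has its own rule for choosing s_0, which is not reproduced here; the value is simply passed in by hand, and all names and data are hypothetical.

```python
import math
import statistics


def pooled_standard_error(treatment, control):
    """Pooled standard error of the difference of the two group means."""
    n1, n2 = len(treatment), len(control)
    m1, m2 = statistics.fmean(treatment), statistics.fmean(control)
    ss = sum((x - m1) ** 2 for x in treatment) + sum((x - m2) ** 2 for x in control)
    return math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))


def relative_difference(treatment, control, s0):
    """D(i) = (mean_treatment - mean_control) / (s(i) + s_0)."""
    diff = statistics.fmean(treatment) - statistics.fmean(control)
    return diff / (pooled_standard_error(treatment, control) + s0)


# A gene with a tiny difference but an even tinier variance (made-up data).
treatment = [1.00, 1.01, 1.02]
control = [0.97, 0.98, 0.99]
print(relative_difference(treatment, control, s0=0.0))  # ordinary t-statistic
print(relative_difference(treatment, control, s0=0.1))  # damped by s_0
```

The example shows the purpose of s_0: without it, genes with very small s(i) produce large statistics from negligible differences.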
30
• The same statistics, D(i)*, are computed under
a certain group of permutations of the columns;
the D(i)* are also ordered
• Plot D vs. D*; points that deviate from the
45-degree line by more than a threshold Delta
signal a significant change in expression.
• Vary the value of Delta to get different
FDRs.
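
The comparison of ordered D with ordered D* can be sketched as follows. This is a simplified illustration, not the SAM software: it uses random column permutations, averages the ordered D* over B permutations, flags every gene whose ordered D differs from that average by more than Delta, and leaves out the FDR estimate. The value of s_0, the simulated data and all names are assumptions.

```python
import math
import random
import statistics


def d_statistic(values, treat_idx, ctrl_idx, s0):
    """Mean difference over (pooled standard error + s0) for one gene."""
    t = [values[j] for j in treat_idx]
    c = [values[j] for j in ctrl_idx]
    m1, m2 = statistics.fmean(t), statistics.fmean(c)
    ss = sum((x - m1) ** 2 for x in t) + sum((x - m2) ** 2 for x in c)
    s = math.sqrt((1 / len(t) + 1 / len(c)) * ss / (len(t) + len(c) - 2))
    return (m1 - m2) / (s + s0)


def sam_like_calls(data, n1, delta, s0, B=200, seed=0):
    """data: one row per gene, one column per chip; the first n1 columns
    are treatment. Returns indices of genes flagged at threshold delta."""
    rng = random.Random(seed)
    num_genes, n = len(data), len(data[0])
    cols = list(range(n))
    d_obs = [d_statistic(row, cols[:n1], cols[n1:], s0) for row in data]
    order = sorted(range(num_genes), key=lambda g: d_obs[g])
    expected = [0.0] * num_genes  # average ordered D* over permutations
    for _ in range(B):
        perm = cols[:]
        rng.shuffle(perm)
        d_star = sorted(d_statistic(row, perm[:n1], perm[n1:], s0) for row in data)
        for i, value in enumerate(d_star):
            expected[i] += value / B
    # Distance from the 45-degree line in the plot of D against D*.
    return [g for i, g in enumerate(order) if abs(d_obs[g] - expected[i]) > delta]


# Simulated study: 30 genes, 4 + 4 chips, gene 0 strongly up-regulated.
rng = random.Random(1)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(30)]
data[0] = [10 + rng.gauss(0, 1) for _ in range(4)] + [rng.gauss(0, 1) for _ in range(4)]
print(sam_like_calls(data, n1=4, delta=1.0, s0=0.5))
```

Raising `delta` shrinks the list of flagged genes, which is the knob the slide refers to for trading the number of calls against the FDR.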
31
Other model-based methods
• Wong’s model
PM-MM= \theta \phi + \epsilon
Outlier detection
Model validation
Li and Wong (2001, PNAS v.98, no.1, 31-36)
32
Lee’s work
• ANOVA based
• May do unbalanced data – e.g., 7
microarray chips
(Lee et al. 2000, PNAS, v.97, 9834-9839)
33
Empirical Bayes
• (Efron et al. (2001) JASA, v.96, 1151-1160)
• Use a mixture model
f(z)=p_0 f_0(z)+p_1 f_1(z)
with f_0, f_1 estimated from the data.
p_1=prior probability that a gene's expression is
affected (by a treatment); p_0=1-p_1
34
• A key idea is to use permuted (columns)
data to estimate f_0
• Uses a tricky logistic regression method
• The end result is
p_1(Z)= the posterior probability that a
gene with expression level Z is affected
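
By Bayes' rule, the mixture f(z) = p_0 f_0(z) + p_1 f_1(z) gives the posterior probability p_1(z) = 1 - p_0 f_0(z)/f(z). The Python sketch below evaluates this with known normal densities standing in for f_0 and f_1. That stand-in is purely an assumption for illustration; in the cited paper the densities are estimated from the data, with permutations used for f_0. All names are hypothetical.

```python
import math


def normal_pdf(z, mean, sd):
    """Density of a normal distribution at z."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def posterior_affected(z, p0, f0, f1):
    """p_1(z) = 1 - p_0 * f_0(z) / f(z), with f = p_0 f_0 + (1 - p_0) f_1."""
    mixture = p0 * f0(z) + (1.0 - p0) * f1(z)
    return 1.0 - p0 * f0(z) / mixture


def f0(z):
    return normal_pdf(z, 0.0, 1.0)  # unaffected genes (assumed)


def f1(z):
    return normal_pdf(z, 3.0, 1.0)  # affected genes (assumed)


for z in (0.0, 1.5, 4.0):
    print(z, round(posterior_affected(z, 0.9, f0, f1), 4))
```

At z = 1.5 the two densities are equal, so the posterior equals the prior p_1 = 0.1; further out in the tail the posterior approaches 1.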
35
Part I conclusion
• Earlier methods are relatively easy to
understand, but getting familiar with the
bio-language takes time
• More powerful data analytic methods will
continue to be developed
• It is important to first understand the basic
problems of biologists before we jump in with
fancy statistical methods
36
• We may do the wrong problem …
• But if the problem is relevant, even simple
methods can get good recognition
• All methods so far are “first moment
only”, i.e., not too different from
multiple t-tests; or, they all are
one-gene-at-a-time methods.
37
• We did not address issues about data
cleaning, outlier detection, normalization,
etc. Microarray data are highly noisy, and
these problems are by no means trivial.
• As the cost per chip goes down, the
number of chips per problem may grow.
But well-designed experiments, e.g.,
fractional factorial designs, still have a
role to play in this game
38
• Statistical methods, as compared with
machine-learning-based methods, will play a
more important role for this type of data
since, with a model, parametric or not, one
can attach a measure of confidence to the
claimed result. This is crucial for scientific
development.
39
Quote:
• The statistical literature for microarrays,
still in its infancy and with much of it
unpublished, has tended to focus on
frequentist data-analytical devices, such
as cluster analysis, bootstrapping and
linear models. (Efron, B. 2001)
40