Transcript ppt

Identifying expression differences in
cDNA microarray experiments
Lecture 20, Statistics 246,
April 6, 2004
1
Introduction
•
•
•
•
•
Many microarray experiments are carried out to find genes which are
differentially expressed between two (or more) samples of cells. Examples
abound:
cells (from the liver, say), in a mouse with a gene knocked out, compared
with liver cells in a normal mouse of the same strain
cells in one region of the brain (say cerebellum), compared with cells in a
different region (say the anterior cingulate region)
tumor cells in some organ (say the liver), compared with normal cells from
the same organ
cells from an organism (say yeast) after a treatment (say by heat, or cold, or
a drug) compared with cells of the same kind in the untreated state
cells from some part of a developing organ or organism at one time,
compare with cells of the same kind at a later time, and so on
It should be clear that the number of such comparisons is limited only by the
imagination of the biologist, at least at the moment, when details of so many
genetic programs (drug response, development, tumorigenesis, ..) are
incomplete.
2
Preliminary remarks
Initially, comparative microarray experiments were done with few, if any
replicates, and statistical criteria were not used for identifying differentially
expressed genes. Instead, simple criteria were used such as fold-change,
with 2-fold being a popular cut-off. This was sometimes done without
regard to the variability present in the experiment, and, depending on the
experiment, could be too liberal, naming genes that were not differentially
expressed (false positives, or errors of the first kind), or too conservative,
failing to identify genes that were differentially expressed (false negatives,
errors of the second kind).
The relative importance of false positives and false negatives, depends on
the context: the aims of the experiment (e.g. were the investigators seeking
broad patterns, or specific genes), and the follow-up experiments
envisaged (e.g. validation of findings by a more precise technique).
It did not take long for people to want to assign statistical significance to their
findings concerning differentially expressed genes. Could p-values be
attached, confidence statements be made, and so on? These questions
raised a number of issues which were unfamiliar to the molecular
biologists doing the experiments: replication, systematic versus random
differences, multiplicity of tests, etc.
3
Preliminary remarks, cont.
It was eventually realized on that biological replicates are important for
reaching statistically sound conclusions, but even here the story is not
simple, as many systematic features remain even in experiments with
“independent” replicates.
The fact that many thousands of comparisons were being carried out made
use of traditional cut-offs (+/- 2SD, or p < 0.05) inappropriate also became
clear to people. Strict control of type 1 error rates (we’ll be more precise in
the next lecture) turned out to be asking too much in this microarray context,
and different criteria for “controlling” errors rates have come to the fore,
most notably false discovery rates (FDR). However, there still remains a
need for appropriate theory. Modelling large-scale microarray experiments is
not a simple task: the difference between assuming and actually having
independence can be great.
Where are we now? Depending on the context, some researchers can make
use of traditional multiple testing procedures. Others make use of FDR
notions, which are more widely applicable, but lead to weaker conclusions.
Yet others have realized that is is unreasonable (given their sample sizes)
to expect “statistical significance” for all their results, and instead seek
4
evidence of “biological significance”, and validation by suitable follow-up.
Preliminary remarks, concluded
Where are we now?, cont. Two issues have come to the fore in recent years.
The first is the interpretation of the lists of genes determined (by whatever
means) to be differentially expressed: What kinds genes are they, i.e. what is
their function (DNA binding, protease,..) ? What cellular pathways or
processes are involved (replication, cell death..)? Where in the cell are these
genes operating (in ribosomes, mitochondria,..)? The question becomes: are
genes of a given class over-represented in the list of differentially expressed
genes? As there are many classes, there are many such questions, and so
the issue of multiplicity comes up once more.
The second is the determination of significance for sets of genes given a
priori, attempting to answer some form of the question: is this (specific) set
of genes differentially expressed? The idea here is that a particular set of
genes (say those involved in oxidative phosphorylation) might have all or
many changed a little, and that this pattern of change might be “significant”,
although the individual changes are not. Of course we won’t begin with just
one set of genes, but many, so multiplicity questions arise here too.
In both of these questions, how we rank the genes will be important, and
5 the
cut-offs less so.
Some motherhood statements
Important aspects of a statistical analysis include:
• Tentatively separating systematic from random sources of
variation
• Removing the systematic and quantifying the random, when the
system is in control
• Identifying and dealing with the most relevant source of
variation in subsequent analyses
Only if this is done can we hope to make more or less valid
probability statements.
6
The simplest cDNA microarray data analysis problem is
identifying differentially expressed genes using one slide
•
•
•
•
This is a common enough hope
Efforts are frequently successful
It is not hard to do by eye
The problem is probably beyond formal statistical inference (valid pvalues, etc) for the foreseeable future….why?
In the next two panels, genes found to be up- or down-regulated in an 8
treatment (Srb1 over-expression) versus 8 control comparison are
indicated in red and green, respectively, on plots of the data from single
hybridizations. Also depicted are “confidence lines” determined by
different people, which claim to be able to delineate differentially
expressed genes using just one hybridization (slide), the different lines
corresponding to different methods and/or different “confidence” levels.
What do we see?
7
Matt Callow’s Srb1 dataset (#5).
Newton’s and Chen’s single slide method
8
Matt Callow’s Srb1 dataset (#8).
9
Newton’s, Sapir & Churchill’s and Chen’s single slide method
The second simplest cDNA microarray data analysis
problem is identifying differentially expressed
genes using replicated hybridizations
There are a number of different aspects:
• First, between-slide normalization; then
• What should we look at: averages, SDs,
t-statistics, other summaries?
• How should we look at them?
• Can we make valid probability statements?
10
Apo AI experiment (Matt Callow, LBNL)
Goal. To identify genes with altered expression in the livers of
Apo AI knock-out mice (T) compared to inbred C57Bl/6 control
mice (C).
• 8 treatment mice and 8 control mice
• 16 hybridizations: liver mRNA from each of the 16 mice
(Ti , Ci ) is labelled with Cy5, while pooled liver mRNA from the
control mice (C*) is labelled with Cy3.
• Probes: ~ 6,000 cDNAs (genes), including 200 related to
lipid metabolism.
11
12
Which genes have changed?
When permutation testing possible
1. For each gene and each hybridization (8 ko + 8 ctl), use
M=log2(R/G).
2. For each gene form the t-statistic:
average of 8 ko Ms - average of 8 ctl Ms
sqrt(1/8 (SD of 8 ko Ms)2 + (SD of 8 ctl Ms)2)
3. Form a histogram of 6,000 t-values.
4. Do a normal qq-plot; look for values “off the line”.
5. Permutation testing (next lecture).
6. Adjust for multiple testing (next lecture).
13
Histogram & normal qq-plot of t-statistics
ApoA1
14
What is a normal qq-plot?
We have a random sample, say ti, i=1, …,n, which we believe
might come from a normal distribution. If it did, then for suitable
 and , ((ti-)/), i=1,…n would be uniformly distributed on
[0,1](why?), where  is the standard normal c.d.f.. Denoting the
order statistics of the t-sample by t(1) ,t(2) ,….,t(n) we can then see
that ((t(i) -)/) should be approximately i/n (why?). With this in
mind, we’d expect t(i) to be about -1(i/n) +  (why?).
Thus if we plot t(i) against -1((i+1/2/(n+1)), we might expect to
see a straight line of slope about  with intercept about . (The
1/2 and 1 in numerator and denominator of the i/n are to avoid
problems at the extremes.)
This is our normal quantile-quantile plot, the i/n being a quantile
of the uniform, and the -1 being that of the normal.
15
Why do a normal q-q plot?
One of the things we want to do with our t-statistics is roughly
speaking, to identify the extreme ones.
It is natural to rank them, but how extreme is extreme? Since
the sample sizes here are not too small ( two samples of 8 each
gives 16 terms in the difference of the means), approximate
normality is not an unreasonable expectation for the null
marginal distribution.
Converting ranked t’s into a normal qq-plot is a great way to
see the extremes: they are the ones that are “off the line”, at
one end or another. This technique is particularly helpful when
we have thousands of values. Of course we can’t expect all
differentially expressed genes to stand out as extremes: many
will be masked by more extreme random variation, which is a
big problem in this context. See next lecture for a discussion
16 of
these issues.
gene
t
index
statistic
2139
-22
Apo AI
4117
-13
EST, weakly sim. to STEROL DESATURASE
5330
-12
CATECHOL O-METHYLTRANSFERASE
1731
-11
Apo CIII
538
-11
EST, highly sim. to Apo AI
1489
-9.1
EST
2526
-8.3
Highly sim. to Apo CIII precursor
4916
-7.7
similar to yeast sterol desaturase
941
-4.7
2000
+3.1
5867
-4.2
4608
+4.8
948
-4.7
5577
-4.5
Gene annotation
17
Useful plots of t-statistics
18
Which genes have changed?
Permutation testing not possible
Our current approach is to use averages, SDs, t-statistics
and a new statistic we call B, inspired by empirical
Bayes. It is equivalent to a moderated t-statistic.
We hope in due course to calibrate B and use that as our
main tool.
We begin with the motivation, using data from a study in
which each slide was replicated four times.
19
Results from 4 replicates
Points to note
One set (green) has a high average M but also a high variance and a low t.
Another (pale blue) has an average M near zero but a very small variance,
leading to a large negative t.
A third (dark blue) has a modest average M and a low variance, leading to a
high positive t.
A fourth (purple) has a moderate average M and a moderate variance, leading
to a small t.
Another pair (yellow, red) have moderate average Ms and middling variances,
and moderately large ts.
Which do we regard most favourably? Let’s look at M and t jointly.
21
•M
•t
•t M
Sets defined by cut-offs: from the Apo AI ko experiment
22
•M
•t
•t M
Results from the Apo AI ko experiment
23
•M
•t
•t M
Apo AI experiment: t vs average A.
24
What have we concluded?
Using average M alone, we ignore useful information in the SD across
replicates. Some large values are large because of outliers.
Using t alone, we are vulnerable to being misled by small SDs. With 1000s
of genes, some SDs will be very small. Some will be very large too.
Formal testing can sort out these issues for us, but if we simply want to rank,
what should we rank on? Neither average M nor t seems appropriate.
One approach (SAM) is to inflate the SDs slightly. Another approach is based
on the following empirical Bayes story. It can go further. There are a number
of variants.
25
An empirical Bayes story
Suppose that our M values are independently and normally distributed, and
that a proportion p of genes are differentially expressed, i.e. have M’s with
non-zero means.
Further, suppose that the variances and means of these are chosen jointly
from inverse chi-square and normal conjugate priors, respectively.
Genes not differentially expressed have zero means, and variances chosen
from the same inverse chi-squared distribution.
The scale and d.f. parameters in the inverse chi-square are estimated from
the data, as is a parameter c connecting the prior for the mean with that for
the variances.
We then calculate for the posterior probability that a given gene is
differentially expressed, and find it is an increasing function of B over the
page, where a and c are estimated parameters, and p is in the constant.
26
Empirical Bayes log posterior odds ratio (LOR)
 2a
2
2 
 s  M
 n

B  const  log
2
2a  s2  M 
 n
1  nc 
Notice that for large n this approximately t=M./s .
27
B=LOR compared with t and M.
28
Extensions include dealing with
•
•
•
•
Replicates within and between slides
Several effects: use a linear model
ANOVA: are the effects equal?
Time series: selecting genes for trends
29
Summary (for the second simplest problem)
• Microarray experiments typically have thousands of
genes, but only few (1-10) replicates for each gene.
• Averages can be driven by outliers.
• t-statistics can be driven by tiny variances.
• B = LOR will, we hope
– use information from all the genes
– combine the best of M. and t
– avoid the problems of M. and t
Ranking on B could be helpful.
30
Acknowledgments
Yee Hwa Yang
Sandrine Dudoit
Ingrid Lönnstedt
Natalie Thorne
David Freedman
Ngai lab, UCB
Matt Callow (LBNL)
31