Part 1 Microarray Timeseries Analysis with replicates OSM

Download Report

Transcript Part 1 Microarray Timeseries Analysis with replicates OSM

Summer Inst. Of Epidemiology and
Biostatistics, 2008:
Gene Expression Data Analysis
8:30am-12:30pm in Room W2017
Carlo Colantuoni – [email protected]
http://www.biostat.jhsph.edu/GenomeCAFE/GeneExpressionAnalysis/GEA2008.htm
Class Outline
• Basic Biology & Gene Expression Analysis Technology
• Data Preprocessing, Normalization, & QC
• Measures of Differential Expression
• Multiple Comparison Problem
• Clustering and Classification
• The R Statistical Language and Bioconductor
• GRADES – independent project with Affymetrix data.
http://www.biostat.jhsph.edu/GenomeCAFE/GeneExpressionAnalysis/GEA2008.htm
Class Outline - Detailed
•
Basic Biology & Gene Expression Analysis Technology
–
–
–
•
Data Preprocessing, Normalization, & QC
–
–
–
–
–
–
–
•
Bonferroni
False Discovery Rate Analysis (FDR)
Differential Expression of Functional Gene Groups
–
–
–
–
–
–
•
Basic Statistical Concepts
T-tests and Associated Problems
Significance analysis in microarrays (SAM) [ & Empirical Bayes]
Complex ANOVA’s (limma package in R)
Multiple Comparison Problem
–
–
•
Intensity Comparison & Ratio vs. Intensity Plots (log transformation)
Background correction (PM-MM, RMA, GCRMA)
Global Mean Normalization
Loess Normalization
Quantile Normalization (RMA & GCRMA)
Quality Control: Batches, plates, pins, hybs, washes, and other artifacts
Quality Control: PCA and MDS for dimension reduction
Measures of Differential Expression
–
–
–
–
•
The Biology of Our Genome & Transcriptome
Genome and Transcriptome Structure & Databases
Gene Expression & Microarray Technology
Functional Annotation of the Genome
Hypergeometric test?, Χ2, KS, pDens, Wilcoxon Rank Sum
Gene Set Enrichment Analysis (GSEA)
Parametric Analysis of Gene Set Enrichment (PAGE)
geneSetTest
Notes on Experimental Design
Clustering and Classification
–
–
–
Hierarchical clustering
K-means
Classification
•
•
•
LDA (PAM), kNN, Random Forests
Cross-Validation
Additional Topics
–
–
–
The R Statistical Language
Bioconductor
Affymetrix data processing example!
DAY #3:
Measures of Differential Expression:
Review of basic statistical concepts
T-tests and associated problems
Significance analysis in microarrays (SAM)
(Empirical Bayes)
Complex ANOVA’s (“limma” package in R)
Multiple Comparison Problem:
Bonferroni
FDR
Differential Expression of Functional Gene Groups
Notes on Experimental Design
Slides from Rob Scharpf
Fold-Change?
T-Statistics?
Some genes are more variable than others
Slides from Rob Scharpf
Slides from Rob Scharpf
Slides from Rob Scharpf
Slides from Rob Scharpf
distribution of
distribution of
Slides from Rob Scharpf
Slides from Rob Scharpf
X1-X2 is normally distributed if X1 and X2 are normally distributed – is this the case in microarray data?
Slides from Rob Scharpf
Problem 1: T-statistic not t-distributed.
Implication: p-values/inference incorrect
P-values by permutation
• It is common that the assumptions used
to derive the statistics are not
approximate enough to yield useful pvalues (e.g. when T-statistics are not T
distributed.)
• An alternative is to use permutations.
p-values by permutations
We focus on one gene only. For the bth iteration, b = 1,  , B;
1. Permute the n data points for the gene (x). The first n1 are
referred to as “treatments”, the second n2 as “controls”.
2. For each gene, calculate the corresponding two sample
t-statistic, tb.
After all the B permutations are done:
3. p = # { b: |tb| ≥ |tobserved| } / B
4. This does not yet address the issue of multiple tests!
Another problem with t-tests
The volcano plot shows, for a particular test, negative
log p-value against the effect size (M).
Remember this?
Problem 2: t-statistic bigger for genes with smaller standard
error estimates.
Implication: Ranking might not be optimal
Problem 2
• With low N’s SD estimates are unstable
• Solutions:
– Significance Analysis in Microarrays (SAM)
– Empirical Bayes meothods and Stein
estimators
Significance analysis in
microarrays (SAM)
• A clever adaptation of the t-ratio to
borrow information across genes
• Implemented in Bioconductor in the
siggenes package
Significance analysis of microarrays applied to the ionizing
radiation response, Tusher et al., PNAS 2002
SAM d-statistic
• For gene i :
yi  xi
di 
si  s0
y i  mean of sample 1



x i  mean of sample 2
si  Standard deviation of residuals for gene i
s0  Exchangeability factor estimated using all genes
Scatter plots of relative difference (d) vs standard
deviation (s) of repeated expression measurements
Random fluctuations
in the data, measured by
balanced permutations
(for cell line 1 and 2)
Relative difference for
a permutation of the data
that was balanced
between cell lines 1 and
2.
Selected genes:
Beyond expected distribution
SAM produces a modified T-statistic (d),
and has an approach to the multiple
comparison problem.
eBayes: Borrowing Strength
• An advantage of having tens of thousands of genes is
that we can try to learn about typical standard deviations
by looking at all genes
• Empirical Bayes gives us a formal way of doing this
• “Shrinkage” of variance estimates toward a “prior”:
moderated t-statistics
• Implemented in the limma package in R. In addition,
limma provides methods for more complex experimental
designs beyond simple, two-sample designs.
The Multiple Comparison
Problem
(some slides courtesy of John Storey)
Hypothesis Testing
•
Test for each gene:
Null Hypothesis: no differential expression.
•
Two types of errors can be committed
–
Type I error or false positive (say that a gene is differentially
expressed when it is not, i.e., reject a true null hypothesis).
–
Type II error or false negative (fail to identify a truly
differentially expressed gene, i.e.,fail to reject a false null hypothesis)
Hypothesis testing
• Once you have a given score for each gene,
how do you decide on a cut-off?
• p-values are most common.
• How do we decide on a cut-off when we are
looking at many 1000’s of “tests”?
• Are 0.05 and 0.01 appropriate? How many
false positives would we get if we applied
these cut-offs to long lists of genes?
Multiple Comparison Problem
• Even if we have good approximations of
our p-values, we still face the multiple
comparison problem.
• When performing many independent
tests, p-values no longer have the same
interpretation.
Bonferroni Procedure
a = 0.05
# Tests = 1000
a = 0.05 / 1000 = 0.00005
or
p = p * 1000
Bonferroni Procedure
Too conservative.
How else can we interpret many
1000’s of observed statistics?
Instead of evaluating each statistic
individually, can we assess a list of
statistics: FDR (Benjamini & Hochberg 1995)
FDR
• Given a cut-off statistic, FDR gives us an
estimate of the proportion of hits in our list of
differentially expressed genes that are false.
Null = Equivalent Expression; Alternative = Differential Expression
False Discovery Rate
• The “false discovery rate” measures the proportion of false
positives among all genes called significant:
# false positives
# called significant
• This is usually appropriate because one wants to find as
many truly differentially expressed genes as possible with
relatively few false positives
• The false discovery rate gives an estimate of the rate at
which further biological verification will result in dead-ends
Distribution of Statistics
Distribution of Observed (black) and Permuted (red+blue) Correlations (r)
N=90
2
1
Observed
0
Density
3
Permuted
-0.4
-0.2
0.0
Statistic
Correlation (r)
0.2
0.4
Correlation of Age with Gene Expression
False Pos.
FDR =
Total Pos.
0.10
Permuted
=
Observed
0.05
Observed
Permuted
0.00
Density
0.15
0.20
Distribution of Observed (black) and Permuted (red+blue) Correlations (r)
-0.45
-0.40
Correlation
(r)
Correlation (r)
-0.35
-0.30
Distribution of p-values
Distribution of p-values from Observed (Black) and Permuted Data
2.0
N=90
0.5
1.0
Permuted
0.0
Density
1.5
Observed
0.0
0.2
0.4
p-value
p-value
0.6
0.8
1.0
FDR = False Positives/Total Positive Calls
This FDR analysis requires enough samples in
each condition to estimate a statistic for each
gene: observed statistic distribution.
And enough samples in each condition to
permute many times and recalculate this
statistic: null statistic distribution.
What if we don’t have this?
FDR = 0.05
Beyond ±0.9
FDR = 0.05
Beyond ±0.9
False Positive Rate
versus False Discovery Rate
• False positive rate is the rate at which
truly null genes are called significant
# false positives
FPR 
# truly null
• False discovery rate is the rate at which
significant genes are truly null
FDR 
# false positives
# called significant
False Positive Rate and P-values
• The p-value is a measure of significance
in terms of the false positive rate (aka
Type I error rate)
• P-value is defined to be the minimum
false positive rate at which the statistic
can be called significant
• Can be described as the probability a
truly null statistic is “as or more
extreme” than the observed one
False Discovery Rate and Q-values
• The q-value is a measure of
significance in terms of the false
discovery rate
• Q-value is defined to be the minimum
false discovery rate at which the statistic
can be called significant
• Can be described as the probability a
statistic “as or more extreme” is truly
null
Power and Sample Size
Calculations are Hard
• Need to specify:
–
–
–
–
–
a (Type I error rate, false positives) or FDR
s (stdev: will be sample- and gene-specific)
Effect size (how do we estimate?)
Power (1-b, b=Type II error rate)
Sample Size
• Some papers:
– Mueller, Parmigiani et al. JASA (2004)
– Rich Simon’s group Biostatistics (2005)
– Tibshirani. A simple method for assessing sample
sizes in microarray experiments. BMC
Bioinformatics. 2006 Mar 2;7:106.
Beyond Individual Genes:
Functional Gene Groups
• Borrow statistical
power across entire
dataset
• Integrate preexisting
biological knowledge
• Beyond threshold
enrichment
Functional Annotation of Lists of Genes
KEGG
PFAM
SWISS-PROT
GO
DRAGON
DAVID/EASE
MatchMiner
BioConductor (R)
Gene Cross-Referencing and Gene
Annotation Tools In BioConductor
(in the R statistical language)
annotate package
Microarray-specific “metadata” packages
DB-specific “metadata” packages
AnnBuilder package
Annotation Tools In BioConductor:
annotate package
Functions for accessing data in metadata packages.
Functions for accessing NCBI databases.
Functions for assembling HTML tables.
Annotation Tools In BioConductor:
Annotation for Commercial Microarrays
Array-specific metadata packages
Annotation Tools In BioConductor:
Functional Annotation with other DB’s
GO metadata package
Annotation Tools In BioConductor:
Functional Annotation with other DB’s
KEGG metadata package
Functional Gene Subgroups within An Experiment
0
.
3
Swiss-Prot
0
.
2
EXP#1
0
.
1
PFAM
0
.
0
4
0
.
3
2
0
2
4
2
0
2
4
2
0
2
4
0
.
2
0
.
1
KEGG
0
.
0
4
0
.
3
0
.
2
0
.
1
0
.
0
4
Threshold Enrichment: One Way of Assessing
Differential Expression of Functional Gene Groups
Threshold Enrichment: One Way of Assessing
Differential Expression of Functional Gene Groups
Statistics for Analysis of Differential
Expression of Gene Subgroups
Is THIS …
0
.
3
… Different
from THIS?
0
.
2
0
.
1
0
.
0
4
2
0
2
4
Over-Expression of a Group of Functionally Related Genes
p<7.42e-08
T statistic
Is THIS …
0
.
3
… Different
from THIS?
0
.
2
Conceptually Distinct from Threshold
Enrichment and the Hypergeometric test!
0
.
1
0
.
0
4
Statistical Tests:
2
0
2
4
c2
Kolmogorov-Smirnov
Product of Probabilities
GSEA
PAGE
geneSetTest (Wilcoxon rank sum)
c2
All Genes
c2 is the sum of D values where:
(O-E)2
______
D=
E
Subset of Interest
histogram
bins
E
O
Kolmogorov-Smirnov
All Genes
Subset of Interest
Product of Individual Probabilities
All Genes
Subset of Interest
What shape/type of distributions would each of
these tests be sensitive to?
All statistics
Statistics from
gene subgroup
Gene Set Enrichment Analysis (GSEA)
Subramanian et al, 2005 PNAS
Gene Set Enrichment Analysis (GSEA)
Gene Set Enrichment Analysis (GSEA)
Gene Set Enrichment Analysis (GSEA)
Gene Set Enrichment Analysis (GSEA)
Gene Set Enrichment Analysis (GSEA)
Parametric Analysis of Gene Set Enrichment (PAGE)
Kim et al, 2005 BMC Bioinformatics
Parametric Analysis of Gene Set Enrichment (PAGE)
Z =
Sm-m
0.5
s/m
A simple method in Bioconductor
geneSetTest(limma)
Test whether a set of genes is enriched for differential expression.
Usage:
geneSetTest(selected,statistics,alternative="mi
xed",type="auto",ranks.only=TRUE,nsim=10000)
The test statistic used for the gene-set-test is the mean of the statistics in the set. If
ranks.only is TRUE the only the ranks of the statistics are used. In this case the pvalue is obtained from a Wilcoxon test. If ranks.only is FALSE, then the p-value is
obtained by simulation using nsim random selected sets of genes.
Wilcoxon test
0
.
3
0
.
2
0
.
1
0
.
0
4
2
0
2
4
Analysis of Gene Networks
Large Protein Interaction Network
Network Regulated
in Sample #1
Large Protein Interaction Network
Network Regulated
in Sample #2
Network Regulated
in Sample #1
Large Protein Interaction Network
Network Regulated
in Sample #3
Network Regulated
in Sample #1
Network Regulated
in Sample #2
Large Protein Interaction Network
Network
of Interest
Network Regulated
in Sample #1
Network Regulated
in Sample #2
Network Regulated
in Sample #3
Additional Notes on
Experimental Design
Replicates in a mouse model:
Dissection of
tissue
Biological Replicates
RNA
Isolation
Amplification
Technical Replicates
Probe
labelling
Hybridization
Common question in
experimental design
• Should I pool mRNA samples across
subjects in an effort to reduce the effect
of biological variability (or cost)?
Two simple designs
• The following two designs have roughly
the same cost:
– 3 individuals, 3 arrays
– Pool of three individuals, 3 technical
replicates
• To a statistician the second design
seems obviously worse. But, I found it
hard to convince many biologist of this.
– 3 pools of 3 animals on individual arrays?
Cons of Pooling Everything
• You can not measure within class variation
• Therefore, no population inference possible
• Mathematical averaging is an alternative way of
reducing variance.
• Pooling may have non-linear effects
• You can not take the log before you average:
E[log(X+Y)] ≠ E[log(X)] + E[log(Y)]
• You can not detect outliers
*If the measurements are independent and identically distributed
Cons specific to microarrays
• Different genes have dramatically
different biological variances.
• Not measuring this variance will result in
genes with larger biological variance
having a better chance of being
considered more important
Higher variance: larger fold change
We compute fold change for each gene (Y axis)
From 12 individuals we estimate gene specific variance (X axis)
If we pool we never see this variance.
Remember this?
Useful Books:
“Statistical analysis of gene expression
microarray data”
– Speed.
“Analysis of gene expression data”
– Parmigianni
“Bioinformatics and computational biology
solutions using R”
- Irizarry