Introduction to R

Transcript Introduction to R

Microarray Data Analysis (I)
17/06/2010
Daniele Merico, PhD
Microarray Data Analysis (I)
Part 1 / 5
Introduction
A brief history of life
Microarray Chronology
4000
3500
3000
2500
2000
1500
1000
500
Number of PubMed publications by year
Using a query containing keywords such as microarray, transcriptomics, etc..
2009
2008
2007
2006
2005
2004
2003
2002
2001
2000
1999
1998
1997
1996
1995
1994
0
Microarray Chronology
4000
3500
3000
2500
2000
First Microarray
Publication [1]
1500
45 Arabidopsis
genes
1000
500
[1] Schena M, Shalon D, Davis RW, Brown PO.; Quantitative monitoring of gene expression
patterns with a complementary DNA microarray.; Science. 1995 Oct 20;270(5235):467-70.
2009
2008
2007
2006
2005
2004
2003
2002
2001
2000
1999
1998
1997
1996
1995
1994
0
Microarray Chronology
4000
3500
3000
2500
2000
Full Yeast
Genome on
microarray [2]
1500
1000
500
2009
2008
2007
2006
2005
2004
2003
2002
2001
2000
1999
1998
1997
1996
1995
1994
0
[2] Lashkari DA, DeRisi JL, McCusker JH, Namath AF, Gentile C, Hwang SY, Brown PO, Davis RW.
Yeast microarrays for genome wide parallel genetic and gene expression analysis. Proc Natl Acad
Sci U S A. 1997 Nov 25;94(24):13057-62
Microarray Chronology
4000
3500
3000
2500
Gene Ontology
Consortium.
2000
Hierarchical
Clustering and
heat-maps [3]
1500
1000
500
[3] M. B. Eisen, P. T. Spellman, P. O. Brown and D. Botstein, Cluster analysis and display of
genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95 (1998), pp. 14863–14868.
2009
2008
2007
2006
2005
2004
2003
2002
2001
2000
1999
1998
1997
1996
1995
1994
0
Microarray Chronology
4000
3500
3000
2500
2000
1500
Gene Ontology
enrichment,
1000
(hypergeometric)
500
2009
2008
2007
2006
2005
2004
2003
2002
2001
2000
1999
1998
1997
1996
1995
1994
0
Microarray Chronology
4000
Gene expression
profiling on
interaction
networks [4]
3500
3000
2500
2000
1500
1000
500
[4] Discovering regulatory and signalling circuits in molecular interaction networks. Ideker T, Ozier
O, Schwikowski B, Siegel AF. Bioinformatics. 2002;18 Suppl 1:S233-40.
2009
2008
2007
2006
2005
2004
2003
2002
2001
2000
1999
1998
1997
1996
1995
1994
0
Microarray Chronology
4000
Full Human
Genome on
microarray
3500
3000
Affymetrix HGU133 plus 2.0
2500
2000
1500
1000
500
2009
2008
2007
2006
2005
2004
2003
2002
2001
2000
1999
1998
1997
1996
1995
1994
0
Microarray Chronology
4000
3500
3000
2500
GSEA
Enrichment [5]
2000
1500
Gene Ontology,
Pathways, other
gene-sets
1000
500
2009
2008
2007
2006
2005
2004
2003
2002
2001
2000
1999
1998
1997
1996
1995
1994
0
[5] Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A,
Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledge-based
approach for interpreting genome-wide expression profiles.
Proc Natl Acad Sci U S A. 2005 Oct 25;102(43):15545-50
Microarray Data Analysis (I)
Part 2 / 5
Experimental Design
Overall Analysis Workflow
Experimental Design
– Define your biological question
• E.g. how is transcription affected by estrogen
stimulation of breast cancer cells?
– Define classes, i.e. biological conditions/systems
you intend to sample
• E.g.: cerebellum from alzheimer patients, HEK cell line
treated with miR-21 antogomir, primary fibroblasts
after 2 days of culture
– Factors can be combined to define classes
• E.g. trated vs untreated, 12 vs 24 hours
– Plan how many replicates you need
Replicates
– You need replicates to estimate the variability of your
measurements
– Without variability estimates you cannot make
statistically significant observations
– Biological Replicate
• Make a new RNA sample from an independent source
(e.g. new animal, or parallel cell line culture)
– Technical Replicate
• Use the same RNA sample for a new hybridization
– With Affymetrix microarrays, you will need only
biological replicates
Experimental Design: Tissue Specificity
Class Neuron
Class Gland
Class Bone
Class Blood
Affy ID
Neuron
Gland
Bone Osteoblast
Blood White Cell
98063_at
1138.4
127.1
54.3
26.0
100080_at
17.0
592.5
27.2
372.8
103012_at
672.4
792.9
510.9
850.3
…
Expression Matrix
Expression
Signal
Experimental Design: Disease
Disease
state
Wild Type
Heart Disease
Time
Experimental
Design Matrix
(2 Factors)
08 w
16 w
24 w
Heart Disease
3
3
3
Wild Type
3
3
3
Number of
replicates
Experimental Design: Important Points
– Clearly define your biological question(s)
– Replicate experiments
• Biological variability must be factored-in through replication:
repeat the experiment using different biological samples
– Use clear and balanced designs
• Use the same number of replicates in every class
– Minimize experimental variability
• Experimental variability arises from different platforms, different
protocols, different experimenters, different days, etc…
• Minimize all these factors
– Control the assumptions of your design
• Many studies on human patients assume two-class designs; however, the
patients may exhibit heterogeneous phenotypes (e.g. different cancer
stages) and hence different transcriptomes
• The Explorative Analysis might reveal a different picture than you
expected
Overall Analysis Workflow
Define the
experimental design
Generate the
expression signals
Explorative Analysis
and Pre-processing
Select Diff. Genes
Group into Clusters
Identify the
Functional Groups
In this lesson…
Define the
experimental design
Generate the
expression signals
Explorative Analysis
and Pre-processing
Select Diff. Genes
Group into Clusters
Identify the
Functional Groups
Microarray Data Analysis (I)
Part 3 / 5
Affymetrix microarrays
CEL files
rma signals and MAS-5 detection
Oligonucleotide Microarray
Oligonucleotide Microarray Technology
A transcript is recognized by 11-20 probe
pairs 25 nucleotides long
Raw fluorescence image
Perfect match (PM) and mismatch (MM)
Primary Signals
– After essential image processing, we have signals for every probe
We need to integrate those signals into transcript/gene signals
– Different techniques are available; two of the most popular ones are:
– MAS-5 detection p-value
• p-value on a test of presence/absence of the transcript
• relies on perfect match (PM) and mismatch (MM) probe signals
• used for tissue-specificity or for pre-filtering
– rma signal
• continuous signal
• relies only on perfect match probe signal
• pre-normalized (no further normalization required)
• used for differential expression (i.e. comparing two or more classes)
Enter the Matrix…
Linear rma signal matrix
.CEL Files
– .CEL files have are produced by the Affymetrix proprietary
software after essential image processing
– You can download CEL files from databases (e.g. GEO,
Gene Expression Omnibus) and analyze them using rma
(available within a R / Bioconductor analysis package)
– Otherwise, pre-processed data are available in SOFT or
MINiML textual formats
– If Affymetrix data has not been processed using rma, and
.CEL files are available, it’s better if you download .CEL files
and process them yourself
– .CEL are also valuable to compute A/P counts
Importing .CEL files
– Download the CEL files from the GEO database
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE11352
– Decompress the files (use winRAR, 7zip, …)
– Open R
– Install several analysis packages from Bioconductor
(see the last part for a detailed description of Bioconductor)
source ("http://bioconductor.org/biocLite.R")
biocLite ()
– Set working directory
– Load library
library (affy)
Importing .CEL files
– This data-set is composed by 6 classes and 18 samples (3
biological replicates per class)
– The major factor is estrogen treatment or no treatment of
MCF7 cell line (originally derived from breast cancer)
– Treated and untreated cells are sampled at three different
time-points: 12, 24, 48 hours
treated
untreated
12h
3
3
24h
3
3
48h
3
3
Importing .CEL files
– Sample annotation (class, replicate#, etc…)
cv_time.fct <- factor (c (
rep ("12h", 6),
rep ("24h", 6),
rep ("48h", 6)
))
cv_trtm.fct <- factor (rep (c (
rep ("trt", 3),
rep ("unt", 3)),
3))
sample.names <- paste (
cv_time.fct,
cv_trtm.fct,
rep (1: 3, 6),
sep = " _"
)
Importing .CEL files
– Sample annotation (class, replicate#, etc…)
cv_time.fct <- factor (c (
rep ("12h", 6),
rep ("24h", 6),
rep ("48h", 6)
))
cv_trtm.fct <- factor (rep (c (
rep ("trt", 3),
rep ("unt", 3)),
3))
sample.names <- paste (
cv_time.fct,
cv_trtm.fct,
rep (1: 3, 6),
sep = "_"
)
Importing .CEL files
es.df <- data.frame (
Time = cv_time.fct,
Trtm = cv_trtm.fct,
row.names = sample.names
)
es.mData <- data.frame (
labelDescription = c (
"Culture time",
"Estrogen treatment")
)
# class: phenoData
es.pData <- new ("AnnotatedDataFrame",
data = es.df,
varMetadata = es.mData
)
rm (es.df, es.mData)
Importing .CEL files
– Access phenodata
pData (es.pData)
– Read AffyBatch
es.affy <- read.affybatch (
filename = list.files (),
phenoData = es.pData
)
– AffyBatch object class stores all the raw data in cell files
Calculate rma signals
# rma signals, ExpressionSet object
es.eSet <- rma (es.affy)
# EXPRESSION MATRIX
# log2 scale (default)
es_log2.mx <- exprs (es.eSet)
# linear scale
es_lin.mx <- 2 ^ es_log2.mx
Calculate MAS-5 Detection
– MAS-5 detection output will be in the form of “calls”:
A (absent), M (marginal), P (present)
• Only a few probe-sets will be M, so it’s common to work
on A and P (e.g. A or P %)
es_calls.eSet <- mas5calls.AffyBatch (es.affy)
es_calls.mx <- exprs (es_calls.eSet)
Microarray Data Analysis (I)
Part 4 / 5
Explorative Analysis
Quality Control
Pre-processing
Explorative Analysis Concepts
– Sample signal distribution
• You need to check if signals, on a by-sample basis, are
affected by global biases
• Such biases can be corrected, I they are not too extreme
• Extreme or persisting biases are sign of sample preparation
or hybridization problems
– Similarity relations within and between classes
• You need to check if relations between samples satisfy the
assumptions behind your experimental design
• Samples from the same class are expected to be more
similar than samples from different classes
Signal Distribution
– Use boxplots on log2-scale signals
boxplot (
as.data.frame (es_log2.mx),
xlab = "", ylab = "Log2 rma signal", las = 2,
main = "Sample Distributions")
– If you have used rma, you will notice that boxplots are
extremely similar
– That’s because rma incorporates a normalization step
• Normalization is a pre-processing technique that corrects for
sample signal distribution differences among samples
A/P Detection Rates
– P (or A) rates are independent of rma and its
normalization
– The index should be pretty similar across samples
– If there are strong differences, biases are likely impossible
to remove even using normalization
es_P_rate.nv <- apply (es_calls.mx == "P", 2, sum) /
nrow (es_calls.mx)
barplot (
es_P_rate.nv,
xlab = "", ylab = "% P probe-sets",
main = "P Rates", las = 2)
Normalizing Signal Distribution
– If you are not using Affymetrix + rma, you will
have to take care of normalization yourself
– We don’t have enough time here to review the
different procedures
– Quantile normalization is often regarded as a
standard
– If you are using Illumina microarrays, follow the
instructions in the R / Bioconductor package lumi
Methods to explore
relations among samples
– Hierarchical clustering
• Groups samples into hierarchical groups, using a
similarity measure (comparison of gene expression
levels of samples)
• Hierarchical groups are visualized using a dendrogram
– PCA
• Projects samples into a new space, by capturing
common patterns of gene expression
• Usually a samples are visualized in 2D space (biplot)
Hierarchical Clustering
1. Define a similarity metric
2. Compute similarity among samples
•
Recommended:
(1 - pearson correlation) on log2 rma signals
3. Run hierarchical clustering algorithm
4. Plot dendrogram
5. Analyze if sample groups correctly reproduce
classes, according to the experimental design
Hierarchical Clustering: Dendrogram
– Hierarchical clustering enables to summarize the similarity
structure of the samples
• which samples are most similar and can be grouped together
• what are the similarity relations between such groups
The distance is
proportional to dissimilarity
(short  similar)
This diagram is called
dendrogram
Hierarchical groups
Hierarchical Clustering: Dendrogram
Interpretation with respect to a simplified data space
Gene 1
D
E
B
A
B
C
D
E
A
C
Gene 2
Dashed lines represent
distances in the original data
space
Hierarchical Clustering: Case Study 1
–Hierarchical clustering for the estrogen treatment data-set:
classes are well separated
Please see PCA Case Study 1 for
more insight on this data-set
and a comparison of clustering
and PCA results
treated
untreated
12h
3
3
24h
3
3
48h
3
3
Hierarchical Clustering: Case Study 2
– Hierarchical clustering can reveal sample anomalies
Somebody had fun the night
before doing this experiment…
Hierarchical Clustering: Case Study 3
– Hierarchical clustering can reveal poor separation
between classes
Hierarchical Clustering: R Code
# Following up the Estrogen data-set analysis
f.distCol.Pearson <- function (x.mx)
{
dist.mx <- 1 - cor (x.mx, method = "pearson")
return (as.dist (dist.mx))
}
es.hclust <- hclust (
d = f.distCol.Pearson (es_log2.mx))
plot (
es.hclust,
main = "Hierarchical Clustering",
xlab = "1 - Pearson Correlation")
PCA
– PCA identifies new dimensions in the data, which are
combinations of existing dimensions
– The new dimensions are ranked according to the amount
of variability (often informally interpreted as information)
– Since only a few new dimensions are used to explore the
data (most often only 1 or 2), this means we have
“projected” the original data-space into a smaller one
(data dimensionality reduction)
PCA: Projection
The objects in
a 3D space
Projection to
2D space
PCA: Projection
– In microarray explorative analysis
• Samples are treated as objects
• Genes/transcripts are treated as dimensions
samples
genes
samples
Principal Components
(“meta-genes”)
PCA: Interpretation
– Here we focus on the PCA of samples as objects,
although PCA of genes as objects can be done as
well (useful to identify functional groups)
– The new dimensions are interpreted as
meta-genes
i.e. the distilled expression patterns across
samples of all genes in the data-set
• PCs are all orthogonal, i.e. they have null correlation
PCA: Interpretation
– If the factors in the experimental design have a strong
effect of gene expression, PCA is expected to reveal
groups of samples corresponding to classes
– Different PC often “explain” different factors of the
experimental design
– Note:
It is important to standardize the data before running PCA (refer
to the R code)
Otherwise, the first principal component will unlikely account for
variability pertinent to the experimental design
PCA: Biplot Visualization
samples
PC1
PC2
Principal Components
Class 1
samples
Class 2
samples
PCA: PC selection
– It is often common to look at the first two
principal components, and figure out if they make
sense biologically
– Here we take a more nuanced approach
• We use the comparison between eigenvalues from real
data PCA and randomized data PCA to evaluate which
components are “significant”
• There is really no space to delve the mechanics of this,
just use the R functions provided later
PCA: note on PC selection
Eigenvalue Diagram
Real Data Eigenvalue
Random Control
Eigenvalue
Principal Component #
PCA: note on PC selection
Eigenvalue Diagram
Highly
Significant
(PC 1, 2)
Real Data Eigenvalue
Random Control
Eigenvalue
Borderline
(PC 3, 4)
Not significant
(PC ≥ 5)
Principal Component #
PCA: Case Study 1
– The estrogen treatment of MCF-7 (GSE11352) breast cancer cell
line is an example of PCA results that fit very well with the
design expectations
Selected for
visualization
PCA: Case Study 1
– The estrogen treatment of MCF-7 (GSE11352) breast cancer cell
line is an example of PCA results that fit pretty well with the
design expectations
Treatment Separation
Untreated
Treated
PCA: Case Study 1
– Looking at the time trajectories, we notice that both treated
and untreated undergo changes
– It seems possible to separate common time-dependent changes
and treatment-dependent changes
48h
48h
24h
24h
12h
12h
PCA: Case Study 1
– Classes form distinct sample groups: intra-class variability is less
than inter-class variability (variability is proportional to distance
among samples)
PCA: Case Study 1
Comparison to Hierarchical Clustering
– Although single classes appear well separated both in clustering and
PCA, the PCA biplot is more informative
– This is usually true with designs that have two orthogonal factors, or
with any type of time series
PCA: Case Study 2
– This data-set is composed of colonic mucosa samples from
control and early onset colorectal cancer patients
– Unlike cell lines (such as MCF-7), surgical samples tend to be
more noisy due to tissue heterogeneity and contaminants, and
also exhibit population variability
– The PCA reveals that there is a good separation between
cancers and controls, however cancers display a relevant
internal structure
– The analytical methods you will learn in this course work
perfectly for data-sets like case 1, but they encounter
shortcomings for more complex data-sets like this one
• Consulting an expert biostatistician is recommended in such cases
PCA: Case Study 2
– PC 1, 2, 3 are likely significant; since exploring PC-3 we could
scarcely make sense of the patterns, only PC 1 and 2 are depicted
Selected for
visualization
Control
Colon
Cancer
PCA: Case Study 2
Notice the increased
variability of the cancer class,
which is well expected owing
to genetic instability, and the
presence of different cancer
invasion stages and severity
degrees in the cancer patient
cohort
PCA: Case Study 3
– GSE8866 data-set is composed of IMR-32 neuroblastoma
cell line response to the silencing of oncogenes Cyclin D1
(CCND1) and CDK4
– Two major groups are expected:
• Wild types and mock-transfected
• CCND1 and CDK4-silenced (possible internal structure)
– Whereas PC-2 exactly explains this pattern, PC-1 explains
the differentiation of single biological replicates from
every class from the bulk of the samples
• This is likely due to some systematic experimental bias
PCA: Case Study 3
– In such cases, it is recommended either to
• revisit the experiment, and if the experimental bias can be found
(e.g. a different protocol), remove the affected samples
• Use a paired test when computing the differential statistics (this
will not be covered in this course, you may have to consult an
expert biostatistician)
– It is not recommended to proceed with the analysis as
nothing had happened
– Don’t remove samples if you don’t have a strong
experimental / biological rationale! I.e. never remove
samples just because “they don’t look good”
PCA: Case Study 3
PCA: Case Study 3
Mock-transfected
neuroblastoma
WT neuroblastoma
CCND1 silenced
CDK4 silenced
PCA: Case Study 3
Mock-transfected
neuroblastoma
WT neuroblastoma
CCND1 silenced
CDK4 silenced
Single samples that
may be affected by
experimental biases
PCA: Case Study 4
– GSE4498 samples were generated by fiberoptic
bronchoscopy and brushing of human small airway
bronchial epithelium to evaluate differences between
phenotypically normal smokers and non-smokers
– This data-set is problematic: PC-1, which is the most
informative component, displays lots of confusion
between the two classes
• Only a subset of smokers are separated from non-smokers
– In such cases, it is important to explore the data-set also
with other methods (hierarchical clustering), check if
confusion persists, revisit the design and consider a paired
test for differentiality
• Again, consulting an expert biostatistician is recommended
PCA: Case Study 4
Note: PC-2 is not very significant, but was
used anyway; PCA results interpretation
should anyway focus on PC-1
Significant
Borderline
Smokers (SK) and Non-smokers (NS)
are not well distinguished
PCA: beyond biplots
– A big limitation in the visualization of PCA results is the
limitation of the human observer to 2D spaces (until we
start using holograms, like in Star Trek, to get 3D graphics)
– In relatively simple microarray experiments, with a few
experimental factors, only 1 or 2 PCs are really significant,
hence a single biplot often suffices
– In case you have more than 2 significant PCs:
• Explore combinations of 2 PCs at a time
• Use a 3D plot; since 3D plots end up being 2D on screens and
paper, tune the projection parameters so that the most significant
component is least affected by the projection
• Other more advanced applications
PCA: R Code
– Running PCA and eigenvalue selection procedure
library (ade4)
es.pca <- dudi.pca (
df = t (es_log2.mx),
nf = ncol (es_log2.mx),
scannf = FALSE
)
source ("f.EigenRand")
es_eig.ls <- f.EigenRand (
x.mx
= es_log2.mx,
eig.nv
= es.pca$eig,
title.ch = "Estrogen Treatment PCA",
perm.n
= 2
)
PCA: R Code
– PCA Biplot: color by class data
pca_color.df <- data.frame (
col = c (
rep ("orange",
3),
rep ("lightgreen",3),
rep ("red",
3),
rep ("green2",
3),
rep ("brown",
3),
rep ("green3",
3)
),
stringsAsFactors = F)
rownames (pca_color.df) <- colnames (es_log2.mx)
PCA: R Code
– PCA Biplot: plot
source ("f.PCA_plot")
f.PCA_plot (
dudi.pca
title.ch
cp.nv
col.df
text.bn
)
=
=
=
=
=
es.pca,
"Estrogen Treatment PCA",
c (1, 2),
pca_color.df,
F
# text.bn set to F does not print sample labels
– The functions f.PCA_plot and f.EigenRand are available on the
course site
PCA and Hierarchical Clustering
Further Readings
– Patrik D’haeseleer
How does gene expression clustering work?
Nature Biotechnology, 2005
– Markus Ringnér
What is principal component analysis?
Nature Biotechnology, 2008
(PDF available on the course site)
Microarray Data Analysis (I)
Part 5 / 5
Database and Analysis Resources
GEO (Gene Expression Omnibus)
Bioconductor
Bioconductor
www.bioconductor.org/
– Open source project for the analysis of genomic
data: collection of R libraries developed by
research groups worldwide
• Analysis tools (statistical models and graphics)
• Annotation data
– Documentation
http://www.bioconductor.org/docs
Follow this to browse
available analysis
packages
Follow this to browse
available annotation
packages
Batch Install Bioconductor Packages
– Core packages
source ("http://bioconductor.org/biocLite.R")
biocLite ()
– Packages of choice
source ("http://bioconductor.org/biocLite.R")
biocLite ("package")
– After installing, remember you have to load the package using
library (package)
Gene Expression Omnibus (GEO)
http://www.ncbi.nlm.nih.gov/geo/
NCBI repository of microarray gene expression experiments
Use this to search data-sets
- Platform ID
- Experimental factor or
biological sample type
keyword
Gene Expression Omnibus (GEO)
Gene Expression Omnibus (GEO)
Essential information
on samples
Pre-processed data
CEL Files if available
Gene Expression Omnibus (GEO)
Most common Affymetrix platforms identifiers
Transcript Arrays
– Human
• GPL570 HG-U133 plus 2.0 (full genome coverage)
• GPL96 HG-U133A
(partial genome coverage)
– Mouse
• GPL1261 Mouse430_2
• GPL339 MOE430A
(full genome coverage)
(partial genome coverage)
Querying GEO using GEOquery
– The R / Bioconductor package GEOquery can be
used to import pre-processed tables form GEO
directly in the ExpressionSet object format
http://www.bioconductor.org/packages/1.8/bioc/html/GEOquery.html
– If you find GEOquery too hard to use, you can still download the preprocessed data in Series matrix file (txt) format; this is composed of:
• Several lines of meta-data
• The expression matrix
the simplest way to import it is to remove the metadata and then read
the matrix in R using read.table

Introduction to R

Transcript Introduction to R

Directory