Affymetrix and BioConductor
Download
Report
Transcript Affymetrix and BioConductor
Affymetrix and BioConductor
Johannes Freudenberg
Cincinnati Children’s Hospital Medical Center
[email protected]
Overview
Affymetrix' high-density oligonucleotide microarrays
GeneChip® Technology
Terminology
Spotted vs. Affymetrix Arrays
Preprocessing Affymetrix MA data using BioConductor
Overview
Background correction
Normalization
PM correction
Summarization
Many preprocessing strategies available – which one
to use?
Affymetrix & Limma
7/18/2015
Johannes Freudenberg, CCHMC
2
Affymetrix GeneChip®
(=0.5 inch)
Dudoit et al., 2002
7/18/2015
Johannes Freudenberg, CCHMC
3
Affymetrix GeneChip® technology
Lipshutz et al., 1999
7/18/2015
Johannes Freudenberg, CCHMC
4
Affymetrix GeneChip® technology (2)
Original Publication
(Lockhart et al., 1996)
Affymetrix (http://www.affymetrix.com/technology/ge_analysis/index.affx)
7/18/2015
Johannes Freudenberg, CCHMC
5
Affymetrix GeneChip® technology (3)
Lipshutz et al., 1999
7/18/2015
Johannes Freudenberg, CCHMC
6
Terminology
Probe: an oligonucleotide of 25 base-pairs, i.e., a 25-mer.
Perfect match (PM): A 25-mer complementary to a reference
sequence of interest (gene, ).
Mismatch (MM): same as PM but with base change for the
middle (13th) base (transversion purine <-> pyrimidine, G <>C, A <->T) .
The purpose of the MM probe design is to measure non-specific
binding and background noise.
Probe-pair: a (PM,MM) pair.
Probe set: a collection of probe-pairs (11 to 20) related to a
common gene or EST.
AffyID: an identifier for a probe-pair set (eg. “A28102_at”)
CEL file: text (or binary) file containing raw probe intensities for
a single chip created by MAS (GCOS) software
MAS: Microarray Suite – Affymetrix software package
GCOS: GeneChip operating software – new Affymetrix software
Adapted from
Dudoit et al., 2002
7/18/2015
Johannes Freudenberg, CCHMC
7
Affymetrix vs. other technologies
Feature
Affymetrix
Other technologies
Probe synthesis
- In-situ synthesized
- Spotted cDNA
- Oligos attached to beads
Probes length
- 25-mers
- 50-mers
- Varying lengths
Probe density
- High density
- Low density
Probes per gene
- 11-20 different probe pairs
- One probe
- Multiple identical probes
Labeling
- Biotin
- Cy3, Cy5, etc.
Hybridization
- One target per spot
- Competitive
7/18/2015
Johannes Freudenberg, CCHMC
8
What’s the evidence?
Expectation/ Hope(?):
measured hybridization
intensity proportional
to # of mRNA
transcripts
Spike-in experiment
seems to confirm this
However
Only ten genes
Selection of these
genes?
Lockhart et al., 1996
7/18/2015
Johannes Freudenberg, CCHMC
9
Affymetrix chips – Summary
Biological
Sample
Extracted
mRNA
Labeled
mRNA
Hybridization,
scanning & image
processing
Data
(pre-)processing
CEL files
7/18/2015
Johannes Freudenberg, CCHMC
10
Preprocessing for Affymetrix GeneChips®
CEL files
within-chip
cross-chip
sequence
specific
background
correction
within-probe set
aggregation of
intensity values
Two “standard” methods
MAS 5.0 (now GCOS/GDAS) by Affymetrix
RMA by Speed group (UC Berkeley)
7/18/2015
Johannes Freudenberg, CCHMC
11
affy – BioConductor library
“extensible, interactive environment for data
analysis and exploration of Affymetrix
oligonucleotide array probe level data”
(www.bioconductor.org)
Contains functions to
Load
Store
Plot, and
Preprocess Affymetrix microarray data
Many other related packages
affycomp, affydata, affyPLM, annaffy, gcrma,
makecdfenv, matchprobes, simpleaffy, vsn, …
7/18/2015
Johannes Freudenberg, CCHMC
12
affy – BioConductor library
… if you haven‘t already done that
Download BioConductor install script
source("http://www.bioconductor.org/biocLite.R")
(or source("http://www.bioconductor.org/getBioC.R"))
Run the script
biocLite() (or getBioC())
Load the “affy” library
library(affy)
Load our example data (3+3 subset from
Bhattacharjee et al., 2001)
Run the following script to download the six CEL files (*)
source("http://eh3.uc.edu/affy/downloadAffy.R")
Load CEL files into R
harvard.rawData <- ReadAffy()
If you prefer interactive
harvard.rawData <- ReadAffy(widget = T)
7/18/2015
Johannes Freudenberg, CCHMC
(*)
Note: You may
have to change your
working directory to a
path where you have
writing permission
13
What is an AffyBatch object?
Take a first look at the experiment data
harvard.rawData
Experiment data is stored in an AffyBatch
object
slotNames(harvard.rawData)
"cdfName" – GeneChip version
"nrow", "ncol" – size of the chip (usually 640x640)
"exprs" – expression matrix, contains all PM and
MM intensities of the experiment
"phenoData" – experiment annotation
"description", "annotation" – more annotation slots
…
7/18/2015
Johannes Freudenberg, CCHMC
14
Accessing an AffyBatch object
Extracting
the expression matrix
(*)
Don’t try this at home…
exprs(harvard.rawData)(*) or
intensity(harvard.rawData)(*)
PM values
pm(harvard.rawData)[1:10,]
MM values
mm(harvard.rawData)[1:10,]
probe names
probeNames(harvard.rawData)[1:10]
gene names
geneNames(harvard.rawData)[1:10]
a probe set
probeset(harvard.rawData, "100_g_at")
the type of GeneChip
cdfName(harvard.rawData)
…
7/18/2015
Johannes Freudenberg, CCHMC
15
How does my data look? –
Diagnostic plot I
Plot an image of an array
image(harvard.rawData[,1])
7/18/2015
Johannes Freudenberg, CCHMC
16
How does my data look? –
Diagnostic plot II
Plot the intensity distribution
hist(harvard.rawData, main = "Harvard data")
7/18/2015
Johannes Freudenberg, CCHMC
17
How does my data look? –
Diagnostic plot III
MvA plots
MAplot(harvard.rawData)
7/18/2015
Johannes Freudenberg, CCHMC
18
How does my data look?
Plot a probe set
plot(probeset(harvard.rawData,
geneNames(harvard.rawData)[1])[[1]])
7/18/2015
Johannes Freudenberg, CCHMC
19
How does my data look?
Plot a probe set
par(mfrow = c(2, 3))
barplot(probeset(harvard.rawData,geneNames(harvard.rawData)[1])[[1]])
7/18/2015
Johannes Freudenberg, CCHMC
20
MAS 5.0 - Background correction
Intended to correct for optical
effects
Divide array into K zones
(default K = 16)
Lowest 2% of the intensities in
zone k are used to compute
background bZk and noise nZk of
zone k
Background b(x, y) of cell (x, y)
is the weighted sum of all bZk
Noise n(x,y) is computed
likewise
Background corrected intensity
A(x,y) = max(I(x,y) – b(x,y),
0.5*n(x,y))
7/18/2015
Johannes Freudenberg, CCHMC
(Affymetrix, 2002)
21
MAS 5.0 - Background correction
harvard.mas5BG <- bg.correct(harvard.rawData,"mas")
hist(harvard.mas5BG, main = "Harvard data after MAS
5.0 BG correction")
7/18/2015
Johannes Freudenberg, CCHMC
22
MAS 5.0 - Background correction
par(mfrow = c(2,3))
MAplot(harvard.mas5BG)
7/18/2015
Johannes Freudenberg, CCHMC
23
RMA Background correction
S observed PM intensity
Model: S sum of
“true” signal X and
background signal Y
S = X + Y, where X ~ Exp(), Y~ N(,²)
independent random variables
X
Y
+
S
=
(Speed, 2002)
7/18/2015
Johannes Freudenberg, CCHMC
24
RMA Background correction
E(X | S = s)
Correct for
background
by replacing
S with
E(X | S = s)
To do that
estimate
, ,
from data
S
Let a = s - - ²
b=
a s a a
sa
E(X| S s) a b
/
1
b
7/18/2015
b b
Johannes Freudenberg, CCHMC
b
25
RMA - Background correction
harvard.rmaBG <- bg.correct(harvard.rawData,"rma")
hist(harvard.rmaBG, main = "Harvard data after RMA
BG correction")
7/18/2015
Johannes Freudenberg, CCHMC
26
RMA - Background correction
par(mfrow = c(2,3))
MAplot(harvard.rmaBG)
7/18/2015
Johannes Freudenberg, CCHMC
27
Why should we normalize?
… to remove chip effects
7/18/2015
Johannes Freudenberg, CCHMC
28
MAS 5.0 – Global Constant Normalization
Global constant normalization: Statistical parameters are
used as global (= per-chip) scaling factor, such as:
Sum
Median, Quantiles/Percentiles
Mean (also trimmed mean, asymmetric trimmed
mean)
Normalization transforms data – afterwards parameter is
equal for all chips
Intensity independent normalization(!)
MAS 5.0: 2% trimmed mean m (default = 500) on the
linear scale (as opposed to the log scale)
Note: In MAS 5.0, this step is done after summarization
7/18/2015
Johannes Freudenberg, CCHMC
29
MAS 5.0 – Global Constant Normalization
harvard.mas5norm <- normalize(harvard.mas5BG, "constant")
hist(harvard.mas5norm, main = "Harvard data after MAS 5.0
BG & normalization")
7/18/2015
Johannes Freudenberg, CCHMC
30
MAS 5.0 – Global Constant Normalization
par(mfrow = c(2,3))
MAplot(harvard.mas5norm)
7/18/2015
Johannes Freudenberg, CCHMC
31
RMA – Quantile normalization
Assumption: True intensity
distributions identical in
all replicate samples
Then all points lie on the
diagonal in a Q-Q plot
Works the same way
in higher dimensions
If observed intensity
distribution are different
“force” them to be equal
7/18/2015
Johannes Freudenberg, CCHMC
32
RMA – Quantile normalization
harvard.rmaNorm <- normalize(harvard.rmaBG, "quantiles")
hist(harvard.rmaNorm, main = "Harvard data after RMA BG &
normalization")
7/18/2015
Johannes Freudenberg, CCHMC
33
RMA – Quantile normalization
par(mfrow = c(2,3))
MAplot(harvard.rmaNorm)
7/18/2015
Johannes Freudenberg, CCHMC
34
PM correction
Background correction intended to adjust signal for
non-specific contributions such as
unspecific binding
cross-hybridization
auto- fluorescence of the surface
Traditional approach cannot be used with highdensity arrays
Therefore, Affymetrix invented MMs, which are
designed to measure non-specific signal contributions
Original idea: “true” signal = PM - MM
Problems:
PM < MM in approx. 30% of all probe pairs
MM signals really are specific (but less sensitive)
7/18/2015
Johannes Freudenberg, CCHMC
35
MAS 5.0 – PM correction
In MAS 5.0 “ideal” mismatch (IM) is computed
IM = MM
if MM is “well-behaved” (i.e. MM < PM)
IM = down-scaled MM
if MM > PM but most other MMs of the
corresponding probeset are “well-behaved”
IM PM
if most MMs are “defective”
Adjusted probe value is PM – IM
… for a more detailed description refer to
Affymetrix’ Statistical Algorithms Description
Document
7/18/2015
Johannes Freudenberg, CCHMC
36
MAS 5.0 – PM correction
harvard.mas5PMcorr <- pmcorrect.mas(harvard.mas5norm)
plot(log2(pm(harvard.mas5norm[,6])),log2(harvard.mas5PMcor
r[,6]), pch = ".", xlab = "PM", ylab = "corrected probe
value", main = "Harvard data after MAS 5.0 PM correction")
7/18/2015
Johannes Freudenberg, CCHMC
37
PM correction?
MMs are greater
than corresponding
PMs 30% of all
probe pairs
Therefore, RMA
does not do PM
correction
RMA uses only PMs
and ignores MMs
7/18/2015
Johannes Freudenberg, CCHMC
38
PM correction?
Hybridization behavior/
labeling efficiency
depends on middle base
Chudin et al., 2001
MM signals are specific
(but less sensitive)
7/18/2015
Johannes Freudenberg, CCHMC
39
PM correction!
Intensities depend on
position of A, C, G, and
T, respectively
Wu et al., 2004
Intensities depend on
number of A, C, G, and
T, respectively
New model based approach – GCRMA (Wu et al., 2004)
7/18/2015
Johannes Freudenberg, CCHMC
40
Summarization
Purpose
for every probe set (i.e. gene/ EST), merge the 16 to
20 probe values into a single expression value
MAS 5.0
Computes Tukey’s Bi-Weight (a robust weighted
mean) for every probe set
Does not “borrow” information from other arrays
RMA
Performs Median-Polish (a robust method to fit a
linear model similar to a two-way ANOVA model)
Information is “borrowed” from other arrays
Note: Reported RMA expression values are on log2
scale!
7/18/2015
Johannes Freudenberg, CCHMC
41
Summarization
harvard.mas5expr<-computeExprSet(harvard.mas5norm,"mas","mas")
harvard.rmaExpr <-computeExprSet(harvard.rmaNorm,"pmonly", "medianpolish")
plot(log2(exprs(harvard.mas5expr)[,1]), log2(exprs(harvard.mas5expr)[,4]),
xlab = "Adeno 1", ylab = "Normal 1", main = "MAS 5")
plot(exprs(harvard.rmaExpr)[,1], exprs(harvard.rmaExpr)[,6], xlab = "Adeno
1", ylab = "Normal 1", main = "RMA")
7/18/2015
Johannes Freudenberg, CCHMC
42
expresso – does it all at once
GCRMA
Example:
harvard.exprData <- expresso(harvard.rawData,
bgcorrect.method = "rma", normalize.method = "constant",
pmcorrect.method = "pmonly", summary.method = "avgdiff")
Result is an “expression set” which is a subclass of “AffyBatch”
7/18/2015
Johannes Freudenberg, CCHMC
43
Shortcuts
Some of the most widely used strategies
have wrapper functions
MAS 5.0
harvard.mas5 <- mas5(harvard.rawData)
RMA
harvard.rma <- rma(harvard.rawData)
GCRMA
library(gcrma)
harvard.gcrma <- gcrma(harvard.rawData)
7/18/2015
Johannes Freudenberg, CCHMC
44
What preprocessing strategy to use?
… No one knows
But there is evidence that
MAS 5.0 is not a good idea
RMA is a much better alternative
Other, model-based approaches work well (e.g.
GCRMA, VSN)
You can evaluate your favorite strategy using
benchmark data
Spike-In experiments (Affymetrix, GeneLogic)
Dilution experiment (GeneLogic)
7/18/2015
Johannes Freudenberg, CCHMC
45
affycomp
affycomp is an R package to evaluate your
favorite preprocessing strategy using publicly
available benchmark data
It’s also a website
(http://affycomp.biostat.jhsph.edu/)
69 different strategies submitted so far
Limitations of benchmark data(?)
Unrealistically high data quality
Unrealistic set-up
“Over-fitting”(?)
7/18/2015
Johannes Freudenberg, CCHMC
46
Take-home messages
Affymetrix microarrays are in-situ synthesized
oligonucleotide arrays
Each gene is represented by a set of perfect
match probes and mismatch probes
Raw probe intensities require preprocessing
Background correction (optional)
Normalization (recommended)
PM correction using MM probes (optional)
Summarization (required)
BioConductor’s affy package offers the
necessary tools for handling Affymetrix data
It really does matter which strategy you use!
7/18/2015
Johannes Freudenberg, CCHMC
47
Affymetrix and Limma
Limma can handle Affy data
Just to double check…
> harvard.rma@phenoData@pData
sample
adeno1.CEL
1
adeno2.CEL
2
adeno3.CEL
3
normal1.CEL
4
normal2.CEL
5
normal3.CEL
6
Load the library…
> library(limma)
Please read the user’s guide…
> limmaUsersGuide()
7/18/2015
Johannes Freudenberg, CCHMC
49
Limma can handle Affy data
Need to define regression model…
> design <model.matrix(~1+factor(c(0, 0, 0, 1, 1, 1)))
> colnames(design) <- c("adeno", "normVsCa")
> design
adeno normVsCa
1
1
0
2
1
0
3
1
0
4
1
1
5
1
1
6
1
1
First coefficient
estimates mean log-expression for adeno carcinoma and
plays the role of an intercept
Second coefficient
estimates difference between carcinoma and normal tissue
7/18/2015
Johannes Freudenberg, CCHMC
50
Limma can handle Affy data
Differentially expressed genes can be
found by
> fit <- lmFit(harvard.rma,
design)
> fit <- eBayes(fit)
> topTable(fit, 30, coef =
"normVsCa", adjust = "BH")
7/18/2015
Johannes Freudenberg, CCHMC
51
lmFit – linear model fit
Fits gene expression data to a linear
regression model such as
y = sytematic effect1
+ sytematic effect2
+…
+ random effect
Goal is to separate random variation from
systematic effects
Example
expr(genei) = baseline expri + carcinoma effecti + εij
εij ~ N(μi, σi)
7/18/2015
Johannes Freudenberg, CCHMC
52
lmFit – linear model fit
7/18/2015
Johannes Freudenberg, CCHMC
53
eBayes – empirical Bayes statistics
for differential expression
Mario will talk about this in detail but briefly
Underestimated variance leads to overestimated tstatistic which leads to false positives
Overestimated variance leads to underestimated tstatistic which leads to false negatives
eBayes improves t-statistic by replacing traditional
variance estimate with modified variance estimate
which “borrows” information from other genes
7/18/2015
Johannes Freudenberg, CCHMC
54
Visualize top 300 genes
heatmap() visualizes
intensities and clusters
genes and samples
> m <- topTable(fit,
300, coef="normVsCa",
adjust="BH")
> index <as.numeric(row.names(m))
> heatmap(-exprs(
harvard.rma)[index,])
7/18/2015
Johannes Freudenberg, CCHMC
55
References
Affymetrix, Statistical Algorithms Description Document. 2002, Affymetrix, Inc.: Santa Clara,
CA. http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf.
Bhattacharjee, A, et al. Classification of human lung carcinomas by mRNA expression
profiling reveals distinct adenocarcinoma subclasses. PNAS. 98 (24), 13790-13795,
November 2001.
BioConductor Vignettes. http://www.bioconductor.org/docs/vignettes.html.
BioConductor. Limma: Linear Models for Microarray Data User’s Guide.
http://www.bioconductor.org/.
Chudin E, Walker R, Kosaka A, Wu SX, Rabert D, Chang TK, Kreder DE. Assessment of the
relationship between signal intensities and transcript concentration for Affymetrix GeneChip
arrays. Genome Biol. 2002;3(1):RESEARCH0005. Epub 2001 Dec 14.
Dudoit S, Gentleman R, Irizarry R, Yang YH. Introduction to DNA Microarray Technologies.
Bioconductor Short Course, Winter 2002.
http://www.bioconductor.org/workshops/2002/Seattle02/MarrayTech.pdf.
Lipshutz RJ, Fodor SP, Gingeras TR, Lockhart DJ. High density synthetic oligonucleotide
arrays. Nat Genet. 1999 Jan; 21(1 Suppl):20-4.
Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C,
Kobayashi M, Horton H, Brown EL. Expression monitoring by hybridization to high-density
oligonucleotide arrays. Nat Biotechnol. 1996 Dec;14(13):1675-80
Speed, T. (2002). Summarizing and comparing GeneChip® data. Presentation at Affymetrix
User Meting 7.06.2002. http://stat-www.berkeley.edu/users/terry/zarray/Talks/affyusers.ppt
Wu Z, Irizarry RA, Gentleman R, Murillo FM, Spencer F. A Model Based Background
Adjustment for Oligonucleotide Expression Arrays. May 28, 2004. Johns Hopkins University,
Dept. of Biostatistics Working Papers. Working Paper 1.
http://www.bepress.com/jhubiostat/paper1
7/18/2015
Johannes Freudenberg, CCHMC
56