Affymetrix and BioConductor

Download Report

Transcript Affymetrix and BioConductor

Affymetrix and BioConductor
Johannes Freudenberg
Cincinnati Children’s Hospital Medical Center
[email protected]
Overview
 Affymetrix' high-density oligonucleotide microarrays
 GeneChip® Technology
 Terminology
 Spotted vs. Affymetrix Arrays
 Preprocessing Affymetrix MA data using BioConductor





Overview
Background correction
Normalization
PM correction
Summarization
 Many preprocessing strategies available – which one
to use?
 Affymetrix & Limma
7/18/2015
Johannes Freudenberg, CCHMC
2
Affymetrix GeneChip®
(=0.5 inch)
Dudoit et al., 2002
7/18/2015
Johannes Freudenberg, CCHMC
3
Affymetrix GeneChip® technology
Lipshutz et al., 1999
7/18/2015
Johannes Freudenberg, CCHMC
4
Affymetrix GeneChip® technology (2)
Original Publication
(Lockhart et al., 1996)
Affymetrix (http://www.affymetrix.com/technology/ge_analysis/index.affx)
7/18/2015
Johannes Freudenberg, CCHMC
5
Affymetrix GeneChip® technology (3)
Lipshutz et al., 1999
7/18/2015
Johannes Freudenberg, CCHMC
6
Terminology
 Probe: an oligonucleotide of 25 base-pairs, i.e., a 25-mer.
 Perfect match (PM): A 25-mer complementary to a reference
sequence of interest (gene, ).
 Mismatch (MM): same as PM but with base change for the
middle (13th) base (transversion purine <-> pyrimidine, G <>C, A <->T) .
The purpose of the MM probe design is to measure non-specific
binding and background noise.
 Probe-pair: a (PM,MM) pair.
 Probe set: a collection of probe-pairs (11 to 20) related to a
common gene or EST.
 AffyID: an identifier for a probe-pair set (eg. “A28102_at”)
 CEL file: text (or binary) file containing raw probe intensities for
a single chip created by MAS (GCOS) software
 MAS: Microarray Suite – Affymetrix software package
 GCOS: GeneChip operating software – new Affymetrix software
Adapted from
Dudoit et al., 2002
7/18/2015
Johannes Freudenberg, CCHMC
7
Affymetrix vs. other technologies
Feature
Affymetrix
Other technologies
Probe synthesis
- In-situ synthesized
- Spotted cDNA
- Oligos attached to beads
Probes length
- 25-mers
- 50-mers
- Varying lengths
Probe density
- High density
- Low density
Probes per gene
- 11-20 different probe pairs
- One probe
- Multiple identical probes
Labeling
- Biotin
- Cy3, Cy5, etc.
Hybridization
- One target per spot
- Competitive
7/18/2015
Johannes Freudenberg, CCHMC
8
What’s the evidence?
 Expectation/ Hope(?):
measured hybridization
intensity proportional
to # of mRNA
transcripts
 Spike-in experiment
seems to confirm this
 However
 Only ten genes
 Selection of these
genes?
Lockhart et al., 1996
7/18/2015
Johannes Freudenberg, CCHMC
9
Affymetrix chips – Summary
Biological
Sample
Extracted
mRNA
Labeled
mRNA
Hybridization,
scanning & image
processing
Data
(pre-)processing
CEL files
7/18/2015
Johannes Freudenberg, CCHMC
10
Preprocessing for Affymetrix GeneChips®
CEL files
within-chip
cross-chip
sequence
specific
background
correction
within-probe set
aggregation of
intensity values
Two “standard” methods
 MAS 5.0 (now GCOS/GDAS) by Affymetrix
 RMA by Speed group (UC Berkeley)
7/18/2015
Johannes Freudenberg, CCHMC
11
affy – BioConductor library
 “extensible, interactive environment for data
analysis and exploration of Affymetrix
oligonucleotide array probe level data”
(www.bioconductor.org)
 Contains functions to




Load
Store
Plot, and
Preprocess Affymetrix microarray data
 Many other related packages
 affycomp, affydata, affyPLM, annaffy, gcrma,
makecdfenv, matchprobes, simpleaffy, vsn, …
7/18/2015
Johannes Freudenberg, CCHMC
12
affy – BioConductor library
 … if you haven‘t already done that
 Download BioConductor install script
source("http://www.bioconductor.org/biocLite.R")
(or source("http://www.bioconductor.org/getBioC.R"))
 Run the script
biocLite() (or getBioC())
 Load the “affy” library
library(affy)
 Load our example data (3+3 subset from
Bhattacharjee et al., 2001)
 Run the following script to download the six CEL files (*)
source("http://eh3.uc.edu/affy/downloadAffy.R")
 Load CEL files into R
harvard.rawData <- ReadAffy()
 If you prefer interactive
harvard.rawData <- ReadAffy(widget = T)
7/18/2015
Johannes Freudenberg, CCHMC
(*)
Note: You may
have to change your
working directory to a
path where you have
writing permission
13
What is an AffyBatch object?
 Take a first look at the experiment data
harvard.rawData
 Experiment data is stored in an AffyBatch
object
slotNames(harvard.rawData)
 "cdfName" – GeneChip version
 "nrow", "ncol" – size of the chip (usually 640x640)
 "exprs" – expression matrix, contains all PM and
MM intensities of the experiment
 "phenoData" – experiment annotation
 "description", "annotation" – more annotation slots
 …
7/18/2015
Johannes Freudenberg, CCHMC
14
Accessing an AffyBatch object
 Extracting
 the expression matrix
(*)
Don’t try this at home…
exprs(harvard.rawData)(*) or
intensity(harvard.rawData)(*)
 PM values
pm(harvard.rawData)[1:10,]
 MM values
mm(harvard.rawData)[1:10,]
 probe names
probeNames(harvard.rawData)[1:10]
 gene names
geneNames(harvard.rawData)[1:10]
 a probe set
probeset(harvard.rawData, "100_g_at")
 the type of GeneChip
cdfName(harvard.rawData)
 …
7/18/2015
Johannes Freudenberg, CCHMC
15
How does my data look? –
Diagnostic plot I
 Plot an image of an array
image(harvard.rawData[,1])
7/18/2015
Johannes Freudenberg, CCHMC
16
How does my data look? –
Diagnostic plot II
 Plot the intensity distribution
hist(harvard.rawData, main = "Harvard data")
7/18/2015
Johannes Freudenberg, CCHMC
17
How does my data look? –
Diagnostic plot III
 MvA plots
MAplot(harvard.rawData)
7/18/2015
Johannes Freudenberg, CCHMC
18
How does my data look?
 Plot a probe set
plot(probeset(harvard.rawData,
geneNames(harvard.rawData)[1])[[1]])
7/18/2015
Johannes Freudenberg, CCHMC
19
How does my data look?
 Plot a probe set
par(mfrow = c(2, 3))
barplot(probeset(harvard.rawData,geneNames(harvard.rawData)[1])[[1]])
7/18/2015
Johannes Freudenberg, CCHMC
20
MAS 5.0 - Background correction
 Intended to correct for optical
effects
 Divide array into K zones
(default K = 16)
 Lowest 2% of the intensities in
zone k are used to compute
background bZk and noise nZk of
zone k
 Background b(x, y) of cell (x, y)
is the weighted sum of all bZk
 Noise n(x,y) is computed
likewise
 Background corrected intensity
A(x,y) = max(I(x,y) – b(x,y),
0.5*n(x,y))
7/18/2015
Johannes Freudenberg, CCHMC
(Affymetrix, 2002)
21
MAS 5.0 - Background correction
harvard.mas5BG <- bg.correct(harvard.rawData,"mas")
hist(harvard.mas5BG, main = "Harvard data after MAS
5.0 BG correction")
7/18/2015
Johannes Freudenberg, CCHMC
22
MAS 5.0 - Background correction
par(mfrow = c(2,3))
MAplot(harvard.mas5BG)
7/18/2015
Johannes Freudenberg, CCHMC
23
RMA Background correction
 S observed PM intensity
 Model: S sum of
 “true” signal X and
 background signal Y
 S = X + Y, where X ~ Exp(), Y~ N(,²)
independent random variables
X
Y
+
S
=
(Speed, 2002)
7/18/2015
Johannes Freudenberg, CCHMC
24
RMA Background correction
E(X | S = s)
 Correct for
background
by replacing
S with
E(X | S = s)
 To do that
estimate
, , 
from data
S
 Let a = s -  - ²
b=
  a   s  a    a 
 sa 
 E(X| S  s)  a  b     
 /    
  1
 b
7/18/2015
 b    b 
Johannes Freudenberg, CCHMC
 b 

25
RMA - Background correction
harvard.rmaBG <- bg.correct(harvard.rawData,"rma")
hist(harvard.rmaBG, main = "Harvard data after RMA
BG correction")
7/18/2015
Johannes Freudenberg, CCHMC
26
RMA - Background correction
par(mfrow = c(2,3))
MAplot(harvard.rmaBG)
7/18/2015
Johannes Freudenberg, CCHMC
27
Why should we normalize?
 … to remove chip effects
7/18/2015
Johannes Freudenberg, CCHMC
28
MAS 5.0 – Global Constant Normalization
 Global constant normalization: Statistical parameters are
used as global (= per-chip) scaling factor, such as:
 Sum
 Median, Quantiles/Percentiles
 Mean (also trimmed mean, asymmetric trimmed
mean)
 Normalization transforms data – afterwards parameter is
equal for all chips
 Intensity independent normalization(!)
 MAS 5.0: 2% trimmed mean m (default = 500) on the
linear scale (as opposed to the log scale)
 Note: In MAS 5.0, this step is done after summarization
7/18/2015
Johannes Freudenberg, CCHMC
29
MAS 5.0 – Global Constant Normalization
harvard.mas5norm <- normalize(harvard.mas5BG, "constant")
hist(harvard.mas5norm, main = "Harvard data after MAS 5.0
BG & normalization")
7/18/2015
Johannes Freudenberg, CCHMC
30
MAS 5.0 – Global Constant Normalization
par(mfrow = c(2,3))
MAplot(harvard.mas5norm)
7/18/2015
Johannes Freudenberg, CCHMC
31
RMA – Quantile normalization
 Assumption: True intensity
distributions identical in
all replicate samples
 Then all points lie on the
diagonal in a Q-Q plot
 Works the same way
in higher dimensions
 If observed intensity
distribution are different
“force” them to be equal
7/18/2015
Johannes Freudenberg, CCHMC
32
RMA – Quantile normalization
harvard.rmaNorm <- normalize(harvard.rmaBG, "quantiles")
hist(harvard.rmaNorm, main = "Harvard data after RMA BG &
normalization")
7/18/2015
Johannes Freudenberg, CCHMC
33
RMA – Quantile normalization
par(mfrow = c(2,3))
MAplot(harvard.rmaNorm)
7/18/2015
Johannes Freudenberg, CCHMC
34
PM correction
 Background correction intended to adjust signal for
non-specific contributions such as
 unspecific binding
 cross-hybridization
 auto- fluorescence of the surface
 Traditional approach cannot be used with highdensity arrays
 Therefore, Affymetrix invented MMs, which are
designed to measure non-specific signal contributions
 Original idea: “true” signal = PM - MM
 Problems:
 PM < MM in approx. 30% of all probe pairs
 MM signals really are specific (but less sensitive)
7/18/2015
Johannes Freudenberg, CCHMC
35
MAS 5.0 – PM correction
 In MAS 5.0 “ideal” mismatch (IM) is computed
 IM = MM
if MM is “well-behaved” (i.e. MM < PM)
 IM = down-scaled MM
if MM > PM but most other MMs of the
corresponding probeset are “well-behaved”
 IM  PM
if most MMs are “defective”
 Adjusted probe value is PM – IM
 … for a more detailed description refer to
Affymetrix’ Statistical Algorithms Description
Document
7/18/2015
Johannes Freudenberg, CCHMC
36
MAS 5.0 – PM correction
harvard.mas5PMcorr <- pmcorrect.mas(harvard.mas5norm)
plot(log2(pm(harvard.mas5norm[,6])),log2(harvard.mas5PMcor
r[,6]), pch = ".", xlab = "PM", ylab = "corrected probe
value", main = "Harvard data after MAS 5.0 PM correction")
7/18/2015
Johannes Freudenberg, CCHMC
37
PM correction?
 MMs are greater
than corresponding
PMs 30% of all
probe pairs
 Therefore, RMA
does not do PM
correction
 RMA uses only PMs
and ignores MMs
7/18/2015
Johannes Freudenberg, CCHMC
38
PM correction?
 Hybridization behavior/
labeling efficiency
depends on middle base
Chudin et al., 2001
 MM signals are specific
(but less sensitive)
7/18/2015
Johannes Freudenberg, CCHMC
39
PM correction!
 Intensities depend on
position of A, C, G, and
T, respectively
Wu et al., 2004
 Intensities depend on
number of A, C, G, and
T, respectively
 New model based approach – GCRMA (Wu et al., 2004)
7/18/2015
Johannes Freudenberg, CCHMC
40
Summarization
 Purpose
 for every probe set (i.e. gene/ EST), merge the 16 to
20 probe values into a single expression value
 MAS 5.0
 Computes Tukey’s Bi-Weight (a robust weighted
mean) for every probe set
 Does not “borrow” information from other arrays
 RMA
 Performs Median-Polish (a robust method to fit a
linear model similar to a two-way ANOVA model)
 Information is “borrowed” from other arrays
 Note: Reported RMA expression values are on log2
scale!
7/18/2015
Johannes Freudenberg, CCHMC
41
Summarization
harvard.mas5expr<-computeExprSet(harvard.mas5norm,"mas","mas")
harvard.rmaExpr <-computeExprSet(harvard.rmaNorm,"pmonly", "medianpolish")
plot(log2(exprs(harvard.mas5expr)[,1]), log2(exprs(harvard.mas5expr)[,4]),
xlab = "Adeno 1", ylab = "Normal 1", main = "MAS 5")
plot(exprs(harvard.rmaExpr)[,1], exprs(harvard.rmaExpr)[,6], xlab = "Adeno
1", ylab = "Normal 1", main = "RMA")
7/18/2015
Johannes Freudenberg, CCHMC
42
expresso – does it all at once
GCRMA
 Example:
harvard.exprData <- expresso(harvard.rawData,
bgcorrect.method = "rma", normalize.method = "constant",
pmcorrect.method = "pmonly", summary.method = "avgdiff")
 Result is an “expression set” which is a subclass of “AffyBatch”
7/18/2015
Johannes Freudenberg, CCHMC
43
Shortcuts
 Some of the most widely used strategies
have wrapper functions
 MAS 5.0
harvard.mas5 <- mas5(harvard.rawData)
 RMA
harvard.rma <- rma(harvard.rawData)
 GCRMA
library(gcrma)
harvard.gcrma <- gcrma(harvard.rawData)
7/18/2015
Johannes Freudenberg, CCHMC
44
What preprocessing strategy to use?
 … No one knows
 But there is evidence that
 MAS 5.0 is not a good idea
 RMA is a much better alternative
 Other, model-based approaches work well (e.g.
GCRMA, VSN)
 You can evaluate your favorite strategy using
benchmark data
 Spike-In experiments (Affymetrix, GeneLogic)
 Dilution experiment (GeneLogic)
7/18/2015
Johannes Freudenberg, CCHMC
45
affycomp
 affycomp is an R package to evaluate your
favorite preprocessing strategy using publicly
available benchmark data
 It’s also a website
(http://affycomp.biostat.jhsph.edu/)
 69 different strategies submitted so far
 Limitations of benchmark data(?)
 Unrealistically high data quality
 Unrealistic set-up
 “Over-fitting”(?)
7/18/2015
Johannes Freudenberg, CCHMC
46
Take-home messages
 Affymetrix microarrays are in-situ synthesized
oligonucleotide arrays
 Each gene is represented by a set of perfect
match probes and mismatch probes
 Raw probe intensities require preprocessing




Background correction (optional)
Normalization (recommended)
PM correction using MM probes (optional)
Summarization (required)
 BioConductor’s affy package offers the
necessary tools for handling Affymetrix data
 It really does matter which strategy you use!
7/18/2015
Johannes Freudenberg, CCHMC
47
Affymetrix and Limma
Limma can handle Affy data
 Just to double check…
> harvard.rma@phenoData@pData
sample
adeno1.CEL
1
adeno2.CEL
2
adeno3.CEL
3
normal1.CEL
4
normal2.CEL
5
normal3.CEL
6
 Load the library…
> library(limma)
 Please read the user’s guide…
> limmaUsersGuide()
7/18/2015
Johannes Freudenberg, CCHMC
49
Limma can handle Affy data
 Need to define regression model…
> design <model.matrix(~1+factor(c(0, 0, 0, 1, 1, 1)))
> colnames(design) <- c("adeno", "normVsCa")
> design
adeno normVsCa
1
1
0
2
1
0
3
1
0
4
1
1
5
1
1
6
1
1
First coefficient
 estimates mean log-expression for adeno carcinoma and
 plays the role of an intercept
Second coefficient
 estimates difference between carcinoma and normal tissue
7/18/2015
Johannes Freudenberg, CCHMC
50
Limma can handle Affy data
 Differentially expressed genes can be
found by
> fit <- lmFit(harvard.rma,
design)
> fit <- eBayes(fit)
> topTable(fit, 30, coef =
"normVsCa", adjust = "BH")
7/18/2015
Johannes Freudenberg, CCHMC
51
lmFit – linear model fit
 Fits gene expression data to a linear
regression model such as
 y = sytematic effect1
+ sytematic effect2
+…
+ random effect
 Goal is to separate random variation from
systematic effects
 Example
 expr(genei) = baseline expri + carcinoma effecti + εij
εij ~ N(μi, σi)
7/18/2015
Johannes Freudenberg, CCHMC
52
lmFit – linear model fit
7/18/2015
Johannes Freudenberg, CCHMC
53
eBayes – empirical Bayes statistics
for differential expression
 Mario will talk about this in detail but briefly
 Underestimated variance leads to overestimated tstatistic which leads to false positives
 Overestimated variance leads to underestimated tstatistic which leads to false negatives
 eBayes improves t-statistic by replacing traditional
variance estimate with modified variance estimate
which “borrows” information from other genes
7/18/2015
Johannes Freudenberg, CCHMC
54
Visualize top 300 genes
 heatmap() visualizes
intensities and clusters
genes and samples
> m <- topTable(fit,
300, coef="normVsCa",
adjust="BH")
> index <as.numeric(row.names(m))
> heatmap(-exprs(
harvard.rma)[index,])
7/18/2015
Johannes Freudenberg, CCHMC
55
References










Affymetrix, Statistical Algorithms Description Document. 2002, Affymetrix, Inc.: Santa Clara,
CA. http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf.
Bhattacharjee, A, et al. Classification of human lung carcinomas by mRNA expression
profiling reveals distinct adenocarcinoma subclasses. PNAS. 98 (24), 13790-13795,
November 2001.
BioConductor Vignettes. http://www.bioconductor.org/docs/vignettes.html.
BioConductor. Limma: Linear Models for Microarray Data User’s Guide.
http://www.bioconductor.org/.
Chudin E, Walker R, Kosaka A, Wu SX, Rabert D, Chang TK, Kreder DE. Assessment of the
relationship between signal intensities and transcript concentration for Affymetrix GeneChip
arrays. Genome Biol. 2002;3(1):RESEARCH0005. Epub 2001 Dec 14.
Dudoit S, Gentleman R, Irizarry R, Yang YH. Introduction to DNA Microarray Technologies.
Bioconductor Short Course, Winter 2002.
http://www.bioconductor.org/workshops/2002/Seattle02/MarrayTech.pdf.
Lipshutz RJ, Fodor SP, Gingeras TR, Lockhart DJ. High density synthetic oligonucleotide
arrays. Nat Genet. 1999 Jan; 21(1 Suppl):20-4.
Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C,
Kobayashi M, Horton H, Brown EL. Expression monitoring by hybridization to high-density
oligonucleotide arrays. Nat Biotechnol. 1996 Dec;14(13):1675-80
Speed, T. (2002). Summarizing and comparing GeneChip® data. Presentation at Affymetrix
User Meting 7.06.2002. http://stat-www.berkeley.edu/users/terry/zarray/Talks/affyusers.ppt
Wu Z, Irizarry RA, Gentleman R, Murillo FM, Spencer F. A Model Based Background
Adjustment for Oligonucleotide Expression Arrays. May 28, 2004. Johns Hopkins University,
Dept. of Biostatistics Working Papers. Working Paper 1.
http://www.bepress.com/jhubiostat/paper1
7/18/2015
Johannes Freudenberg, CCHMC
56