Affymetrix Arrays

Download Report

Transcript Affymetrix Arrays

Analysis of Affymetrix
GeneChip Data
EPP 245/298
Statistical Analysis of
Laboratory Data
1
Basic Design of Expression Arrays
• For each gene that is a target for the array,
we have a known DNA sequence.
• mRNA is reverse transcribed to DNA, and
if a complementary sequence is on the on
a chip, the DNA will be more likely to stick
• The DNA is labeled with a dye that will
fluoresce and generate a signal that is
monotonic in the amount in the sample
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
2
Intron
Exon
TAAATCGATACGCATTAGTTCGACCTATCGAAGACCCAACACGGATTCGATACGTTAATATGACTACCTGCGCAACCCTAACGTCCATGTATCTAATACG
ATTTAGCTATGCGTAATCAAGCTGGATAGCTTCTGGGTTGTGCCTAAGCTATGCAATTATACTGATGGACGCGTTGGGATTGCAGGTACATAGATTATGC
Probe Sequence
•cDNA arrays use variable length probes
•Long oligoarrays use 60-70mers
•Affymetrix GeneChips use multiple 25-mers
•For each exon, 8-20 distinct probes
•May overlap
•May cover more than one exon
•Affymetrix chips also use mismatch (MM) probes that have the same sequence
as perfect match probes except for the middle base which is changed to inhibit
binding.
•This is supposed to act as a control, but often instead binds to another mRNA
species, so many analysts do not use them
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
3
Expression Indices
• A key issue with Affymetrix chips is how to
summarize the multiple data values on a
chip for each probe set (aka gene).
• There have been a large number of
suggested methods.
• Generally, the worst ones are those from
Affy, by a long way; worse means less
able to detect real differences
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
4
Usable Methods
• Li and Wong’s dCHIP and follow on work
is demonstrably better than MAS 4.0 and
MAS 5.0, but not as good as RMA and
GLA
• The RMA method of Irizarry et al. is
available in Bioconductor.
• The GLA method (Durbin, Rocke, Zhou) is
available in R code, soon in Bioconductor
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
5
Bioconductor Documentation
> library(affy)
Welcome to Bioconductor
Vignettes contain introductory material. To view,
simply type: openVignette()
For details on reading vignettes, see
the openVignette help page.
> openVignette()
Please select (by number) a vignette
1: affy primer
2: affy: Built-in Processing Methods
3: affy: Custom Processing Methods (HowTo)
4: affy: Automatic downloading of cdfenvs (HowTo):
5: affy: Import Methods (HowTo)
6: Biobase Primer
7: Howto Bioconductor
8: HowTo HowTo
9: esApply Introduction
Selection: 6
[1] TRUE
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
6
The exprSet class
• An object of class exprSet has several
slots
– exprs A matrix of expression levels with
chips as columns and “genes” as rows.
– se.exprs A matrix of standard errors for
exprs if available.
– phenoData An object of class phenoData
containing phenotype or experimental data.
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
7
– description A description of the
experiment, object of class MIAME
– annotation A character string indicating the
base name for the associated annotation
– notes Notes describing features or aspects
of the data, experiment, etc.
• The phenoData class has slots
– pData A dataframe where rows are cases
and columns are variables
– varLabels A list of labels and descriptions
for the variables
– The number of rows in pData is the same as
the number of columns in exprs.
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
8
> library(affy)
> data(geneData)
> class(geneData)
[1] "matrix"
> dim(geneData)
[1] 500 26
> data(geneCov)
> dim(geneCov)
[1] 26 3
> covN <- list(cov1 = "Covariate 1; 2 levels",cov2 = "Covariate 2; 2 levels",
+
cov3 <- "Covariate 3; 3 levels")
> pD <- new("phenoData",pData=geneCov, varLabels = covN)
> eSet <- new("exprSet",exprs=geneData,phenoData=pD)
> dim(eSet@exprs)
[1] 500 26
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
9
Reading Affy Data into R
• The CEL files contain the data from an
array. We will look at data from an older
type of array, the U95A which contains
12,625 probe sets and 409,600 probes.
• The CDF file contains information relating
probe pair sets to locations on the array.
These are built into the affy package for
standard types.
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
10
The ReadAffy function
• ReadAffy() function reads all of the CEL
files in the current working directory into
an object of class AffyBatch
• ReadAffy(widget=T) does so in a GUI
that allows entry of other characteristics of
the dataset
• You can also specify filenames, phenotype
or experimental data, and MIAME
information
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
11
> getClass("exprSet")
Slots:
Name:
Class:
exprs
exprMatrix
annotation
character
se.exprs
exprMatrix
phenoData
description
phenoData characterORMIAME
notes
character
> getClass("AffyBatch")
Slots:
Name:
Class:
cdfName
character
se.exprs
exprMatrix
nrow
numeric
ncol
numeric
phenoData
description
phenoData characterORMIAME
exprs
exprMatrix
annotation
character
notes
character
Extends: "exprSet"
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
12
A Sample Experiment
• This consists of 10 samples one each from
10 cell lines.
• The cell lines are of 5 types
• We have 10 Affymetrix U95A arrays
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
13
group <- data.frame(as.factor(rep(1:5,each=2)))
covN <- list(group = "Type of Cell Line")
pD <- new("phenoData",pData=group,varLabels=covN)
myMIAME <- new("MIAME",name="John Post, PhD",
lab="Jane Q. Investigator, MD",
contact="916-734-0000",
title="Sample Experiment for EPP 298")
Data <- ReadAffy(filenames=c("LN0A.CEL","LN0B.CEL","LN1A.CEL","LN1B.CEL",
"LN2A.CEL","LN2B.CEL","LN3A.CEL","LN3B.CEL","LN4A.CEL","LN4B.CEL"),
phenoData=pD,description=myMIAME)
> dim(Data@exprs)
[1] 409600
10
> Data@exprs[1:5,]
LN0A.CEL LN0B.CEL LN1A.CEL LN1B.CEL LN2A.CEL LN2B.CEL LN3A.CEL LN3B.CEL LN4A.CEL LN4B.CEL
[1,]
137.0
103.0
120.0
133.0
124.0
188.0
183.0
143.0
117.0
93
[2,]
2484.0
2040.8
2065.3
1687.5
9454.5
7732.8
7813.5
9873.8
2286.8
1930
[3,]
274.0
146.0
145.0
133.0
159.0
190.0
185.0
153.0
133.0
141
[4,]
1966.8
2387.0
2465.0
1961.0
9753.0
8088.0
7047.0 14705.0
2593.0
2125
[5,]
71.5
51.5
65.0
58.5
73.3
68.3
66.3
87.3
58.8
53
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
14
> Data@phenoData
phenoData object with 1 variables and 10 cases
varLabels
group: Type of Cell Line
> Data@phenoData@pData
as.factor.rep.1.5..each...2..
1
1
2
1
3
2
4
2
5
3
6
3
7
4
8
4
9
5
10
5
> Data@description
Experimenter name: John Post, PhD
Laboratory: Jane Q. Investigator, MD
Contact information: 916-734-0000
Title: Sample Experiment for EPP 298
URL:
No abstract available.
Information is available on: preprocessing
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
15
Expression Indices
• The 409,600 rows of the expression matrix
in the AffyBatch object Data each
correspond to a probe (25-mer)
• Ordinarily to use this we need to combine
the probe level data for each probe set
into a single expression number
• This has conceptually several steps
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
16
Steps in Expression Index
Construction
• Background correction is the process of
adjusting the signals so that the zero point
is similar on all parts of all arrays.
• We like to manage this so that zero signal
after background correction corresponds
approximately to zero amount of the
mRNA species that is the target of the
probe set.
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
17
• Data transformation is the process of
changing the scale of the data so that it is
more comparable from high to low.
• Common transformations are the
logarithm and generalized logarithm
• Normalization is the process of adjusting
for systematic differences from one array
to another.
• Normalization may be done before or after
transformation, and before or after probe
set summarization.
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
18
• One may use only the perfect match (PM)
probes, or may subtract or otherwise use
the mismatch (MM) probes
• There are many ways to summarize 20
PM probes and 20 MM probes on 10
arrays (total of 200 numbers) into 10
expression index numbers
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
19
The RMA Method
• Background correction that does not make
0 signal correspond to 0 amount
• Quantile normalization
• Log2 transform
• Median polish summary of PM probes
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
20
> eidata <- rma(Data)
Background correcting
Normalizing
Calculating Expression
> class(eidata)
[1] "exprSet"
attr(,"package")
[1] "Biobase"
> dim(eidata@exprs)
[1] 12625
10
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
21
> summary(eidata@exprs)
LN0A.CEL
LN0B.CEL
Min.
: 2.724
Min.
: 2.598
1st Qu.: 4.489
1st Qu.: 4.469
Median : 6.090
Median : 6.071
Mean
: 6.135
Mean
: 6.139
3rd Qu.: 7.450
3rd Qu.: 7.485
Max.
:12.175
Max.
:12.258
LN2A.CEL
LN2B.CEL
Min.
: 2.622
Min.
: 2.743
1st Qu.: 4.461
1st Qu.: 4.474
Median : 6.016
Median : 6.067
Mean
: 6.124
Mean
: 6.139
3rd Qu.: 7.438
3rd Qu.: 7.434
Max.
:13.277
Max.
:13.346
LN4A.CEL
LN4B.CEL
Min.
: 2.775
Min.
: 2.667
1st Qu.: 4.483
1st Qu.: 4.449
Median : 6.077
Median : 6.053
Mean
: 6.136
Mean
: 6.134
3rd Qu.: 7.469
3rd Qu.: 7.488
Max.
:12.058
Max.
:12.153
October 27, 2004
LN1A.CEL
Min.
: 2.634
1st Qu.: 4.471
Median : 6.076
Mean
: 6.135
3rd Qu.: 7.474
Max.
:12.140
LN3A.CEL
Min.
: 2.687
1st Qu.: 4.433
Median : 6.021
Mean
: 6.130
3rd Qu.: 7.456
Max.
:13.267
EPP 245 Statistical Analysis of
Laboratory Data
LN1B.CEL
Min.
: 2.655
1st Qu.: 4.482
Median : 6.086
Mean
: 6.142
3rd Qu.: 7.476
Max.
:11.999
LN3B.CEL
Min.
: 2.655
1st Qu.: 4.447
Median : 6.034
Mean
: 6.132
3rd Qu.: 7.465
Max.
:13.287
22
The GLA Method
• The Glog Average (GLA) method is simpler than
the RMA method, though it can require
estimation of a parameter
• Background correction is intended to make a
measured value of zero correspond to a zero
quantity in the sample
• Transformation uses the glog ~ ln for large
values
• Normalization via lowess
• Summary is a simple average of PM probes
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
23
> source("gla.r")
> emat.gla <- gla(Data)
> class(emat.gla)
[1] "matrix"
> dim(emat.gla)
[1] 12625
10
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
24
> summary(emat.gla)
X1
X2
Min.
:3.859
Min.
:3.801
1st Qu.:4.948
1st Qu.:4.939
Median :5.456
Median :5.449
Mean
:5.569
Mean
:5.573
3rd Qu.:6.032
3rd Qu.:6.052
Max.
:8.794
Max.
:8.724
X5
X6
Min.
: 3.800
Min.
: 3.896
1st Qu.: 4.950
1st Qu.: 4.952
Median : 5.455
Median : 5.459
Mean
: 5.595
Mean
: 5.578
3rd Qu.: 6.058
3rd Qu.: 6.030
Max.
:10.083
Max.
:10.270
X9
X10
Min.
:3.907
Min.
:3.835
1st Qu.:4.949
1st Qu.:4.937
Median :5.452
Median :5.453
Mean
:5.569
Mean
:5.579
3rd Qu.:6.033
3rd Qu.:6.053
Max.
:8.772
Max.
:8.732
October 27, 2004
X3
Min.
:3.811
1st Qu.:4.942
Median :5.446
Mean
:5.569
3rd Qu.:6.038
Max.
:8.733
X7
Min.
:3.792
1st Qu.:4.945
Median :5.458
Mean
:5.597
3rd Qu.:6.062
Max.
:9.948
EPP 245 Statistical Analysis of
Laboratory Data
X4
Min.
:3.816
1st Qu.:4.942
Median :5.453
Mean
:5.566
3rd Qu.:6.034
Max.
:8.731
X8
Min.
:3.866
1st Qu.:4.936
Median :5.441
Mean
:5.578
3rd Qu.:6.036
Max.
:9.976
25
glog <- function(y,lambda)
{
yt <- log(y+sqrt(y^2+lambda))
return(yt)
}
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
26
lnorm<- function(mat1,span=.1)
{
print ("Start Lowess Normalization")
mat2 <- as.matrix(mat1)
p <- dim(mat2)[1]
n <- dim(mat2)[2]
rmeans <- apply(mat2,1,mean)
r<-rnorm(p,0,1e-7)
rmeans<-r+rmeans
rranks <- rank(rmeans)
matsort <- mat2[order(rranks),]
r0 <- 1:p
lcol <- function(x)
{
lx <- lowess(r0,x,f=span)$y
# returns vector of length p
}
lmeans <- apply(matsort,2,lcol) # column lowess
lgrand <- apply(lmeans,1,mean) # grand lowess
lgrand <- matrix(rep(lgrand,n),byrow=F,ncol=n) # grand lowess
matnorm0 <- matsort-lmeans+lgrand # normalized matrix
matnorm1 <- matnorm0[rranks,] # returns row order to original
print ("End Lowess Normalization")
return(matnorm1)
}
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
27
gla <- function(AB,lambda=1000,alpha=50)
{
gn <- geneNames(AB)
I <- length(gn)
#
# ind will be a vector with i repeated as many
# times as gn[i] has PM probes
# This shows which gene number is associated with each probe
#
ind<-vector()
for (i in 1:I)
{
ind <- c(ind, rep(i, dim(pm(AB,gn[i]))[1]))
}
mat <- pm(AB)
EI <- matrix(0, I, dim(mat)[2])
mat.glog<-lnorm(glog(mat-alpha,lambda))
for (i in 1:I)
{
tmp <- apply(mat.glog[ind==i,],2,mean)
EI[i,] <- t(tmp)
}
return(EI)
}
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
28
Probe Sets not Genes
• It is unavoidable to refer to a probe set as
measuring a “gene”, but nevertheless it can be
deceptive
• The annotation of a probe set may be based on
homology with a gene of possibly known
function in a different organism
• Only a relatively few probe sets correspond to
genes with known function and known structure
in the organism being studied
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
29
Exercise
• Download the eight arrays from the web
site, and the gla.r file
• Load the arrays into R using Read.Affy
and construct the RMA expression indices
• Source the gla.r functions, and construct
the GLA expression indices
October 27, 2004
EPP 245 Statistical Analysis of
Laboratory Data
30