Probe Level Analysis of Affymetrix™ Data

Mark Reimers, NCI
Outline






Design of Affy probesets
Background
Normalization
Non-specific hybridization
Estimation
Comparison of Methods
Affymetrix GeneChip® Probe Arrays
[Figure: Image of Hybridized Probe Array. Labels: GeneChip Probe Array; Hybridized Probe Cell; single-stranded, fluorescently labeled DNA target; oligonucleotide probe; 20 µm; 1.28 cm]
Each probe cell or feature contains millions of copies of a specific oligonucleotide probe
Over 400,000 different probes complementary to genetic information of interest
Affymetrix Probe Design
[Figure: published gene sequence, 5´ to 3´, covered by multiple (11-20) 25-base oligonucleotide probes, each as a Perfect Match and Mismatch pair]
PM is exactly complementary to the published sequence
MM differs from PM at the 13th base
Chip Layout



Typical chips are square: 640x640 (U95A), 712x712 (U133) or 1042x1042 (Plus2)
Older chips placed all probes for one gene in a row
Modern chips distribute probes according to sequence, not gene
Chip Nomenclature




HGU133A - Human Genome: UniGene build 133, first chip
PM - ‘perfect match’
MM - ‘mismatch’
Control sequence - sequence from unrelated organism
Signal - intensity; doesn’t translate directly to abundance
Cross-hybridization - binding of sequences other than target
Affymetrix Background Adjustment and Normalization
What’s the Issue?

Background: some Affy chips show consistently higher values for the lowest signals (presumably absent) than others
  Background may vary over a chip
Normalization: distribution of probe signals may differ between chips, independent of background adjustment
  PM and MM may be shifted differently
Probe Intensities in 23 Replicates
Approaches to Background



Subtract common estimate of background
Fit local background across chip and subtract - MAS 5.0
Consider background as random variable
  Use statistical theory to derive background correction
RMA ‘Bayesian’ BG Correction

Each S = BG + Intensity + e
  BG randomly sampled from Normal distribution
  Intensity randomly sampled from exponential distribution
Estimate mean and SD of BG distribution by fitting values below mode of signal distribution
Estimate Intensity, conditional on S, by integrating over possible values of BG
$I = \int_{0}^{S} (S - x)\, dN_{\mu,\sigma}(x) \,/\, K$
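The conditional expectation above has a well-known closed form if the noise term e is ignored, so that S = BG + Intensity with BG ~ Normal(mu, sigma) and Intensity ~ Exponential(alpha). The Python sketch below uses that closed form with one further simplification: it ignores the constraint that BG is non-negative, which matters little when mu lies several sigma above zero. The function names and parameter values are illustrative assumptions, and estimating mu, sigma and alpha from the data is not shown.

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def rma_bg_correct(s, mu, sigma, alpha):
    """Approximate E[Intensity | S = s] for the normal + exponential model.

    Given S = s, Intensity follows a normal distribution with mean
    a = s - mu - alpha * sigma^2 and SD sigma, truncated to be positive.
    Note: for s far below mu, norm_cdf underflows to zero; that case
    is not handled in this sketch.
    """
    a = s - mu - alpha * sigma ** 2
    return a + sigma * norm_pdf(a / sigma) / norm_cdf(a / sigma)

# Illustrative parameter values (assumptions, not estimates from real chips).
for s in (50.0, 150.0, 1000.0):
    print(s, round(rma_bg_correct(s, mu=100.0, sigma=20.0, alpha=0.01), 2))
```

The corrected value is always positive, even when the observed signal is below the background mean, and for bright probes it approaches s - mu - alpha * sigma^2.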
Approaches to Normalization




Simple: find average of each chip; divide all values by chip average
MAS5: trimmed mean
Invariant set: find subset of probes in almost same rank order in each chip
Quantile normalization: fit to average quantiles across experiment
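The first two approaches can be sketched in a few lines of Python. This is a minimal illustration, not the MAS 5.0 implementation: the function names, the `target` value and the trim fraction used in the example are assumptions.

```python
def trimmed_mean(values, trim=0.0):
    """Mean after dropping a fraction `trim` of values at each end (trim < 0.5)."""
    xs = sorted(values)
    k = int(len(xs) * trim)          # number of values dropped at each end
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

def scale_normalize(chips, trim=0.0, target=1.0):
    """Rescale every chip so that its (trimmed) mean equals `target`."""
    normalized = []
    for chip in chips:
        center = trimmed_mean(chip, trim)
        normalized.append([v * target / center for v in chip])
    return normalized

# Two toy chips; the second is uniformly twice as bright as the first.
chips = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
print(scale_normalize(chips))

# A trimmed mean ignores the extreme value that dominates a plain mean.
print(trimmed_mean([1.0, 2.0, 3.0, 4.0, 100.0], trim=0.2))
```

With `trim=0.0` and `target=1.0` this is the simple method (divide by the chip average); a non-zero `trim` gives the trimmed-mean variant.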
Probes on Different Chips
Plots of two Affymetrix chips against the experiment means
MAS 5.0


Plot probes from each chip against common base-line chip
Fit regression line to middle 98% of probes
Invariant Set (Li-Wong) Method




Select baseline chip X
For each other chip Y:
  Select probes p1, …, pK (K ~ 10000), such that p1 < p2 < … < pK in both chips
  Fit running median through points { (x_p1, y_p1), …, (x_pK, y_pK) }
  Repeat
Quantile Method (RMA)




Distributions of probe intensities vary substantially among replicate chips
This cannot be even approximately resolved by any linear transformation
Drastic solution: ‘shoehorn’ all probe intensities into same distribution
Ideal distribution is taken as average of all
Quantile Normalization of Chip Intensities
[Figure: density function and cumulative distribution functions; F1(x) for the distribution of chip intensities, F2(x) for the reference distribution; a value x is mapped to y]
Formula: $x_{\mathrm{norm}} = F_2^{-1}(F_1(x))$
Assumes: gene distribution changes little
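A minimal Python sketch of quantile normalization follows, taking the reference distribution to be the average of the sorted chips. The function name is an assumption, all chips are assumed to hold the same number of probes, and ties are broken by position rather than averaged.

```python
def quantile_normalize(chips):
    """Give every chip the same empirical distribution.

    The reference distribution is the mean, rank by rank, of the sorted
    chips; each value is replaced by the reference value of its rank.
    """
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    reference = [sum(s[r] for s in sorted_chips) / len(chips) for r in range(n)]

    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])   # probe indices by rank
        out = [0.0] * n
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized

chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 9.0]]
print(quantile_normalize(chips))
```

After normalization every chip contains exactly the same set of values; only the assignment of values to probes differs between chips.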
Ratio-Intensity: Before
Ratio-Intensity: After
Critique of RMA Normalization





Distribution of signals looks more like exponential on log scale
No allowance for regional biases in BG
Quantile normalization is very strong: highly expressed genes won’t be equal
  Better to let higher end be roughly linear
Requires much memory - could be implemented differently
Model-based Estimates for Affymetrix Raw Data
Many Probes for One Gene
[Figure: gene sequence, 5´ to 3´, with multiple oligo probes; Perfect Match and Mismatch]
How to combine signals from multiple probes into a single gene abundance estimate?
Probe Variation


Individual probes don’t agree on fold changes
Probes for one gene may vary by two orders of magnitude on each chip
  GC content is most important factor in signal strength
[Figure: signal from 16 probes along one gene on one chip]
Competing Models 2005

GCOS (Affymetrix MicroArray Suite 5.0)
  Manufacturer’s software
dChip
  Li and Wong, HSPH
Bioconductor: affy package (RMA)
  Bolstad, Irizarry, Speed, et al
  Variants such as gcRMA, vsn
Probe-level analyses
  affyPLM, logit-t, …
Probe Measure Variation
•Typical probes differ by two orders of magnitude!
•GC content is most important factor
•RNA target folding also affects hybridization
Principles of MAS 5 method
First estimate background
•bg = MM (if physically possible, i.e. MM < PM)
•log(bg) = log(PM) - log(non-specific proportion) (if impossible)
•Non-specific proportion = max(SB, e)
•SB = Tukeybiweight(log(PM) - log(MM))
•Signal = Tukeybiweight(log(Adjusted PM))
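The steps above can be sketched in Python as follows. This is a simplified illustration rather than the Affymetrix implementation: it uses a one-step Tukey biweight with a median/MAD scale, works on the log2 scale, and treats the constants `c`, `eps`, `floor` (the slide's e) and `delta` as assumptions. The function names are hypothetical.

```python
import math
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that down-weights values
    far from the median, using the median absolute deviation as scale."""
    values = list(values)
    m = median(values)
    mad = median([abs(v - m) for v in values])
    num = den = 0.0
    for v in values:
        u = (v - m) / (c * mad + eps)
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        num += w * v
        den += w
    return num / den

def mas5_like_signal(pm, mm, floor=0.03, delta=2.0 ** -20):
    """Log2 signal for one probe set, following the slide's outline."""
    # SB: robust average of the log2 PM/MM ratios for the probe set.
    sb = tukey_biweight([math.log2(p / m) for p, m in zip(pm, mm)])
    nonspecific = max(sb, floor)
    adjusted = []
    for p, m in zip(pm, mm):
        # Use MM as background when it is below PM; otherwise derive
        # a background from PM and the non-specific proportion.
        bg = m if m < p else p / 2.0 ** nonspecific
        adjusted.append(max(p - bg, delta))
    return tukey_biweight([math.log2(v) for v in adjusted])

pm = [200.0, 400.0, 800.0, 100.0]
mm = [100.0, 200.0, 400.0, 300.0]   # last pair has MM > PM
print(mas5_like_signal(pm, mm))
```

Because the background is always forced below PM, the adjusted values stay positive and the logarithm is defined for every probe pair.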
Critique of MAS 5



principle
Not clear what an average of different probes should mean
Tukey bi-weight can be unstable when data cluster at either end – frequently the conditions here
No ‘learning’ based on cross-chip performance of individual probes
Motivation for multi-chip models:
Probe level data from spike-in study ( log scale )
note parallel trend of all probes
Courtesy of Terry Speed
Linear Models


Extension of linear regression
Essential features:
  Measurement errors independent of each other
    ‘random noise’
    Needs normalization to eliminate systematic variation
  Noise levels comparable at different levels of signal
  Small number of factors give predicted levels
    combine in linear function or simple algebraic form
Model for Probe Signal

Each probe signal is proportional to
  i) the amount of target sample – a
  ii) the affinity of the specific probe sequence to the target – f
NB: High affinity is not the same as Specificity
  Probe can give high signal to intended target and also to other transcripts
[Figure: probes 1, 2, 3 with affinities f1, f2, f3; chip 1 with amount a1, chip 2 with amount a2]
Multiplicative Model





For each gene, a set of probes $p_1, \ldots, p_k$
Each probe $p_j$ binds the gene with efficiency $f_j$
In each sample there is an amount $q_i$
Probe intensity should be proportional to $f_j \times q_i$
Always some noise!
Robust Statistics

Outlier: a measure that is far beyond the typical random variation
  common in biological measures
  10-15% in Affy probe sets
Robust methods try to fit the majority of data points
  Issue is to identify which points to down-weight or ignore
Median is very robust – but inefficient
Trimmed means are almost as robust and much more efficient
Robust Linear Models

Criterion of fit
  Least median squares
  Sum of weighted squares
  Least squares and throw out outliers
Method for finding fit
  High-dimensional search
  Iteratively re-weighted least squares
  Median Polish

Why Robust Models for GeneChips?
10% - 15% of individual signals in a probe set deviate greatly from pattern
Often outliers lie close together
Causes:
  Scratches
  Proximity to heating elements
  Uneven fluid flow

Li & Wong (dChip)

Model: $PM_{ij} = q_i f_j + e_{ij}$
  Original model (dChip 1.0) used $PM_{ij} - MM_{ij} = q_i f_j + e_{ij}$ by analogy with Affy MAS 4
Outlier removal:
  Fitting probes in one set on one chip
  Identify extreme residuals
  Remove
  Re-fit
  Iterate
Dark blue: PM values
Red: fitted values
Light blue: probe SD
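The multiplicative model PMij = qi fj + eij can be fitted by alternating least squares: hold the probe effects fixed and solve for the chip amounts, then the reverse. The Python sketch below shows only this fitting step under the identifiability constraint that the squared probe effects sum to the number of probes; the outlier-removal loop is omitted, and the function name and iteration count are assumptions. It is not the dChip code.

```python
import math

def fit_multiplicative(pm, n_iter=20):
    """Least-squares fit of pm[i][j] ~ q[i] * f[j] (rows: chips, columns: probes)."""
    n_chips, n_probes = len(pm), len(pm[0])
    f = [1.0] * n_probes

    def solve_q(f):
        # Least-squares amount for each chip, with the probe effects held fixed.
        ff = sum(v * v for v in f)
        return [sum(pm[i][j] * f[j] for j in range(n_probes)) / ff
                for i in range(n_chips)]

    for _ in range(n_iter):
        q = solve_q(f)
        qq = sum(v * v for v in q)
        # Least-squares effect for each probe, with the chip amounts held fixed.
        f = [sum(pm[i][j] * q[i] for i in range(n_chips)) / qq
             for j in range(n_probes)]
        # Rescale so that sum(f_j^2) equals the number of probes.
        scale = math.sqrt(sum(v * v for v in f) / n_probes)
        f = [v / scale for v in f]
    return solve_q(f), f

# Noise-free toy data: amounts 100 and 200, probe effects in ratio 1:2:3.
pm = [[50.0, 100.0, 150.0], [100.0, 200.0, 300.0]]
q, f = fit_multiplicative(pm)
print(q, f)
```

In a fuller version, probes or single values with extreme residuals would be dropped after each fit and the model re-fitted, as the outlier-removal steps describe.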
Critique of Li-Wong model


Model assumes that noise for all probes has same magnitude
All biological measurements exhibit intensity-dependent noise
Bolstad, Irizarry, Speed – (RMA)
For each probe set, take the log transform of $PM_{ij} = q_i f_j$:
$\log(PM_{ij}) = \log(q_i) + \log(f_j)$
i.e. fit the model:
$\mathrm{nlog}(PM_{ij} - bg) = a_i + b_j + e_{ij}$
where nlog() stands for logarithm after normalization, $a_i$ is the chip effect and $b_j$ is the probe effect
Fit this additive model by iteratively re-weighted least-squares or median polish
Critique: assumes probe noise is constant (homoscedastic) on log scale
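Median polish fits the additive model by repeatedly sweeping out row and column medians. The Python sketch below is a minimal version for one probe set, with chips as rows and probes as columns of background-corrected, normalized log intensities. The function name, the fixed iteration count and the toy data are assumptions.

```python
from statistics import median

def median_polish(y, n_iter=10):
    """Fit y[i][j] ~ overall + row[i] + col[j] by Tukey's median polish."""
    n_rows, n_cols = len(y), len(y[0])
    resid = [list(row) for row in y]
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep row medians out of the residuals into the row effects.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [v - m for v in col_eff]
        # Sweep column medians out of the residuals into the column effects.
        for j in range(n_cols):
            m = median([resid[i][j] for i in range(n_rows)])
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [v - m for v in row_eff]
    return overall, row_eff, col_eff, resid

# Toy log2 data: chip levels 8, 10, 12; probe effects -1, 0, +1;
# one outlying value (20 instead of 9) on the first chip.
y = [[7.0, 8.0, 20.0], [9.0, 10.0, 11.0], [11.0, 12.0, 13.0]]
overall, chip_eff, probe_eff, resid = median_polish(y)
expression = [overall + a for a in chip_eff]   # one estimate per chip
print(expression, resid[0][2])
```

The outlier ends up in the residuals instead of shifting the chip estimate, which is the reason for preferring a robust fit here.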
Comparison of Methods
[Figure: standard deviations of expression estimates on arrays, arranged in four groups of genes by increasing mean expression level]
Green: MAS5.0; Black: Li-Wong; Blue, Red: RMA
20 replicate arrays – variance should be small
Courtesy of Terry Speed
Steady Improvement

Affymetrix improves their model
  MAS P & A calls reasonable
  PLIER is a multi-chip model
MAS 5.0 estimation does a reasonable job on probe sets that are bright
  Abundant genes
dChip and RMA do better on genes that are less abundant
  Signalling proteins, transcription factors, etc
Expression Comparison 1 – MAS 4
Ratio-Intensity Plot comparing two chips from spike-in experiment
White dots represent unchanged genes
Red numbers flag spike-in genes
Courtesy of Terry Speed
Expression Comparison 2 – MAS 5
[Figure labels: t-scores; changed genes; theoretical t-distribution]
Expression Comparison 3 – Li-Wong
Courtesy of Terry Speed
Expression Comparison 4 - RMA
Courtesy of Terry Speed
Comparison on Real Data



These results are based on samples with 14 spike-ins - not realistic complexity
Choe et al (Genome Biology 2005) produced a spike-in data set with realistic complexity - found MAS5 PM correction worked well
Comparisons of biological variation vs technical variation in replicated samples suggest RMA defaults work best
Mix and Match Methods in affy





Background: rma, mas
Normalization: quantile, constant, …
PM-correction: none, …
Model: median polish, mas
Estimates <- expresso(cel.data,
    bgcorrect.method = "mas",
    normalize.method = "quantiles", …)
gcRMA: Estimating Non-specific Hybridization
Each probe has its own characteristic cross-hybridizations (NSH)
Mismatch is not a good estimate of NSH
GC content may predict NSH reasonably well
