Probe Level Analysis of Affymetrix™ Data
Mark Reimers, NCI
Outline
Design of Affy probesets
Background
Normalization
Non-specific hybridization
Estimation
Comparison of Methods
Affymetrix GeneChip® Probe Arrays
[Figure: GeneChip probe array (1.28 cm) and an enlarged hybridized probe cell (20 µm), showing an oligonucleotide probe and single-stranded, fluorescently labeled DNA target]
Each probe cell or feature contains millions of copies of a specific oligonucleotide probe
Over 400,000 different probes complementary to genetic information of interest
[Figure: image of hybridized probe array]
Affymetrix Probe Design
[Figure: published gene sequence, 5´ to 3´, with its probes in Perfect Match and Mismatch rows]
Multiple (11-20) 25-base oligonucleotide probes
PM is exactly complementary to published sequence
MM differs from PM at the 13th (middle) base
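A minimal sketch of the PM/MM relationship. It assumes the change at the 13th base is a swap to the complementary base, which the slide does not state explicitly; the sequence and the helper name `mismatch_probe` are made up for illustration.

```python
# Complement map for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mismatch_probe(pm: str) -> str:
    """Build an MM probe from a 25-base PM probe by changing only base 13.

    Assumption: the 13th base is replaced by its complement.
    """
    if len(pm) != 25:
        raise ValueError("expected a 25-base probe")
    middle = 12  # the 13th base, using zero-based indexing
    return pm[:middle] + COMPLEMENT[pm[middle]] + pm[middle + 1:]

pm = "ATGC" * 6 + "A"  # made-up 25-mer, not a real probe sequence
mm = mismatch_probe(pm)
```

PM and MM then differ at exactly one position, so a target that binds MM nearly as well as PM points to non-specific hybridization.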
Chip Layout
Typical chips are square: 640x640 (U95A), 712x712 (U133) or 1164x1164 (Plus2)
Older chips placed all probes for one gene in a row
Modern chips distribute probes according to sequence, not gene
Chip Nomenclature
HGU133A - Human Genome: Unigene build 133, first chip
PM - ‘perfect match’
MM - ‘mismatch’
Control sequence - sequence from unrelated organism
Signal - intensity; doesn’t translate directly to abundance
Cross-hybridization - binding of sequences other than target
Affymetrix Background Adjustment and Normalization
What’s the Issue?
Background: some Affy chips show consistently higher values for the lowest signals (presumably absent) than others
Background may vary over a chip
Normalization: distribution of probe signals may differ between chips, independent of background adjustment
PM and MM may be shifted differently
Probe Intensities in 23 Replicates
Approaches to Background
Subtract common estimate of background
Fit local background across chip and subtract - MAS 5.0
Consider background as random variable
Use statistical theory to derive background correction
RMA ‘Bayesian’ BG Correction
Each S = BG + Intensity + e
BG randomly sampled from Normal distribution
Intensity randomly sampled from exponential distribution
Estimate mean and SD of BG distribution by fitting values below mode of signal distribution
Estimate Intensity, conditional on S, by integrating over possible values of BG
$\hat{I} = \frac{1}{K} \int_0^{S} (S - x)\, dN_{\mu,\sigma}(x)$
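A hedged sketch of this kind of correction. It uses the standard closed form for the posterior mean of an exponential signal added to a Normal background, with the background restricted to the range 0 to S; the extra error term e is ignored. The parameter names `mu`, `sigma`, `alpha` and the example values are assumptions for illustration, not fitted values.

```python
import math

def normal_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rma_style_correct(s: float, mu: float, sigma: float, alpha: float) -> float:
    """Expected intensity given observed signal s.

    Model: s = BG + intensity, BG ~ Normal(mu, sigma^2),
    intensity ~ Exponential(rate alpha), with 0 <= BG <= s.
    """
    a = s - mu - alpha * sigma ** 2
    b = sigma
    numerator = normal_pdf(a / b) - normal_pdf((s - a) / b)
    denominator = normal_cdf(a / b) + normal_cdf((s - a) / b) - 1.0
    return a + b * numerator / denominator

# Made-up background parameters: mean 100, SD 20, exponential rate 0.01.
corrected = [rma_style_correct(s, 100.0, 20.0, 0.01) for s in (50.0, 100.0, 500.0, 1000.0)]
```

Unlike plain subtraction of the mean background, the corrected value stays positive even when S is below the background mean, which matters because the next step takes logarithms.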
Approaches to Normalization
Simple: find average of each chip; divide all values by chip average
MAS5: trimmed mean
Invariant set: find subset of probes in almost same rank order in each chip
Quantile normalization: fit to average quantiles across experiment
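The first two approaches can be sketched as a single scaling step. This is an illustration only: the `target` value, the trimming fraction and the function names are assumptions, not the settings of any particular software.

```python
from statistics import mean

def trimmed_mean(values, trim=0.02):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    return mean(ordered[k:len(ordered) - k])

def scale_normalize(chip, target=1.0, trim=0.0):
    """Rescale one chip so its (trimmed) mean equals `target`."""
    centre = trimmed_mean(chip, trim) if trim > 0 else mean(chip)
    factor = target / centre
    return [value * factor for value in chip]

chip = [10.0, 20.0, 30.0, 40.0, 1000.0]          # made-up intensities, one bright outlier
by_mean = scale_normalize(chip)                  # divide by the chip average
by_trimmed = scale_normalize(chip, trim=0.2)     # centre ignores the extremes
```

With the plain mean the single bright value drags every other probe down; the trimmed version is driven by the bulk of the chip.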
Probes on Different Chips
Plots of two Affymetrix chips against the experiment means
MAS 5.0
Plot probes from each chip against common base-line chip
Fit regression line to middle 98% of probes
Invariant Set (Li-Wong) Method
Select baseline chip X
For each other chip Y:
Select probes $p_1, \dots, p_K$ ($K \sim 10000$), such that $p_1 < p_2 < \dots < p_K$ in both chips
Fit running median through points $\{(x_{p_1}, y_{p_1}), \dots, (x_{p_K}, y_{p_K})\}$
Repeat
Quantile Method (RMA)
Distributions of probe intensities vary substantially among replicate chips
This cannot be even approximately resolved by any linear transformation
Drastic solution: ‘shoehorn’ all probe intensities into same distribution
Ideal distribution is taken as average of all chips
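A minimal sketch of the quantile method, assuming every chip has the same number of probes and ignoring ties. The function name and the toy data are made up for illustration.

```python
def quantile_normalize(chips):
    """Force every chip onto the same intensity distribution.

    chips: list of equal-length lists, one list of probe intensities per chip.
    The reference distribution is the mean of the k-th smallest value across chips.
    """
    n_probes = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    reference = [
        sum(sc[k] for sc in sorted_chips) / len(chips) for k in range(n_probes)
    ]
    normalized = []
    for chip in chips:
        # Probe indices ordered from lowest to highest intensity on this chip.
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized

chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 9.0]]  # two toy chips, three probes each
normalized = quantile_normalize(chips)
```

After this step each chip has the same set of values; only the assignment of values to probes differs between chips.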
Quantile Normalization
[Figure: distribution of chip intensities, $F_1(x)$, mapped onto the reference distribution, $F_2(x)$; panels show the density function and the cumulative distribution function]
Formula: $x_{norm} = F_2^{-1}(F_1(x))$
Assumes: gene distribution changes little
Ratio-Intensity: Before
Ratio-Intensity: After
Critique of RMA Normalization
Distribution of signals looks more like exponential on log scale
No allowance for regional biases in BG
Quantile normalization is very strong: highly expressed genes won’t be equal
Better to let higher end be roughly linear
Requires much memory - could be implemented differently
Model-based Estimates for Affymetrix Raw Data
Many Probes for One Gene
[Figure: gene sequence, 5´ to 3´, with multiple oligo probes in Perfect Match and Mismatch rows]
How to combine signals from multiple probes into a single gene abundance estimate?
Probe Variation
Individual probes don’t agree on fold changes
Probes for one gene may vary by two orders of magnitude on each chip
CG content is most important factor in signal strength
[Figure: signal from 16 probes along one gene on one chip]
Competing Models 2005
GCOS (Affymetrix Microarray Suite 5.0) - manufacturer’s software
dChip - Li and Wong, HSPH
Bioconductor: affy package (RMA) - Bolstad, Irizarry, Speed, et al
Variants such as gcRMA, vsn
Probe-level analyses - affyPLM, logit-t, …
Probe Measure Variation
•Typical probes are two orders of magnitude different!
•CG content is most important factor
•RNA target folding also affects hybridization
[Figure: probe signals, axis from 0 to 3×10^4]
Principles of MAS 5 method
First estimate background
•bg = MM (if physically possible)
•log(bg) = log(PM)-log(non-specific proportion) (if impossible)
•Non-specific proportion = max(SB, e)
•SB = Tukeybiweight(log(PM)-log(MM))
•Signal = Tukeybiweight(log(Adjusted PM))
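A rough sketch of these steps, not the Affymetrix implementation. The one-step Tukey biweight below uses the tuning constants c = 5 and a small epsilon, which are assumptions here. `mas5_like_signal` is a hypothetical, simplified reading of the slide in which SB is treated as a log2 ratio and `floor` plays the role of the small constant in max(SB, e).

```python
from math import log2
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that ignores far-out values."""
    m = median(values)
    mad = median([abs(v - m) for v in values])
    weights = []
    for v in values:
        u = (v - m) / (c * mad + eps)
        weights.append((1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def mas5_like_signal(pm, mm, floor=0.03):
    """Simplified probe-set signal following the steps on the slide."""
    sb = tukey_biweight([log2(p) - log2(q) for p, q in zip(pm, mm)])
    adjusted = []
    for p, q in zip(pm, mm):
        # MM is the background when it is below PM ("physically possible");
        # otherwise use a fraction of PM set by the probe-set-wide SB.
        bg = q if q < p else p / 2 ** max(sb, floor)
        adjusted.append(p - bg)
    return 2 ** tukey_biweight([log2(a) for a in adjusted])

pm = [200.0, 400.0, 800.0]   # made-up PM intensities for one probe set
mm = [100.0, 100.0, 100.0]   # made-up MM intensities
signal = mas5_like_signal(pm, mm)
```

The biweight gives zero weight to values more than about five median absolute deviations from the median, so a single wild probe does not move the estimate.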
Critique of MAS 5 principle
Not clear what an average of different probes should mean
Tukey bi-weight can be unstable when data cluster at either end - frequently the conditions here
No ‘learning’ based on cross-chip performance of individual probes
Motivation for multi-chip models:
Probe level data from spike-in study (log scale)
note parallel trend of all probes
Courtesy of Terry Speed
Linear Models
Extension of linear regression
Essential features:
Measurement errors independent of each other
‘random noise’
Needs normalization to eliminate systematic variation
Noise levels comparable at different levels of signal
Small number of factors give predicted levels
combine in linear function or simple algebraic form
Model for Probe Signal
Each probe signal is proportional to
i) the amount of target sample – a
ii) the affinity of the specific probe sequence to the target – f
NB: High affinity is not the same as Specificity
Probe can give high signal to intended target and also to other transcripts
[Figure: probes 1, 2, 3 with affinities f1, f2, f3 on chip 1 (amount a1) and chip 2 (amount a2)]
Multiplicative Model
For each gene, a set of probes $p_1, \dots, p_k$
Each probe $p_j$ binds the gene with efficiency $f_j$
In each sample there is an amount $q_i$
Probe intensity should be proportional to $f_j \times q_i$
Always some noise!
Robust Statistics
Outlier: a measure that is far beyond the typical random variation
common in biological measures
10-15% in Affy probe sets
Robust methods try to fit the majority of data points
Issue is to identify which points to down-weight or ignore
Median is very robust – but inefficient
Trimmed means are almost as robust and much more efficient
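A small illustration of the robustness and efficiency trade-off, using made-up numbers and a seeded simulation. The sample size of 11 and trimming one value from each end are arbitrary choices for the sketch.

```python
import random
from statistics import mean, median, pvariance

def trimmed_mean(values, n_trim=1):
    """Mean after dropping the n_trim smallest and n_trim largest values."""
    ordered = sorted(values)
    return mean(ordered[n_trim:len(ordered) - n_trim])

# Robustness: one gross outlier among 11 probe-like values.
contaminated = [9.7, 9.8, 9.9, 10.0, 10.0, 10.0, 10.1, 10.1, 10.2, 10.3, 100.0]
robust_summary = {
    "mean": mean(contaminated),
    "median": median(contaminated),
    "trimmed": trimmed_mean(contaminated),
}

# Efficiency: how much each estimator varies over clean Gaussian samples.
random.seed(0)
estimates = {"mean": [], "median": [], "trimmed": []}
for _ in range(2000):
    sample = [random.gauss(10.0, 1.0) for _ in range(11)]
    estimates["mean"].append(mean(sample))
    estimates["median"].append(median(sample))
    estimates["trimmed"].append(trimmed_mean(sample))
spread = {name: pvariance(vals) for name, vals in estimates.items()}
```

In `robust_summary` only the mean is pulled toward the outlier, while `spread` shows the median fluctuating more than the trimmed mean on clean data.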
Robust Linear Models
Criterion of fit
Least median squares
Sum of weighted squares
Least squares and throw out outliers
Method for finding fit
High-dimensional search
Iteratively re-weighted least squares
Median Polish
Why Robust Models for GeneChips?
10% - 15% of individual signals in a probe set deviate greatly from pattern
Often outliers lie close together
Causes:
Scratches
Proximity to heating elements
Uneven fluid flow
Li & Wong (dChip)
Model:
$PM_{ij} = q_i f_j + e_{ij}$
- Original model (dChip 1.0) used $PM_{ij} - MM_{ij} = q_i f_j + e_{ij}$ by analogy with Affy MAS 4
Outlier removal:
Fitting probes in one set on one chip
Identify extreme residuals
Remove
Re-fit
Iterate
[Figure legend: dark blue: PM values; red: fitted values; light blue: probe SD]
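A minimal sketch of fitting the multiplicative model by alternating least squares, without the outlier-removal loop. The scaling constraint (sum of squared f equal to the number of probes), the function name and the toy data are assumptions for illustration, not dChip code.

```python
def fit_multiplicative(pm, n_iter=50):
    """Fit pm[i][j] ~ q[i] * f[j] by alternating least squares.

    pm: rows are chips (samples), columns are probes in one probe set.
    Returns (q, f) with f scaled so that sum(f_j^2) equals the number of probes.
    """
    n_chips, n_probes = len(pm), len(pm[0])
    f = [1.0] * n_probes
    q = [0.0] * n_chips
    for _ in range(n_iter):
        # Best q for fixed f: regress each chip's row on f.
        ff = sum(v * v for v in f)
        q = [sum(pm[i][j] * f[j] for j in range(n_probes)) / ff for i in range(n_chips)]
        # Best f for fixed q: regress each probe's column on q.
        qq = sum(v * v for v in q)
        f = [sum(pm[i][j] * q[i] for i in range(n_chips)) / qq for j in range(n_probes)]
        # Remove the scale ambiguity between q and f.
        scale = (n_probes / sum(v * v for v in f)) ** 0.5
        f = [v * scale for v in f]
    ff = sum(v * v for v in f)
    q = [sum(pm[i][j] * f[j] for j in range(n_probes)) / ff for i in range(n_chips)]
    return q, f

true_q = [100.0, 200.0, 400.0]   # made-up abundances on three chips
true_f = [0.5, 1.0, 1.5]         # made-up probe affinities
pm = [[qi * fj for fj in true_f] for qi in true_q]
q, f = fit_multiplicative(pm)
```

Only the products q_i f_j are identified, so the fitted q are abundances up to a common factor; ratios between chips are what carry meaning.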
Critique of Li-Wong model
Model assumes that noise for all probes has same magnitude
All biological measurements exhibit intensity-dependent noise
Bolstad, Irizarry, Speed (RMA)
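For each probe set, take the log transform of $PM_{ij} = q_i f_j$:
$\log(PM_{ij}) = \log(q_i) + \log(f_j)$
i.e. fit the model:
$\mathrm{nlog}(PM_{ij} - bg) = a_i + b_j + e_{ij}$
where nlog() stands for logarithm after normalization, and $a_i$ and $b_j$ are the log-scale sample and probe effects corresponding to $\log(q_i)$ and $\log(f_j)$
Fit this additive model by iteratively re-weighted least squares or median polish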
Critique: assumes probe noise is constant (homoscedastic) on log scale
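A minimal sketch of median polish on a small chips-by-probes table of log-scale values. The iteration count, the function name and the toy data (including the planted outlier) are assumptions for illustration.

```python
from statistics import median

def median_polish(y, n_iter=10):
    """Decompose y[i][j] into overall + row[i] + col[j] + resid[i][j] using medians.

    Rows are chips, columns are probes; overall + row[i] is the chip-level estimate.
    """
    n_rows, n_cols = len(y), len(y[0])
    resid = [row[:] for row in y]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep row medians out of the residuals.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [v - m for v in col_eff]
        # Sweep column medians out of the residuals.
        for j in range(n_cols):
            m = median([resid[i][j] for i in range(n_rows)])
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [v - m for v in row_eff]
    return overall, row_eff, col_eff, resid

chip_effects = [8.0, 9.0, 10.0]          # made-up log-scale abundances
probe_effects = [-1.0, 0.0, 1.0, 0.0]    # made-up log-scale probe affinities
y = [[a + b for b in probe_effects] for a in chip_effects]
y[0][1] += 5.0                           # one outlying probe on the first chip
overall, row_eff, col_eff, resid = median_polish(y)
expression = [overall + r for r in row_eff]
```

The planted outlier ends up in the residuals rather than in the chip estimate, which is the behaviour a robust fit is meant to give.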
Comparison of Methods
Green: MAS5.0; Black: Li-Wong; Blue, Red: RMA
20 replicate arrays – variance should be small
Standard deviations of expression estimates on arrays arranged in four groups of genes by increasing mean expression level
Courtesy of Terry Speed
Steady Improvement
Affymetrix improves their model
PLIER is a multi-chip model
MAS P & A calls reasonable
MAS 5.0 estimation does a reasonable job on probe sets that are bright
Abundant genes
dChip and RMA do better on genes that are less abundant
Signalling proteins, transcription factors, etc
Expression Comparison 1 – MAS 4
Ratio-Intensity Plot comparing two chips from spike-in experiment
White dots represent unchanged genes
Red numbers flag spike-in genes
Courtesy of Terry Speed
Expression Comparison 2 – MAS 5
[Figure labels: t-scores; changed genes; theoretical t-distribution]
Expression Comparison 3 – Li-Wong
Courtesy of Terry Speed
Expression Comparison 4 - RMA
Courtesy of Terry Speed
Comparison on Real Data
These results are based on samples with 14 spike-ins - not realistic complexity
Choe et al (Genome Biology 2005) produced a spike-in data set with realistic complexity - found MAS5 PM correction worked well
Comparisons of biological variation vs technical variation in replicated samples suggest RMA defaults work best
Mix and Match Methods in affy
Background: rma, mas
Normalization: quantile, constant, …
PM-correction: none,
Model: median polish, mas
Estimates <- expresso(cel.data,
    bgcorrect.method = "mas",
    normalize.method = "quantiles", …
gcRMA:
Estimating Non-specific Hybridization
Each probe has its own characteristic cross-hybridizations (NSH)
Mismatch is not a good estimate of NSH
GC content may predict NSH reasonably well