Practical Issues in Microarray Data Analysis
Mark Reimers
National Cancer Institute
Bethesda, Maryland
Overview
Scales for analysis
Systematic errors
Sample outliers & experimental consistency
Useful graphics
Implications for experimental design
Platform consistency
Individual differences
Distribution of Signals
•Most genes are expressed at very low levels
•Even after log-transform the distribution is skewed
•NB: The signal-to-abundance ratio is NOT the same for different genes on the chip
Explanation of Distribution Shape
Left hand steep bell curve probably due to
measurement noise
Underlying real distribution probably even steeper
[Figure: abundances + noise = observed values]
Variation Between Chips
Technical variation: differences between
measures of transcript abundance in same
samples
Causes:
Sample preparation
Slide
Hybridization
Measurement
Individual variation: variation between samples
or individuals
Healthy individuals really do have consistently
different levels of gene expression!
Replicates in True Scale
Signals vary more between replicates at high end
Level of ‘noise’ increases with signal
[Figure: comparison of chips (Affy), chip 1 vs chip 2]
[Figure: Std Dev (SD) as a function of mean signal across all chips; red line is lowess fit]
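The SD-versus-signal comparison can be sketched in a few lines. This is a minimal illustration, assuming the replicate chips are available as equal-length lists of signals in the same gene order; the function name `mean_sd_per_gene` and the numbers are made up for the example, and the lowess smoothing step is left out.

```python
from statistics import mean, stdev

def mean_sd_per_gene(chips):
    """Return (mean, SD) per gene across replicate chips.

    chips: list of replicate chips, each a list of signals in the same gene order.
    """
    return [(mean(values), stdev(values)) for values in zip(*chips)]

# Made-up signals for four genes on three replicate chips (illustration only).
chips = [
    [50.0, 200.0, 1000.0, 8000.0],
    [55.0, 180.0, 1100.0, 9000.0],
    [45.0, 220.0, 900.0, 7000.0],
]

for m, s in mean_sd_per_gene(chips):
    print(f"mean={m:8.1f}  sd={s:7.1f}")
```

Plotting the SD column against the mean column, on real replicates, gives the kind of picture the slide describes.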
Replicates on Log Scale
Measures fold-change identically across genes
Noise at lower end is higher in log transform
[Figure: chip 1 vs chip 2 after log transform]
[Figure: SD vs signal after log transform]
Ratio-Intensity (R-I) plots
Log scale makes it convenient to represent fold-changes up or down symmetrically
R = log(Red/Green); I = (1/2)log(Red*Green)
Also known as MA (minus, add) plots
[Figure: R-I plot; axes: (log) Ratio vs (log) Intensity]
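The R and I coordinates follow directly from the definitions R = log(Red/Green) and I = (1/2)log(Red*Green). The sketch below assumes base-2 logs (the slide does not state the base) and uses made-up channel intensities; `ratio_intensity` is a name chosen for this example.

```python
import math

def ratio_intensity(red, green):
    """Return (R, I) for one spot, using base-2 logs (an assumption)."""
    r = math.log2(red / green)          # log ratio: the "minus" of the two logs
    i = 0.5 * math.log2(red * green)    # mean log intensity: half the "add"
    return r, i

# Made-up (red, green) intensities for three spots.
spots = [(1200.0, 300.0), (300.0, 1200.0), (500.0, 500.0)]

for red, green in spots:
    r, i = ratio_intensity(red, green)
    print(f"red={red:6.0f} green={green:6.0f}  R={r:+.2f}  I={i:.2f}")
```

Swapping the two channels flips the sign of R and leaves I unchanged, which is the symmetry the log scale provides.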
Variance Stabilization
Simple power transforms (Box-Cox) often nearly
stabilize variance
Durbin and Huber derived variance-stabilizing transform
from a theoretical model:
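$$ y = a + m \exp(h) + e $$
a is background; exp(h) is the multiplicative error; e is the static (additive) error
m is true signal; h and e are normally distributed with mean 0 and standard deviations s_h and s_e
Transform:
$$ g(y) = \log\left( (y - a) + \sqrt{(y - a)^2 + s_e^2 / s_h^2} \right) $$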
Could estimate a (background) and s_h/s_e empirically
In practice often best effect on variance comes from
parameters different from empirical estimates
Huber’s harder to estimate
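As a rough illustration of how such a transform behaves, the sketch below assumes the generalized-log form g(y) = log((y - a) + sqrt((y - a)^2 + c)), with c standing for the squared ratio of additive to multiplicative error SD. The parameter values are arbitrary placeholders, not estimates from any data set.

```python
import math

def glog(y, a, c):
    """Generalized log: log((y - a) + sqrt((y - a)^2 + c)).

    a: background offset; c: assumed to be s_e^2 / s_h^2 (must be > 0).
    """
    z = y - a
    return math.log(z + math.sqrt(z * z + c))

# Arbitrary illustrative parameters.
a, c = 20.0, 100.0

for y in (0.0, 20.0, 100.0, 1000.0, 100000.0):
    print(f"y={y:9.1f}  glog={glog(y, a, c):7.3f}")
```

For signals far above background the result approaches log(2(y - a)), so it behaves like a log transform at the high end, while staying finite and nearly linear near and below background.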
Box-Cox Transforms
•Simple power transformations (including log as the limiting case), e.g. cube root
•Often work almost as well as a variance-stabilizing transform
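A minimal sketch of the power-transform family, using the usual Box-Cox parameterization (y^lambda - 1)/lambda, with the log as the lambda = 0 case. The function name `box_cox` and the sample values are chosen for illustration only.

```python
import math

def box_cox(y, lam):
    """Box-Cox power transform of a positive value y.

    lam = 0 gives the natural log; lam = 1/3 is a (shifted, scaled) cube root.
    """
    if y <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

# Compare a weak transform (cube root) with a strong one (log).
for y in (10.0, 100.0, 1000.0, 10000.0):
    print(f"y={y:8.0f}  cube-root={box_cox(y, 1/3):7.2f}  log={box_cox(y, 0):5.2f}")
```

The smaller lambda is, the more strongly the high end is compressed, which is one way to think about the strong versus weak transforms discussed in these slides.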
Should you use Transforms?
Transforms change the list of genes that are
differentially regulated
The common argument is that bright genes have
higher variability
However you aren’t comparing different genes
Log transform expands the variability of repressed
genes
Strong transforms (e.g. log) most suitable for situations
where large fold-changes occur (e.g. cancers)
Weak transforms more suited for situations where
small changes are of interest (e.g. neurobiology)
Graphical methods
Aims:
Exploratory analysis, to see natural groupings, and to
detect outliers
To identify combinations of features that usefully
characterize samples or genes
Not really suitable for quantitative measures of
confidence
Principal Components Analysis (PCA)
Standard procedure of finding combinations with
greatest variance
Multi-dimensional scaling (MDS)
Represent distances between samples as distances in a two- or
three-dimensional plot
Easy to visualize
MDS Plots
Representing Groups
[Figure: Day 1 chips shown as a cluster diagram and as a multi-dimensional scaling plot]
Different Metrics – Same Scale
8 tumor; 2 normal tissue samples
Distances are similar in each tree
Normals close
Tree topologies appear different
Take with a grain of salt!
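The slide does not say which metrics were compared. As an illustration only, the sketch below assumes two common choices, Euclidean distance and one minus the Pearson correlation, computed between two samples' expression profiles; the profiles are made up.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation_distance(x, y):
    """1 - Pearson correlation; ignores overall scale and offset."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)

# Made-up profiles: sample_b is sample_a doubled.
sample_a = [1.0, 2.0, 3.0, 4.0]
sample_b = [2.0, 4.0, 6.0, 8.0]

print("euclidean:  ", euclidean(sample_a, sample_b))
print("correlation:", correlation_distance(sample_a, sample_b))
```

The two metrics can disagree sharply on the same pair of samples, which is one reason trees built from different metrics can have different topologies.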
Volcano Plot
Displays both biological importance and statistical significance
[Figure: volcano plot; axes: log2(fold change) vs log2(p-value) or t-score]
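The two coordinates of a volcano plot can be computed per gene as sketched below. This assumes the inputs are already log2 expression values, uses the t-score option for the vertical axis, and picks an unequal-variance (Welch) t-score, which is one choice among several. The function name and data are made up for the example.

```python
import math
from statistics import mean, variance

def volcano_point(group1, group2):
    """Return (log2 fold change, t-score) for one gene.

    Inputs are log2 expression values; each group needs at least two
    values and the two groups must not both have zero variance.
    """
    diff = mean(group2) - mean(group1)  # difference of logs = log2 fold change
    se = math.sqrt(variance(group1) / len(group1) + variance(group2) / len(group2))
    return diff, diff / se

# Made-up log2 values for one gene in two groups of three samples.
group1 = [7.0, 7.2, 6.8]
group2 = [9.0, 9.2, 8.8]

lfc, t = volcano_point(group1, group2)
print(f"log2 fold change = {lfc:.2f}, t-score = {t:.2f}")
```

Computing this pair for every gene and plotting one against the other gives the volcano shape.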
Quantile Plot
Plot sample t-scores against t-scores under random hypothesis
Statistically significant genes stand out
[Figure: quantile plot; axes: sample t-scores vs corresponding quantiles of t-distribution]
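The pairing of sorted sample scores with theoretical quantiles can be sketched as follows. Note the substitution: this example uses standard normal quantiles from the Python standard library instead of t-distribution quantiles, which is only a reasonable stand-in when the degrees of freedom are large. The scores are made up.

```python
from statistics import NormalDist

def quantile_pairs(scores):
    """Pair each sorted score with a theoretical quantile.

    Uses standard normal quantiles as a stand-in for t quantiles
    (an approximation; a real analysis would use the t-distribution).
    """
    n = len(scores)
    reference = NormalDist()
    ordered = sorted(scores)
    # (k + 0.5) / n is one common choice of plotting position.
    return [(reference.inv_cdf((k + 0.5) / n), s) for k, s in enumerate(ordered)]

# Made-up t-scores for eight genes; one is far larger than the rest.
scores = [-1.2, 0.3, 0.8, -0.4, 6.5, 0.1, -0.7, 1.1]

for theoretical, observed in quantile_pairs(scores):
    print(f"theoretical={theoretical:+.2f}  observed={observed:+.2f}")
```

Genes whose observed score lies far from the diagonal of theoretical versus observed are the ones that stand out.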
Systematic Variation
Intensity-dependent dye bias due to
‘quenching’
Stringency (specificity) of hybridization due to
ionic strength of hyb solution
How far hybridization reaction progresses
due to variation in mixing efficiency
Spatial variation in all of the above
Relevance for Experimental Designs
Balanced designs with several replicates built
in have smaller standard errors than
reference design with same number of chips
– Kerr & Churchill
Assuming error is random!
In practice very hard to deal with systematic errors in a symmetric design
No two slides with comparable fold-changes
[Figure: design diagram with Sample 1 to Sample 5]
Critique of Optimal Designs
Optimal for reduction of variance, if
All chips are good quality
No systematic errors – only random noise
In fact systematic error is almost as great as
random noise in many microarray experiments
With loop designs single chip failures cause
more loss of information than with reference
designs
Individual Variation
Numerous genes show high levels of inter-
individual variation
Level of variation depends on tissue also
Donors or experimental animals may be
infected or under social stress
Tissues are hypoxic or ischemic for variable
times before freezing
Frequent False Positives
Immunoglobulins and stress response
proteins are often 5-10X higher than typical in
one or two samples
Permutation p-values will be insignificant,
even if t-score appears large
[Figure: frequency of gene levels in Group 1 and Group 2]
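A permutation p-value for one gene can be sketched as below. This is an exact version that enumerates every relabelling of the samples, which is only practical for small groups; it assumes a Welch-style t-score as the statistic, and the function names and data are made up for the example.

```python
import math
from itertools import combinations
from statistics import mean, variance

def t_score(a, b):
    """Unequal-variance t-score of group b versus group a."""
    diff = mean(b) - mean(a)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def permutation_p_value(group1, group2):
    """Two-sided exact permutation p-value based on |t|."""
    pooled = list(group1) + list(group2)
    observed = abs(t_score(group1, group2))
    count = total = 0
    for idx in combinations(range(len(pooled)), len(group1)):
        chosen = set(idx)
        a = [v for i, v in enumerate(pooled) if i in chosen]
        b = [v for i, v in enumerate(pooled) if i not in chosen]
        total += 1
        # Small tolerance so ties with the observed split are counted.
        if abs(t_score(a, b)) >= observed - 1e-12:
            count += 1
    return count / total

# Made-up gene levels: one sample in Group 2 is far above the rest.
group1 = [1.0, 1.1, 0.9, 1.0, 1.2]
group2 = [1.1, 1.0, 8.0, 0.9, 1.0]

print("t-score:", round(t_score(group1, group2), 2))
print("permutation p-value:", permutation_p_value(group1, group2))
```

Because the outlying sample lands in either group equally often under relabelling, a difference driven by one or two samples does not yield a small permutation p-value.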