Practical Issues in Microarray Data Analysis

Mark Reimers
National Cancer Institute
Bethesda Maryland
Overview
• Scales for analysis
• Systematic errors
  • Sample outliers & experimental consistency
  • Useful graphics
  • Implications for experimental design
• Platform consistency
• Individual differences
Distribution of Signals
•Most genes are expressed at very low levels
•Even after log-transform the distribution is skewed
•NB: the signal-to-abundance ratio is NOT the same for different genes on the chip
Explanation of Distribution Shape
• Left-hand steep bell curve probably due to measurement noise
• Underlying real distribution probably even steeper
[Figure: abundances + noise = observed values]
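A minimal simulation of the "abundances + noise = observed values" picture may help. It assumes an exponential distribution for true abundances and Gaussian additive noise; both choices, and all constants, are illustrative assumptions rather than anything stated on the slide.

```python
import random

random.seed(0)

N_GENES = 10_000
MEAN_ABUNDANCE = 50.0  # assumed: most genes sit near zero
NOISE_SD = 20.0        # assumed additive measurement noise

# True abundances: a steep distribution piled up at the low end.
abundances = [random.expovariate(1.0 / MEAN_ABUNDANCE) for _ in range(N_GENES)]
# Measurement noise, independent of abundance.
noise = [random.gauss(0.0, NOISE_SD) for _ in range(N_GENES)]
# What the scanner would report under this toy model.
observed = [a + e for a, e in zip(abundances, noise)]


def fraction_below(values, threshold):
    """Fraction of values strictly below the threshold."""
    return sum(v < threshold for v in values) / len(values)


# The true abundances have no left tail; the observed values do,
# and that tail comes entirely from the noise term.
print(fraction_below(abundances, 0.0), fraction_below(observed, 0.0))
```

In this toy model the bell-shaped left side of the observed distribution is produced by the noise alone, which is the explanation the slide suggests.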
Variation Between Chips
• Technical variation: differences between measures of transcript abundance in same samples
  Causes:
  • Sample preparation
  • Slide
  • Hybridization
  • Measurement
• Individual variation: variation between samples or individuals
  • Healthy individuals really do have consistently different levels of gene expression!
Replicates in True Scale
• Signals vary more between replicates at high end
• Level of ‘noise’ increases with signal
[Figure: Comparison of chips (Affy), chip 1 vs. chip 2; Std Dev as a function of mean signal across all chips. Red line is lowess fit]
Replicates on Log Scale
• Measures fold-change identically across genes
• Noise at lower end is higher in log transform
[Figure: chip 1 vs. chip 2 after log transform; SD vs. signal after log transform]
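The contrast between the true scale and the log scale can be sketched with simulated replicates. The noise model (multiplicative plus additive error), the constants, and the floor applied before taking logs are all assumptions made for illustration.

```python
import math
import random
import statistics

random.seed(1)

N_CHIPS = 200        # many simulated replicates, only to get stable SD estimates
ADDITIVE_SD = 10.0   # assumed static (additive) noise
MULT_SD = 0.1        # assumed multiplicative noise (SD on the natural-log scale)
FLOOR = 1.0          # assumed floor so that the log is always defined


def replicate_signals(true_signal):
    """Simulate one gene measured on N_CHIPS replicate chips."""
    return [
        max(true_signal * math.exp(random.gauss(0.0, MULT_SD))
            + random.gauss(0.0, ADDITIVE_SD), FLOOR)
        for _ in range(N_CHIPS)
    ]


def sd_true_and_log(true_signal):
    """Return (SD on the true scale, SD after log2 transform)."""
    values = replicate_signals(true_signal)
    return (statistics.stdev(values),
            statistics.stdev([math.log2(v) for v in values]))


low_raw, low_log = sd_true_and_log(30.0)      # dim gene
high_raw, high_log = sd_true_and_log(3000.0)  # bright gene
print(low_raw, high_raw, low_log, high_log)
```

Under these assumptions the bright gene has the larger SD on the true scale, while the dim gene has the larger SD after the log transform, matching the two replicate slides.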
Ratio-Intensity (R-I) plots
• Log scale makes it convenient to represent fold-changes up or down symmetrically
• R = log(Red/Green); I = (1/2) log(Red*Green)
• a.k.a. MA (minus, add) plots
[Figure: (log) Ratio vs. (log) Intensity]
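The R and I coordinates are simple to compute per spot. The sketch below uses base-2 logarithms, which is an assumption (the slide does not state the base), and the function name is hypothetical.

```python
import math


def ratio_intensity(red, green):
    """Return (R, I) for one spot: R = log2(red/green), I = 0.5 * log2(red*green)."""
    r = math.log2(red / green)
    i = 0.5 * math.log2(red * green)
    return r, i


# A two-fold increase and a two-fold decrease at the same overall brightness.
up = ratio_intensity(2000.0, 1000.0)
down = ratio_intensity(1000.0, 2000.0)
print(up, down)
```

The two spots get ratios of equal size and opposite sign at the same intensity, which is the symmetry the log scale provides.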
Variance Stabilization
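• Simple power transforms (Box-Cox) often nearly stabilize variance
• Durbin and Huber derived a variance-stabilizing transform from a theoretical model:
  • $y = \alpha + \mu e^{\eta} + \epsilon$, where $\alpha$ is the background, $e^{\eta}$ the multiplicative error and $\epsilon$ the static error
  • $\mu$ is the true signal; $\eta$ and $\epsilon$ are normally distributed with mean 0 and standard deviations $\sigma_\eta$ and $\sigma_\epsilon$
  • Transform:
    $g(y) = \log\left( y - \alpha + \sqrt{(y - \alpha)^2 + \sigma_\epsilon^2 / \sigma_\eta^2} \right)$
• Could estimate $\alpha$ (background) and $\sigma_\eta / \sigma_\epsilon$ empirically
• In practice often best effect on variance comes from parameters different from empirical estimates
  • Huber’s transform is harder to estimate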
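The transform on this slide can be written as a short function. This is a sketch only: the parameter names (`background`, `sd_static`, `sd_mult`) are hypothetical, and the example values are made up rather than estimated from data.

```python
import math


def glog(y, background, sd_static, sd_mult):
    """Generalized-log transform: log(y - a + sqrt((y - a)^2 + c)).

    c is the squared ratio of the static (additive) error SD to the
    multiplicative error SD, as in the transform on the slide.
    """
    c = (sd_static / sd_mult) ** 2
    shifted = y - background
    return math.log(shifted + math.sqrt(shifted * shifted + c))


# Illustrative parameter values (assumptions, not empirical estimates).
A, SD_STATIC, SD_MULT = 100.0, 10.0, 0.1

for signal in (50.0, 100.0, 200.0, 1e6):
    print(signal, glog(signal, A, SD_STATIC, SD_MULT))
```

Two properties are worth noting: the function stays defined for signals below the background, where a plain log would fail, and for bright signals it approaches log(2(y - a)), so it behaves like an ordinary log transform at the high end.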
Box-Cox Transforms
•Simple power transformations (including log as extreme case), e.g. cube root
•Often work almost as well as variance-stabilizing transform
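A one-parameter Box-Cox transform, in its usual textbook form, is sketched below to show how the log arises as the limiting case. The function name is hypothetical and the inputs are illustrative.

```python
import math


def box_cox(y, lam):
    """Box-Cox power transform: (y**lam - 1) / lam, or log(y) when lam == 0."""
    if y <= 0:
        raise ValueError("Box-Cox requires a positive signal")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


cube_root_like = box_cox(8.0, 1.0 / 3.0)  # cube-root transform, rescaled
near_log = box_cox(5.0, 1e-6)             # tiny exponent: close to log(5)
print(cube_root_like, near_log, math.log(5.0))
```

As the exponent shrinks toward zero the transform converges to the log, which is why the log counts as the extreme member of the family.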
Should you use Transforms?
• Transforms change the list of genes that are differentially regulated
• The common argument is that bright genes have higher variability
  • However, you aren’t comparing different genes
• Log transform expands the variability of repressed genes
• Strong transforms (e.g. log) most suitable for situations where large fold-changes occur (e.g. cancers)
• Weak transforms more suited for situations where small changes are of interest (e.g. neurobiology)
Graphical methods
• Aims:
  • Exploratory analysis, to see natural groupings, and to detect outliers
  • To identify combinations of features that usefully characterize samples or genes
  • Not really suitable for quantitative measures of confidence
• Principal Components Analysis (PCA)
  • Standard procedure of finding combinations with greatest variance
• Multi-dimensional scaling (MDS)
  • Represent distances between samples as distances in two or three dimensions
  • Easy to visualize
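MDS can be sketched in a few lines with NumPy. The version below is classical (Torgerson) MDS, which is one of several MDS variants; the slides do not say which one was used, so treat that choice, and the function names, as assumptions.

```python
import numpy as np


def pairwise_euclidean(x):
    """Euclidean distance matrix between the rows of x (samples)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def classical_mds(dist, n_dims=2):
    """Embed samples in n_dims dimensions from a distance matrix."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    # Double-centering turns squared distances into a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Keep the largest eigenvalues; clip tiny negatives from rounding.
    order = np.argsort(eigvals)[::-1][:n_dims]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Simulated expression matrix: 6 samples x 50 genes (made-up data).
rng = np.random.default_rng(0)
expression = rng.normal(size=(6, 50))
coords = classical_mds(pairwise_euclidean(expression), n_dims=2)
print(coords.shape)  # one 2-D point per sample, ready for a scatter plot
```

With Euclidean distances, classical MDS gives the same picture as plotting the leading principal components, which is one reason the two methods often appear side by side.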
MDS Plots
Representing Groups
[Figure: Day 1 chips shown as a cluster diagram and as a multi-dimensional scaling plot]
Different Metrics – Same Scale
• 8 tumor; 2 normal tissue samples
• Distances are similar in each tree
  • Normals close
• Tree topologies appear different
• Take with a grain of salt!
Volcano Plot
• Displays both biological importance and statistical significance
[Figure axes: log2(p-value) or t-score vs. log2(fold change)]
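The two coordinates of a volcano plot can be computed per gene as sketched below. The sketch uses the t-score option for the vertical axis and a Welch (unequal-variance) t statistic; the Welch form, the function name, and the toy data are assumptions.

```python
import math
import statistics


def volcano_point(group1, group2):
    """Return (log2 fold change, Welch t-score) for one gene.

    Inputs are positive signals on the true (untransformed) scale.
    """
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    log2_fc = math.log2(m2 / m1)
    t = (m2 - m1) / math.sqrt(v1 / len(group1) + v2 / len(group2))
    return log2_fc, t


# Made-up measurements for one gene in two groups of five samples.
control = [100.0, 110.0, 90.0, 105.0, 95.0]
treated = [200.0, 220.0, 180.0, 210.0, 190.0]
x, y = volcano_point(control, treated)
print(x, y)
```

Repeating this for every gene and plotting the pairs gives the volcano shape: genes far from the center horizontally have large fold changes, and genes far from zero vertically are statistically well supported.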
Quantile Plot
• Plot sample t-scores against t-scores under random hypothesis
• Statistically significant genes stand out
[Figure axes: sample t-scores vs. corresponding quantiles of t-distribution]
Systematic Variation
• Intensity-dependent dye bias due to ‘quenching’
• Stringency (specificity) of hybridization due to ionic strength of hyb solution
• How far hybridization reaction progresses due to variation in mixing efficiency
• Spatial variation in all of the above
Relevance for Experimental Designs
• Balanced designs with several replicates built in have smaller standard errors than reference design with same number of chips – Kerr & Churchill
• Assuming error is random!
• In practice very hard to deal with systematic errors in a symmetric design
  • No two slides with comparable fold-changes
[Figure: design diagram with Sample 1, Sample 2, Sample 3, Sample 4, Sample 5]
Critique of Optimal Designs
• Optimal for reduction of variance, if
  • All chips are good quality
  • No systematic errors – only random noise
  • In fact systematic error is almost as great as random noise in many microarray experiments
• With loop designs single chip failures cause more loss of information than with reference designs
Individual Variation
• Numerous genes show high levels of inter-individual variation
• Level of variation depends on tissue also
• Donors or experimental animals may be infected, or under social stress
• Tissues are hypoxic or ischemic for variable times before freezing
Frequent False Positives
• Immunoglobulins and stress response proteins often 5-10X higher than typical in one or two samples
• Permutation p-values will be insignificant, even if t-score appears large
[Figure: frequency vs. gene levels for Group 1 and Group 2]
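The permutation side of this point can be sketched with an exhaustive relabeling test. The data are made up (one sample about 8X higher than the rest), the Welch t statistic is an assumed choice, and the function names are hypothetical. The sketch only illustrates that a group difference driven by a single sample does not earn a small permutation p-value.

```python
import itertools
import math
import statistics


def t_score(a, b):
    """Welch t statistic for the difference in means, b minus a."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se


def permutation_p_value(a, b):
    """Two-sided p-value from all relabelings of the pooled samples."""
    pooled = a + b
    observed = abs(t_score(a, b))
    hits = total = 0
    for idx in itertools.combinations(range(len(pooled)), len(b)):
        chosen = set(idx)
        perm_b = [pooled[i] for i in idx]
        perm_a = [v for i, v in enumerate(pooled) if i not in chosen]
        total += 1
        # Small tolerance so the observed labeling always counts itself.
        if abs(t_score(perm_a, perm_b)) >= observed - 1e-12:
            hits += 1
    return hits / total


group1 = [100.0, 105.0, 95.0, 102.0, 98.0]
group2 = [101.0, 99.0, 104.0, 96.0, 800.0]  # one outlying sample
p_value = permutation_p_value(group1, group2)
print(p_value)
```

Because the outlying sample lands in one group or the other under every relabeling, many relabelings look as extreme as the real one, and the permutation p-value stays large.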