S4_Bias_Normalisation


A Quantitative Overview to Gene Expression Profiling in Animal Genetics
Analysis of (cDNA) Microarray Data:
Part I. Sources of Bias and Normalisation
Armidale Animal Breeding Summer Course, UNE, Feb. 2006
MICROARRAY ANALYSIS
My (Educated?) View
1. Data included in GEXEX
a. Whole data stored and “securely” available
b. GP3xCLI on each hybridisation
2. Relaxed data acquisition criteria
a. Signal to Noise > 1.00 (more relaxed criteria exist)
b. Mean to Median > 0.85 (Tran et al. 2002)
3. Data Normalisation
4. Mixed-Model Equations
a. Check Residuals (plot Residuals vs Predicted)
b. Check REML estimates of Variance Components
c. Proportion of Total Variance due to Gene x Variety
5. Process Gene x Treatment BLUPs → Differentially Expressed Genes
a. t-statistics → Z-score → P-value
b. Mixtures of Distributions → Posterior Probabilities
6. Process Differentially Expressed genes
a. Hierarchical clustering
b. Gene ontology analysis
MICROARRAY ANALYSIS
BASIC PIECES FOR SIGNAL DETECTION
• Foreground RED and GREEN: Rf, Gf
• Background RED and GREEN: Rb, Gb
• Background-corrected RED and GREEN (True Signals!):
R = Rf – Rb
G = Gf – Gb
• Log-transformed: Log2(R), Log2(G)
• Difference: “Minus”
M = Log2(R) – Log2(G) = Log2(R/G)
• Mean: “Average”
A = 0.5 * ( Log2(R) + Log2(G) ) = 0.5 * Log2(R*G)
• MA-Plots …to come
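The quantities above can be sketched in a few lines of Python. This is only an illustration: the function name `ma_values` and the example intensities are made up for this sketch, and spots where either background-corrected channel is not positive are simply returned as `None` (one possible convention, not a prescribed one).

```python
import math

def ma_values(rf, rb, gf, gb):
    """Return (M, A) for one spot, or None if a channel is not above background.

    rf, rb: RED foreground and background; gf, gb: GREEN foreground and background.
    """
    r = rf - rb  # background-corrected RED
    g = gf - gb  # background-corrected GREEN
    if r <= 0 or g <= 0:
        return None  # Log2 undefined: assumption here is to flag the spot
    log_r, log_g = math.log2(r), math.log2(g)
    m = log_r - log_g          # "Minus": Log2(R/G)
    a = 0.5 * (log_r + log_g)  # "Average": 0.5 * Log2(R*G)
    return m, a

# Hypothetical spots: (Rf, Rb, Gf, Gb)
spots = [(1200.0, 176.0, 640.0, 128.0), (300.0, 310.0, 500.0, 90.0)]
ma = [ma_values(*s) for s in spots]
```

The first spot has R = 1024 and G = 512, so it sits one Log2 unit above zero on the M axis; the second has RED below background and is flagged.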
Data Acquisition Criteria
The Red/Green Intensities can be spatially biased
Data Acquisition Criteria
The Red/Green Intensities can be intensity-biased
MA-Plot
Values should scatter around zero
Background Correction: Why bother?
Data Acquisition Criteria
Data Acquisition Criteria
Background Correction: Why bother?
Data Acquisition Criteria
RED versus GREEN
Log-transformation: Why bother?
Data Acquisition Criteria
MA-Plots: All versus only valid signals
Data Acquisition Criteria
Signal to Noise Ratio
$$ S2N = \frac{Fg - Bg}{\sigma_{Bg}} $$
Mean to Median Correlation
$$ M2M = \frac{\min(\mathrm{Mean}, \mathrm{Median})}{\max(\mathrm{Mean}, \mathrm{Median})} $$
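A minimal sketch of how the two acquisition criteria might be applied to a spot. It assumes S2N is the background-corrected signal divided by the standard deviation of the background, and M2M is the smaller of the spot's mean and median intensity divided by the larger. The cut-offs 1.00 and 0.85 are the relaxed criteria quoted earlier in these slides; the function and argument names are hypothetical.

```python
def signal_to_noise(fg, bg, bg_sd):
    """(Fg - Bg) / SD of background; assumes bg_sd > 0."""
    return (fg - bg) / bg_sd

def mean_to_median(mean, median):
    """min(Mean, Median) / max(Mean, Median); assumes positive intensities."""
    return min(mean, median) / max(mean, median)

def is_valid_spot(fg, bg, bg_sd, mean, median, s2n_min=1.00, m2m_min=0.85):
    """Keep a spot only if it passes both acquisition criteria."""
    return (signal_to_noise(fg, bg, bg_sd) > s2n_min
            and mean_to_median(mean, median) > m2m_min)

# Hypothetical spots: a clean one, and a weak one with a skewed pixel distribution
good = is_valid_spot(fg=900.0, bg=100.0, bg_sd=40.0, mean=900.0, median=850.0)
bad = is_valid_spot(fg=120.0, bg=100.0, bg_sd=40.0, mean=120.0, median=60.0)
```

A spot whose mean and median disagree strongly (M2M well below 1) typically has a few very bright or very dark pixels, which is why the ratio works as a simple quality flag.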
Data Normalisation
http://genome-www5.stanford.edu/mged/normalization.html
• Normalisation is an attempt to correct for systematic bias.
• Normalisation allows you to compare data from one array
to another.
• Systematic Bias can be introduced into microarray
experiments at all stages.
• Need to:
– Avoid it (as much as possible)
– Recognize it
– Correct for it
– Discard unrecoverable data
• In practice we do not always understand the data, so inevitably some biology will be removed too (or at least not revealed).
Data Normalisation
Source: Catherine Ball (Stanford)
[Figure: Tumor versus Pool of Cell Lines, annotated with the following sources of bias]
• Different amounts of starting material.
• Differential labeling efficiency of dyes.
• Different amounts of RNA in each channel.
• Differential efficiency of hybridization over slide surface.
• Differential efficiency of scanning in each channel.
Systematic Bias
Sources …
• Different labeling efficiencies or dye effects
• Scanner malfunction
• Differences in concentration of DNA on arrays (plate effects)
• Printing or tip problems
• Uneven hybridization
• Batch bias
• Experimenter issues
…and Dealing with it
• Detect and recognize the effect → Note something odd
• Determine magnitude and effect on data → Try a few methods
• Identify source of bias → Think big!
• Eliminate or reduce contributing factors
• Correct data
• Discard uncorrectable data
Systematic Bias
Labeling Efficiencies Cause Bias
• One channel of a two-channel array has higher intensity than the other (usually GREEN).
• Most common source of recognizable bias.
• Solution: Easiest to address (e.g. dye-swaps, balanced loops).
Systematic Bias
Scanning (operator?) Bias
• Mis-aligned lasers can
cause big problems
• In this case, the two
channels are slightly
out of register
• Solution: fix the
scanner and repeat
Systematic Bias
Printing (operator?) Bias
• Irregular shaped spots
are often observed
(printing error)
• Slides from the same
printing batch cluster
together
• Solution: Probably
limited to better printing
technique and image
analysis, rather than
normalization
Systematic Bias
Probe Bias
• Different concentrations
of probes might produce
patterns in arrays
• Biological role of probes
can produce patterns in
arrays
• These patterns can create a spatial bias that is not artificial, but biological
Systematic Bias
Probe Bias
• Probes arranged on the array based on biological function cause spatial bias
• Solution: avoid arranging reporters based on function, know your experimental design
[Figure labels: Coding regions; Intergenic regions]
Systematic Bias
Hybridisation (operator?) Bias
• Poor technique during
hybridisation can cause a
spatial bias
• Operator is one of the largest
sources of systematic bias
• Experiments done by the
same operator often cluster
together more tightly than
warranted by the biology
• Solution: Consistent
methods, successful
techniques
Data Normalisation
…and other beautifying techniques
Technique | Choices | Aim (Real) | Aim (Ideal)
Transformation “To Near Normality” | Log2; Lin-Log | Numerically tractable | Gaussian
Normalisation “Location” | Location Parameter: 1. Mean; 2. Median; 3. Regression(s) (LOWESS) | Account for systematic effects | Gaussian
Standardisation “Scale” | Scale Parameter | Stabilise variance | Gaussian
Data Normalisation
Transformation …to near normality
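Solution: Explore the entire Box-Cox family of power transformations:
$$ x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\ \ln(x) & \lambda = 0 \end{cases} $$
$$ l(\lambda) = -\frac{n}{2} \ln\left[ \frac{1}{n} \sum_{j=1}^{n} \left( x_j^{(\lambda)} - \bar{x}^{(\lambda)} \right)^2 \right] + (\lambda - 1) \sum_{j=1}^{n} \ln(x_j) $$
Maximum at λ ≈ 0, hence use the log-transformation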
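The search over the Box-Cox family can be sketched as a simple grid search. The sketch assumes the standard profile log-likelihood, l(λ) = −(n/2) ln(variance of the transformed data) + (λ − 1) Σ ln(x_j). The data are simulated (log-normal, which loosely mimics raw intensities), and the grid and function names are choices made for this illustration only.

```python
import math
import random

def box_cox(x, lam):
    """Box-Cox power transformation of a positive value x."""
    if abs(lam) < 1e-12:
        return math.log(x)
    return (x ** lam - 1.0) / lam

def box_cox_loglik(xs, lam):
    """Profile log-likelihood l(lambda) for positive data xs."""
    n = len(xs)
    ys = [box_cox(x, lam) for x in xs]
    mean_y = sum(ys) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n  # ML variance (divide by n)
    return -0.5 * n * math.log(var_y) + (lam - 1.0) * sum(math.log(x) for x in xs)

# Simulated "intensities": log-normal, so the log should be the best transformation
random.seed(1)
xs = [random.lognormvariate(8.0, 1.0) for _ in range(500)]

grid = [i / 20.0 for i in range(-40, 41)]  # lambda from -2 to 2 in steps of 0.05
best_lam = max(grid, key=lambda lam: box_cox_loglik(xs, lam))
```

With real array data the maximising λ will not be exactly zero, but if it falls close to zero the log-transformation is the natural, interpretable choice.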
Data Normalisation
Transformation …to near normality
Raw Data
…exponential-like
Log2 Transformed
…normal-like
Data Normalisation
Transformation …to near normality
Lin-Log Transformation
$$ x^{(\theta)} = \begin{cases} \log_2(x) & x \geq \theta \\ \log_2(\theta) - 1 + \dfrac{x}{\theta} & x < \theta \end{cases} $$
x = background corrected = Fg - Bg
Data Normalisation
Transformation …to near normality
• The Edwards’ transformation and the Lin-Log transformation both attempt to use the entire data, not only those spots for which foreground is greater than background.
• The reasoning is that errors are linear for small signals and multiplicative for large signals.
• The search for and choice of the threshold parameter could be rather unconvincing (e.g. different for different array slides).
• Solution: Use Log2 if Foreground > Background. Otherwise, use a small arbitrary value (say 0), or simply disregard.
• Alternatively: Use only Foreground and Log2 it.
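The suggested solution is easy to express in code. A minimal sketch, in which the function name and the `fallback` argument are hypothetical: `fallback=None` stands for "simply disregard", and `fallback=0.0` stands for the small arbitrary value.

```python
import math

def log2_signal(fg, bg, fallback=None):
    """Log2 of the background-corrected signal when Foreground > Background.

    fallback is returned otherwise: None means "disregard the spot",
    a number (say 0.0) means "use a small arbitrary value".
    """
    if fg > bg:
        return math.log2(fg - bg)
    return fallback

# Hypothetical (Fg, Bg) pairs
pairs = [(1124.0, 100.0), (90.0, 100.0), (356.0, 100.0)]
values = [log2_signal(fg, bg) for fg, bg in pairs]
kept = [v for v in values if v is not None]  # spots below background are dropped
```

The alternative on the slide (use only the foreground) would simply be `math.log2(fg)`, with no background term at all.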
Location Normalisation
Log2(R/G) – c = M – c, where c is the Location Parameter
GLOBAL:
Mean: c = Mean of M’s
Median: c = Median of M’s
→ Assumption: Changes roughly symmetric around Mean or Median
LOWESS: c = c(A) = Weighted Regression of M on A
→ Assumption: Changes roughly symmetric at all intensities
LOCAL:
LOWESS: c = c(i) = Weighted Regression of M on A within print-tip-group i
LOWESS = Locally WEighted Regression and Smoothing Scatterplots
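To make the difference between a global constant and an intensity-dependent c concrete, here is a minimal pure-Python sketch. It is not the SAS or R implementation used in the cited sources: it fits a tricube-weighted local straight line at each spot, omits the robustness iterations of full LOWESS, and the span `frac=0.4`, the simulated data and all function names are assumptions made for illustration.

```python
import math
import random
import statistics

def median_normalise(m_values):
    """Global location normalisation: subtract c = Median of M's."""
    c = statistics.median(m_values)
    return [m - c for m in m_values]

def lowess_fit(a_values, m_values, frac=0.4):
    """Fitted c(A) at each spot from a tricube-weighted local linear regression."""
    n = len(a_values)
    k = min(n, max(3, math.ceil(frac * n)))  # number of neighbours in each window
    fitted = []
    for a0 in a_values:
        h = sorted(abs(a - a0) for a in a_values)[k - 1] or 1e-12  # window half-width
        sw = swa = swm = swaa = swam = 0.0
        for a, m in zip(a_values, m_values):
            u = abs(a - a0) / h
            if u >= 1.0:
                continue
            w = (1.0 - u ** 3) ** 3  # tricube weight
            sw += w
            swa += w * a
            swm += w * m
            swaa += w * a * a
            swam += w * a * m
        denom = sw * swaa - swa * swa
        if abs(denom) < 1e-12:
            fitted.append(swm / sw)  # degenerate window: weighted mean
        else:
            slope = (sw * swam - swa * swm) / denom
            fitted.append((swm - slope * swa) / sw + slope * a0)
    return fitted

def lowess_normalise(a_values, m_values, frac=0.4):
    """Intensity-dependent location normalisation: M - c(A)."""
    c = lowess_fit(a_values, m_values, frac)
    return [m - ci for m, ci in zip(m_values, c)]

# Simulated slide with an intensity-dependent dye bias
random.seed(7)
a_values = [6.0 + 8.0 * i / 199 for i in range(200)]
m_values = [0.8 - 0.1 * (a - 6.0) + random.gauss(0.0, 0.2) for a in a_values]
m_global = median_normalise(m_values)
m_lowess = lowess_normalise(a_values, m_values)
```

For the LOCAL version, the same `lowess_normalise` call would be applied separately to the spots of each print-tip group i. The global median only shifts the cloud of points, whereas the local fit also removes the tilt along A.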
Location Normalisation
LOWESS = Locally WEighted Regression and
Smoothing Scatterplots
Source: G Rosa 2003.
Location Normalisation
LOWESS = Locally WEighted Regression and
Smoothing Scatterplots
SAS Code
Source: G Rosa 2003.
Genetic analysis of complex traits using SAS
ISBN 1-59047-507-0
Location Normalisation
LOWESS = Locally WEighted Regression and
Smoothing Scatterplots
Normalised
Intensities
Source: G Rosa 2003.
Location Normalisation
LOWESS = Locally WEighted Regression and
Smoothing Scatterplots
Source: G Rosa 2003.
Location Normalisation
None
Source: Yang et al 2002
Location Normalisation
After Global Median
Source: Yang et al 2002
Location Normalisation
Global Lowess
Source: Yang et al 2002
Location Normalisation
Print-tip-Group Lowess
Source: Yang et al 2002
Location Normalisation
After Print-tip-Group Lowess
Source: Yang et al 2002
Location Normalisation
Additional Assumption (other than symmetry of changes):
The proportion of genes that are
Differentially Expressed (DE) is minimal
Question:
Which genes to use?
Answer:
Only the ones (housekeeping) that we
know are not DE
Comment:
“Boutique” arrays become a nuisance
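A small sketch of why the choice of genes matters. It estimates the location parameter c from a set of control (housekeeping) genes only and applies it to every gene. The gene names, M values and function name are all hypothetical, and the median is used as the location parameter purely for illustration.

```python
import statistics

def normalise_with_controls(m_by_gene, control_genes):
    """Subtract c = Median of M over the control genes from every gene's M."""
    controls = [m_by_gene[g] for g in control_genes if g in m_by_gene]
    if not controls:
        raise ValueError("no control genes found on this array")
    c = statistics.median(controls)
    return {gene: m - c for gene, m in m_by_gene.items()}

# Hypothetical "boutique" array: half of the spotted genes are truly DE
m_by_gene = {"HK1": 0.4, "HK2": 0.5, "HK3": 0.6,
             "GENE_A": 2.5, "GENE_B": 2.7, "GENE_C": 3.1}
normalised = normalise_with_controls(m_by_gene, ["HK1", "HK2", "HK3"])

# For comparison: the global median uses all genes, DE or not
global_c = statistics.median(m_by_gene.values())
```

Here the controls give c = 0.5, while the global median is pulled up to about 1.55 by the DE genes, so a global normalisation would shrink the very changes the array was designed to detect.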
Scale Normalisation (Standardisation)
“Some scale adjustments may be required so that the relative expression
levels from one particular experiment (slide) do not dominate the average
relative expression levels across replicate experiments.”
Yang et al 2002
$$ \frac{\log_2(R/G) - c(i)}{a(i)} $$
Notes: 1. The scaling a(i) is such that Var(M) = a(i)² σ²
2. The estimation requires an approximation (“robust”) to the geometric mean:
$$ a(i) = \frac{\mathrm{MAD}_i}{\sqrt[I]{\prod_{i=1}^{I} \mathrm{MAD}_i}} $$
where MAD is the Median Absolute Deviation.
3. It doesn’t get any more heuristic (funnier?) than this
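A minimal sketch of the MAD-based scaling. It assumes each group's scale factor is its MAD divided by the geometric mean of the MADs over all I groups (slides or print-tip groups), that the M values have already been location-normalised, and that every MAD is positive. The data and function names are invented for the example.

```python
import math
import statistics

def mad(values):
    """Median Absolute Deviation about the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def scale_factors(groups):
    """a(i) = MAD_i / geometric mean of all MADs; assumes every MAD > 0."""
    mads = [mad(g) for g in groups]
    geo_mean = math.exp(sum(math.log(x) for x in mads) / len(mads))
    return [x / geo_mean for x in mads]

def scale_normalise(groups):
    """Divide the location-normalised M values of each group by its a(i)."""
    return [[m / a for m in g] for g, a in zip(groups, scale_factors(groups))]

# Two hypothetical slides: same pattern, but the second is four times as spread out
groups = [[-1.0, -0.5, 0.0, 0.5, 1.0],
          [-4.0, -2.0, 0.0, 2.0, 4.0]]
factors = scale_factors(groups)
scaled = scale_normalise(groups)
```

Because the factors are ratios to a geometric mean, they multiply to 1, so the overall scale of M is preserved while no single slide dominates the average.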
Data Normalisation
…and other beautifying techniques
Notes:
1. Except Log2, everything else applies only to Ratios: M = log2(R/G)
2. Except Log2, everything else applies only within slide
3. Everything is beautified to identify DE genes straight from MA-plot,
either from a single slide or from a function of M’s across slides.
4. The uncertainty in measurements increases as intensity decreases
5. Measurements close to the detection limit are the most uncertain
(cf. Sensitivity)
6. Fold-change measurements ignore these effects
7. We can calculate an intensity-dependent z-score that measures the
ratio relative to the standard deviation in the data
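Notes 4 to 7 can be illustrated with a rough sketch of an intensity-dependent z-score. It assumes the M values are already location-normalised, and it estimates the local standard deviation from a sliding window of spots with neighbouring A values. The window size, the simulated data and the function name are assumptions for this illustration, not the procedure of any cited source.

```python
import random
import statistics

def local_z_scores(a_values, m_values, window=50):
    """Z = M / (SD of M among the spots with the nearest A values)."""
    n = len(a_values)
    order = sorted(range(n), key=lambda i: a_values[i])  # spot indices sorted by A
    half = window // 2
    z = [0.0] * n
    for rank, i in enumerate(order):
        lo = max(0, rank - half)
        hi = min(n, rank + half + 1)
        neighbours = [m_values[j] for j in order[lo:hi]]
        sd = statistics.pstdev(neighbours)
        z[i] = m_values[i] / sd if sd > 0 else 0.0
    return z

# Simulated slide: the noise in M shrinks as intensity A grows
random.seed(3)
a_values = [6.0 + 8.0 * i / 399 for i in range(400)]
m_values = [random.gauss(0.0, 1.0 - 0.9 * (a - 6.0) / 8.0) for a in a_values]

# Two hypothetical genes with the same 2-fold change (M = 1) at low and high intensity
a_values += [6.5, 13.5]
m_values += [1.0, 1.0]
z = local_z_scores(a_values, m_values)
z_low, z_high = z[-2], z[-1]
```

Both genes show the same fold-change, yet only the one at high intensity stands out against its local noise, which is the point of replacing a fixed fold-change cut-off with a local z-score.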
Data Normalisation
…and other beautifying techniques
[Figure, panel 1: Corrected Log10 ( Ratio ) versus Mean ( Log10 ( Intensity ) ), showing the locally estimated standard deviation of positive ratios and of negative ratios, the Z = 1 and Z = -1 contours, and the 2-fold lines]
[Figure, panel 2: Corrected Log10 ( Ratio ) versus Mean ( Log10 ( Intensity ) ), showing the Z = ±1, Z = ±2 and Z = ±5 contours and the 2-fold lines]
[Figure, panel 3: Local Log10 ( Ratio ) Z-Score versus Mean ( Log10 ( Intensity ) )]
Z > 2 is at the ~ 95% confidence level
Source: J Pevsner 2004
Normalisation: References
Bilban M, Buehler LK, Head S, Desoye G, Quaranta V. Normalizing DNA microarray
data. Curr Issues Mol Biol. 2002 Apr;4(2):57-64.
Durbin BP, Hardin JS, Hawkins DM, Rocke DM. A variance-stabilizing transformation
for gene-expression microarray data. Bioinformatics. 2002 Jul;18 Suppl 1:S105-10.
Kepler TB, Crosby L, Morgan KT. Normalization and analysis of DNA microarray data
by self-consistency and local regression. Genome Biol. 2002 Jun
28;3(7):RESEARCH0037.
Schuchhardt J, Beule D, et al. Normalization strategies for cDNA microarrays. Nucleic
Acids Res. 2000;28(10):e47.
Tran PH, Peiffer DA, Shin Y, Meek LM, Brody JP, Cho KW. Microarray optimizations:
increasing spot accuracy and automated identification of true microarray signals. Nucleic
Acids Res. 2002 Jun 15;30(12):e54.
Tseng GC, Oh MK, Rohlin L, Liao JC, Wong WH. Issues in cDNA microarray analysis:
quality filtering, channel normalization, models of variations and assessment of gene
effects. Nucleic Acids Res. 2001 Jun 15;29(12):2549-57.
Tsodikov A, Szabo A, Jones D. Adjustments and measures of differential expression for
microarray data. Bioinformatics. 2002 Feb;18(2):251-60.
Yang MC, Ruan QG, Yang JJ, Eckenrode S, Wu S, McIndoe RA, She JX. A statistical
method for flagging weak spots improves normalization and ratio estimates in
microarrays. Physiol Genomics. 2001 Oct 10;7(1):45-53.
Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. Normalization for cDNA
microarray data: a robust composite method addressing single and multiple slide
systematic variation. Nucleic Acids Res. 2002 Feb 15;30(4):e15.