Lecture notes

Normalization
Getting the numbers comparable
DNA Microarray Bioinformatics - #27612
The DNA Array Analysis Pipeline
Question
Experimental Design
Array design
Probe design
Sample Preparation
Hybridization
Buy Chip/Array
Image analysis
Normalization
Expression Index Calculation
Comparable Gene Expression Data
Statistical Analysis
Fit to Model (time series)
Advanced Data Analysis
Clustering
Meta analysis
PCA
Classification
Survival analysis
Promoter Analysis
Regulatory Network
Intensities are not just mRNA concentrations
• Tissue contamination
• RNA degradation
• RNA purification
• Reverse transcription
• Amplification efficiency
• Dye effect (Cy3/Cy5)
• Spotting
• DNA-support binding
• Other issues related to array manufacturing
• ‘Background’ correction
• Image segmentation
• Hybridization efficiency and specificity
• Spatial effects
Two kinds of variation
Global variation (systematic)
• Amount of RNA in the biopsy
• Efficiencies of:
– RNA extraction
– Reverse transcription
– Amplification
– Labeling
– Photodetection
Gene-specific variation (stochastic)
• Spotting efficiency
– Spot size
– Spot shape
• Cross-/unspecific hybridization
• Biological variation
– Effect
– Noise
We use statistics to deal with stochastic noise
[Figure: PCA plot of 34 patients, 8973 dimensions (genes) reduced to 2]
...as we will see later
[Figure: PCA for the 100 most significant genes, reduced to 2 dimensions]
Sources of variation
Global variation (systematic):
• Similar effect on many measurements
• Corrections can be estimated from data
• Handled by normalization
Gene-specific variation (stochastic):
• Too random to be explicitly accounted for
• “Noise”
• Handled by statistical testing
Calibration = Normalization = Scaling
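The simplest reading of "scaling" is a single multiplicative factor per array. The sketch below is only an illustration of that idea: the function name `scale_to_target` and the choice of the median as the summary statistic are assumptions, not something the slides prescribe.

```python
from statistics import median

def scale_to_target(intensities, target_median):
    """Multiply every intensity by one factor so the array median equals target_median."""
    factor = target_median / median(intensities)
    return [x * factor for x in intensities]

# Two hypothetical arrays of the same sample; array_b has half the overall signal.
array_a = [100.0, 200.0, 400.0, 800.0, 1600.0]
array_b = [50.0, 100.0, 200.0, 400.0, 800.0]

# Use array_a as the reference and bring array_b onto its scale.
scaled_b = scale_to_target(array_b, median(array_a))
```

A single factor can only remove a global difference such as the amount of RNA or the labeling efficiency; it cannot correct an intensity-dependent bias.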
Nonlinear normalization
The Qspline method
From the empirical distribution, a number of quantiles are calculated for
each of the channels to be normalized (one channel shown in red) and for
the reference distribution (shown in black).
A QQ-plot is made and a normalization curve is constructed by fitting a
cubic spline function.
As reference, one can use an artificial “median array” for a set of arrays,
or a log-normal distribution, which is a good approximation.
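The quantile-matching idea can be sketched as follows. This is a simplified stand-in, not the Qspline method itself: it connects the quantile pairs with straight line segments instead of fitting a cubic spline, and it clamps values outside the quantile range. All function names and the number of quantiles are assumptions chosen for illustration.

```python
from bisect import bisect_right

def quantiles(values, n_quantiles):
    """Pick n_quantiles evenly spaced order statistics, from minimum to maximum."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(k * last / (n_quantiles - 1))] for k in range(n_quantiles)]

def fit_quantile_curve(channel, reference, n_quantiles=5):
    """Return the QQ-plot points: channel quantiles (x) and reference quantiles (y)."""
    return quantiles(channel, n_quantiles), quantiles(reference, n_quantiles)

def apply_curve(x, xs, ys):
    """Map x through the piecewise-linear curve; clamp outside the fitted range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x) - 1          # segment with xs[i] <= x < xs[i + 1]
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])

# Hypothetical channel that differs from the reference by a scale and an offset.
channel = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
reference = [2.0 * v + 1.0 for v in channel]

xs, ys = fit_quantile_curve(channel, reference)
normalized = [apply_curve(v, xs, ys) for v in channel]
```

In practice the curve would usually be fitted on log intensities and with many more quantiles; a spline gives a smoother curve than the line segments used here.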
Once again… qspline
Accumulating quantiles
When many microarrays are to be normalized to each other, an average
array can be used as the target.
Lowess Normalization
[Figure: scatterplot of M versus A]
One of the most commonly utilized normalization
techniques is the LOcally Weighted Scatterplot
Smoothing (LOWESS) algorithm.
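A minimal sketch of the idea, assuming the usual two-channel definitions M = log2(R/G) and A = ½·log2(R·G). It uses a single-pass local linear fit with tricube weights; the full LOWESS algorithm adds robustness iterations that are omitted here. The function names and the `frac` parameter are illustrative assumptions.

```python
import math

def ma_transform(red, green):
    """Return M (log ratio) and A (average log intensity) for each spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

def local_linear_fit(x0, xs, ys, frac=0.5):
    """Tricube-weighted straight-line fit around x0, evaluated at x0."""
    k = min(len(xs), max(2, math.ceil(frac * len(xs))))
    dists = sorted(abs(x - x0) for x in xs)
    h = dists[k - 1] or 1e-12            # bandwidth: distance to the k-th nearest point
    w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

def lowess_normalize(red, green, frac=0.5):
    """Subtract the smoothed M-versus-A trend from every M value."""
    m, a = ma_transform(red, green)
    return [mi - local_linear_fit(ai, a, m, frac) for mi, ai in zip(m, a)]

# Hypothetical data with a dye bias that grows with intensity.
green = [2.0 ** k for k in range(4, 12)]
red = [g * 2.0 ** (0.1 * k) for g, k in zip(green, range(4, 12))]
normalized_m = lowess_normalize(red, green)
```

Here the bias is a pure function of intensity, so the normalized M values come out close to zero; on real data the genes that remain away from zero are the candidates for differential expression.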
Invariant set normalization (Li and Wong)
An invariant set of probes is used
- Probes that do not change intensity rank between arrays
- A piecewise linear median line is calculated
- This curve is used for normalization
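The three steps above can be sketched as follows. This is a single-pass simplification, not the published procedure: the rank threshold `max_rank_shift`, the binned medians used as curve knots, and all function names are assumptions made for illustration, and the sketch assumes the invariant probes have distinct intensities.

```python
from statistics import median

def ranks(values):
    """Rank of each value within its array (0 = smallest)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result

def invariant_set(array, baseline, max_rank_shift=1):
    """Indices of probes whose rank differs by at most max_rank_shift."""
    ra, rb = ranks(array), ranks(baseline)
    return [i for i in range(len(array)) if abs(ra[i] - rb[i]) <= max_rank_shift]

def median_curve(array, baseline, keep, n_bins=3):
    """Knots of a piecewise-linear curve: per-bin medians of the invariant probes."""
    pts = sorted((array[i], baseline[i]) for i in keep)
    size = max(1, len(pts) // n_bins)
    knots = []
    for start in range(0, len(pts), size):
        chunk = pts[start:start + size]
        knots.append((median(x for x, _ in chunk), median(y for _, y in chunk)))
    return knots

def apply_curve(x, knots):
    """Interpolate linearly between knots; extend the end segments outward."""
    i = 0
    while i < len(knots) - 2 and x > knots[i + 1][0]:
        i += 1
    (x0, y0), (x1, y1) = knots[i], knots[i + 1]
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

# Hypothetical data: the array is twice as bright as the baseline,
# and probe 1 is truly up-regulated (150 instead of 40).
baseline = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
array = [20.0, 150.0, 60.0, 80.0, 100.0, 120.0, 140.0, 160.0]

keep = invariant_set(array, baseline)
knots = median_curve(array, baseline, keep)
normalized = [apply_curve(x, knots) for x in array]
```

The up-regulated probe is excluded from the invariant set, so it does not distort the curve, and it remains elevated after normalization.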
Spatial normalization
[Figure panels: raw data; after intensity normalization; spatial bias estimate; after spatial normalization]
The DNA Array Analysis Pipeline
Question
Experimental Design
Array design
Probe design
Sample Preparation
Hybridization
Buy Chip/Array
Image analysis
Normalization
Expression Index Calculation
Comparable Gene Expression Data
Statistical Analysis
Fit to Model (time series)
Advanced Data Analysis
Clustering
Meta analysis
PCA
Classification
Survival analysis
Promoter Analysis
Regulatory Network
Expression index value
Some microarrays have multiple probes addressing
the expression of the same gene
– Affymetrix chips have 11-20 probe pairs per gene
PM: CGATCAATTGCACTATGTCATTTCT (Perfect Match)
MM: CGATCAATTGCAGTATGTCATTTCT (MisMatch)
Expression index calculation
Simplest method? The median of the probe intensities
But more sophisticated methods exist:
dChip, RMA and MAS 5 (from Affymetrix)
dChip (Li & Wong)
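Model:
$PM_{ij} = \theta_i \phi_j + \varepsilon_{ij}$
where $\theta_i$ is the expression index on array $i$ and $\phi_j$ is the sensitivity of probe $j$
Outlier removal:
– Identify extreme residuals
– Remove
– Re-fit
– Iterate
The distribution of the errors $\varepsilon_{ij}$ is assumed
independent of signal strength
(Li and Wong, 2001)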
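The multiplicative model (one expression value per array times one sensitivity per probe) can be fitted by alternating least squares. The sketch below shows only that fitting step, without the outlier removal loop, and is not the dChip implementation. The function name `fit_multiplicative_model` and the constraint that the squared probe sensitivities sum to the number of probes are assumptions used here to make the factors identifiable.

```python
def fit_multiplicative_model(pm, n_iter=20):
    """Fit pm[i][j] ~ theta[i] * phi[j] by alternating least squares.

    pm[i][j] is the intensity of probe j on array i.
    """
    n_arrays, n_probes = len(pm), len(pm[0])
    phi = [1.0] * n_probes

    def update_theta(phi):
        # Least-squares expression value per array, with phi held fixed.
        denom = sum(p * p for p in phi)
        return [sum(row[j] * phi[j] for j in range(n_probes)) / denom for row in pm]

    for _ in range(n_iter):
        theta = update_theta(phi)
        # Least-squares sensitivity per probe, with theta held fixed.
        denom = sum(t * t for t in theta)
        phi = [sum(pm[i][j] * theta[i] for i in range(n_arrays)) / denom
               for j in range(n_probes)]
        # Rescale so that sum(phi^2) equals the number of probes.
        scale = (sum(p * p for p in phi) / n_probes) ** 0.5
        phi = [p / scale for p in phi]
    return update_theta(phi), phi

# Hypothetical noise-free probe set: 3 arrays, 4 probes.
true_theta = [100.0, 200.0, 400.0]
true_phi = [0.5, 1.0, 1.5, 1.0]
pm = [[t * p for p in true_phi] for t in true_theta]

theta, phi = fit_multiplicative_model(pm)
residuals = [[pm[i][j] - theta[i] * phi[j] for j in range(4)] for i in range(3)]
```

The residuals are what the outlier-removal loop would inspect: probes or arrays with extreme residuals are dropped and the model is fitted again.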
RMA
Robust Multi-array Average (RMA) expression
measure (Irizarry et al., Biostatistics, 2003)
For each probe set, rewrite $PM_{ij} = \theta_i \phi_j$ as:
$\log(PM_{ij}) = \log(\theta_i) + \log(\phi_j)$
Fit this additive model by iteratively re-weighted
least-squares or median polish
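Median polish fits an additive model by alternately subtracting row medians and column medians. The sketch below covers only this summarization step for one probe set (the full RMA procedure also includes background correction and quantile normalization before it). The function names and the fixed number of sweeps are assumptions for illustration.

```python
import math
from statistics import median

def median_polish(table, n_iter=10):
    """Tukey median polish: table[i][j] ~ overall + row_eff[i] + col_eff[j]."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [row[:] for row in table]
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep out row medians.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep out column medians.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid

def summarize_probe_set(pm):
    """Log2 expression per array for one probe set; rows = arrays, columns = probes."""
    log_pm = [[math.log2(v) for v in row] for row in pm]
    overall, array_eff, _, _ = median_polish(log_pm)
    return [overall + a for a in array_eff]

# Hypothetical probe set: 3 arrays (rows) by 3 probes (columns).
pm = [[128.0, 256.0, 512.0],
      [256.0, 512.0, 1024.0],
      [512.0, 1024.0, 2048.0]]
expression = summarize_probe_set(pm)
```

Because medians are used instead of means, a single aberrant probe on one array has little influence on the fitted array effects.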
MAS 5
MicroArray Suite version 5 uses
$\mathrm{signal} = \mathrm{TukeyBiweight}\{\log(PM_j - MM^{*}_j)\}$
MM* is an adjusted MM that is never bigger than PM
Tukey biweight is a robust average procedure with weights
and outlier rejection
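A one-step Tukey biweight can be sketched as below. The tuning constant `c = 5` and the small `eps` that guards against division by zero are assumptions taken from common descriptions of the procedure, not values stated on the slide, and the input is assumed to be the already computed log(PM − MM*) values of one probe set.

```python
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that ignores far outliers."""
    center = median(values)
    mad = median(abs(v - center) for v in values)   # median absolute deviation
    weights = []
    for v in values:
        u = (v - center) / (c * mad + eps)          # scaled distance from the median
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical log2(PM - MM*) values for one probe set; the last probe is an outlier.
log_differences = [7.0, 8.0, 8.0, 8.0, 9.0, 20.0]
signal = tukey_biweight(log_differences)
plain_mean = sum(log_differences) / len(log_differences)
```

Values close to the median get weights near 1, values further away are down-weighted, and values beyond the cut-off get weight 0, so the outlier at 20 does not pull the estimate the way it pulls the plain mean.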
Methods compared on expression variance
[Figure: standard deviation of gene measures from 20 replicate arrays versus expression level]
Blue and red: RMA; black: dChip; green: MAS 5.0
From Terry Speed
Robustness: MAS 5.0
[Figure: log fold change estimates from 20 µg cRNA compared with estimates from 1.25 µg cRNA]
(Irizarry et al., Biostatistics, 2003)
Robustness: dChip
[Figure: log fold change estimates from 20 µg cRNA compared with estimates from 1.25 µg cRNA]
(Irizarry et al., Biostatistics, 2003)
Robustness: RMA
[Figure: log fold change estimates from 20 µg cRNA compared with estimates from 1.25 µg cRNA]
(Irizarry et al., Biostatistics, 2003)
All of this is implemented in R,
in the Bioconductor package ‘affy’
(Gautier et al., 2003).
References
Li and Wong (2001). Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology 2:1–11.
Irizarry, Bolstad, Collin, Cope, Hobbs and Speed (2003). Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research 31(4):e15.
Affymetrix (2001). Affymetrix Microarray Suite User Guide, version 5 edition. Affymetrix, Santa Clara, CA.
Gautier, Cope, Bolstad and Irizarry (2003). affy - an R package for the analysis of Affymetrix GeneChip data at the probe level. Bioinformatics.