Microarray Basics: Part 2 - Department of Pathology and

Microarray Basics
Part 2: Data normalization, data
filtering, measuring variability
Log transformation of data
Raw intensities:
Most data bunched in lower left corner
Variability increases with intensity
Log-transformed intensities:
Data are spread more evenly
Variability is more even
Simple global normalization
to try to fit the data
A slope not equal to 1 means one channel responds more at higher
intensity
A non-zero intercept means one
channel is consistently brighter
A non-straight line means
non-linearity in the intensity responses
of the two channels
Linear regression of
Cy3 against Cy5
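A minimal Python sketch of the regression idea above, assuming the two channels are given as lists of log intensities. The function names are hypothetical, and adjusting one channel by inverting the fitted line is one simple way to apply the fit, not necessarily the procedure used in the original slides.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def normalize_channel(x, y):
    """Adjust y so that its regression on x has slope 1 and intercept 0."""
    slope, intercept = fit_line(x, y)
    return [(yi - intercept) / slope for yi in y]


# Toy data: the second channel is twice as responsive and offset by 1.
cy3 = [1.0, 2.0, 3.0, 4.0]
cy5 = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(cy3, cy5)
adjusted_cy5 = normalize_channel(cy3, cy5)
```

A slope different from 1 or an intercept different from 0 in the fitted line corresponds to the channel effects listed above; a curved relationship is not handled by this straight-line fit.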
MA plots
Regressing one channel against the other has the disadvantage
of treating the two sets of signals separately
Also suggested that the human eye has a harder time seeing
deviations from a diagonal line than a horizontal line
MA plots get around both these
issues
Basically a rotation and rescaling of the data
X axis
A= (log2R + log2G)/2
Y axis
M= log2R-log2G
Scatter plot of intensities
MA plot of same data
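The A and M values defined above can be computed as in this short Python sketch. The function name and the toy intensities are illustrative; intensities are assumed to be positive.

```python
import math


def ma_values(red, green):
    """Return (A, M) lists for paired red/green intensities (all > 0)."""
    a_vals, m_vals = [], []
    for r, g in zip(red, green):
        log_r, log_g = math.log2(r), math.log2(g)
        a_vals.append((log_r + log_g) / 2)  # A: average log intensity (x axis)
        m_vals.append(log_r - log_g)        # M: log ratio (y axis)
    return a_vals, m_vals


a, m = ma_values([1024, 256], [256, 256])
```

A feature with equal signal in both channels gets M = 0, so unchanged genes fall on a horizontal line.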
Non linear normalization
Normalization that takes into account intensity effects
Lowess (or loess) is a locally weighted polynomial regression
User defines the size of
bins used to calculate the
best fit line
Taken from Stekel (2003) Microarray Bioinformatics
For each feature, the log ratio
is adjusted using the loess fit
at its average intensity
(x axis)
Should now see the data
centred around 0 and
straight across the horizontal
axis
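A rough Python sketch of intensity-dependent normalization on MA values. It uses a simple local linear fit with tricube weights as a stand-in for a full loess implementation; the function names and the `span` parameter (the fraction of features in each local window) are assumptions for illustration. It is quadratic in the number of features, so it is only suitable for small examples.

```python
def loess_fit(x, y, span=0.5):
    """Locally weighted linear fit of y on x, evaluated at every x."""
    n = len(x)
    k = min(n, max(2, int(round(span * n))))  # points in each local window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # window half-width (guard against zero)
        # Tricube weights: 1 at x0, falling to 0 at distance h and beyond.
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def loess_normalize(a_vals, m_vals, span=0.5):
    """Subtract the fitted curve from M so the data centre on 0."""
    fit = loess_fit(a_vals, m_vals, span)
    return [m - f for m, f in zip(m_vals, fit)]


# Toy data with an intensity-dependent trend in the log ratio.
a_vals = [float(i) for i in range(10)]
m_vals = [0.5 * a - 1.0 for a in a_vals]
normalized = loess_normalize(a_vals, m_vals)
```

A larger `span` gives a smoother curve; a smaller one follows local structure more closely and risks removing real signal.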
Spatial defects over the slide
• In some cases, you may notice a spatial bias
of the two channels
• May be a result of the slide not lying
completely flat in the scanner
• This will not be corrected by the methods
discussed before
Spatial Bias
Regressions for spatial bias
• Carry out normal loess regression but treat each
subgrid as an entire array (block by block loess)
• Corrects best for artifacts introduced by the pins,
as opposed to artifacts of regions of the slide
– Because each subgrid has relatively few spots, risk
having a subgrid where a substantial proportion of spots
are really differentially expressed- you will lose data if
you apply a loess regression to that block
• May also perform a 3-D loess- plot log ratio for
each feature against its x and y coordinates and
perform regression
Between array normalization
• Previous manipulations help to correct for non-biological differences between channels on one
array
• In order to compare across arrays, also need to
take into account technical variation between
slides
• Can start by visualizing the overall data as box
plots
• Looking at the distributions of the log ratios or the
log intensities across arrays
[Figure: box plot of one array. Box shows the std dev of the distribution with the mean; whiskers show the extremes of the distribution.]
Data Scaling
•Makes mean of
distributions equal
•Subtract mean
log ratio
from each log ratio
•Mean of measurements
will be zero
Data Centering
•Makes means and
standard deviations equal
•Do as for scaling,
but also divide by the
standard deviation of the array
•Will have mean intensity
measurements of zero,
standard deviations of 1
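The scaling and centering steps, as named in these slides, might look like this in Python. The function names are hypothetical, and each array is assumed to be a list of log ratios (or log intensities).

```python
import statistics


def scale_array(values):
    """Scaling: subtract the array mean so the mean becomes 0."""
    mean = statistics.mean(values)
    return [v - mean for v in values]


def centre_array(values):
    """Centering: subtract the mean, then divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


array = [1.0, 2.0, 3.0]
scaled = scale_array(array)     # mean 0, spread unchanged
centred = centre_array(array)   # mean 0, standard deviation 1
```

Applying the same function to every array in a data set makes their means (and, for centering, their spreads) comparable.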
Distribution normalization
•Makes overall distributions
identical between arrays
•Centre arrays
•For each array, order centered
intensities from highest to lowest
•Compute new distribution whose
lowest value is average of all
lowest values, and so on
•Replace original data with new
values for distribution
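The distribution normalization steps above can be sketched in Python as follows, assuming every array has the same number of features and has already been centred. The function name is hypothetical, and tied values are ranked in an arbitrary order here.

```python
def distribution_normalize(arrays):
    """arrays: list of equal-length lists, one list of values per array."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # New distribution: the average of the values that share each rank.
    reference = [sum(s[i] for s in sorted_arrays) / len(arrays)
                 for i in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices, lowest first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace value by its rank's average
        result.append(out)
    return result


normalized = distribution_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After this step every array contains exactly the same set of values; only the assignment of values to features differs.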
Some key points
• Design the experiment based on the questions you
want to ask
• Look at your TIFF images
• Look at the raw data with scatter plots and MA
plots
• Normalize within arrays to remove systematic
variability between channels
• Scale between arrays prior to comparing results in
a data set
Data Filtering (flagging of data)
• Can use data filtering to remove or flag
features that one might consider to be
unreliable
• May base the filter on parameters such as
individual intensity, average feature
intensity, signal to noise ratio, standard
deviation across a feature
Using intensity filters
• Object is to remove features that have
measurements close to background levels; may see large ratios that reflect small
changes in very small numbers
• May want to set the filter as anything less
than 2 times the standard deviation of the
background
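A small Python sketch of such an intensity filter. It assumes each feature has a signal value and a standard deviation of its background; the function name and the default multiplier `k=2.0` are illustrative.

```python
def flag_low_intensity(signals, background_sds, k=2.0):
    """Return True for features whose signal is below k * background SD."""
    return [s < k * sd for s, sd in zip(signals, background_sds)]


signals = [150.0, 30.0, 500.0]
background_sds = [20.0, 20.0, 100.0]
flags = flag_low_intensity(signals, background_sds)
```

In a two-channel experiment the same check could be run on each channel and a feature flagged if either channel fails; that choice is left to the analyst.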
If using signal to noise ratio, keep in mind that the numbers
calculated by QuantArray are:
spot intensity/std dev of background
Should see that the
S/N ratio increases
at higher intensity
Taken from DNA Microarray Data Analysis (CSC) http://www.csc.fi/oppaat/siru/
Removing outliers
• May want to simply remove outliers- some
estimates are that the extreme ends of the
distribution should be considered outliers
and removed (0.3% at either end)
• Also want to remove saturated values (in
either channel)
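A Python sketch of both filters, under stated assumptions: the function names are hypothetical, the trimmed fraction defaults to the 0.3% mentioned above, and the saturation limit of 65535 assumes 16-bit scanner images, which may not match a given instrument.

```python
def trim_extremes(log_ratios, fraction=0.003):
    """Return indices of features kept after dropping both distribution tails."""
    n = len(log_ratios)
    cut = int(n * fraction)  # number of features dropped at each end
    order = sorted(range(n), key=lambda i: log_ratios[i])
    keep = set(order[cut:n - cut])
    return [i for i in range(n) if i in keep]


def not_saturated(red, green, max_value=65535):
    """True if neither channel has reached the assumed saturation limit."""
    return red < max_value and green < max_value


values = [5.0, -9.0, 0.1, 0.2, 9.0, -0.1, 0.0, -5.0]
kept = trim_extremes(values, fraction=0.25)  # large fraction for a tiny example
```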
Filtering based on replicates
• Consider two replicates with dyes swapped:
A1/B1 and B2/A2
• We expect to see
(A1/B1) * (B2/A2) = 1
• We can calculate σ, the standard deviation of log2[(A1/B1) * (B2/A2)], and eliminate spots with the
greatest uncertainty: deviation > 2σ
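A Python sketch of a dye-swap consistency filter. It assumes the two replicates give ratios A1/B1 and B2/A2 for each spot, takes the log2 of their product (expected to be 0), and keeps spots within two standard deviations of the mean. The function name and the `n_sd` parameter are illustrative.

```python
import math
import statistics


def dye_swap_filter(t1, t2, n_sd=2.0):
    """t1: A1/B1 ratios, t2: B2/A2 ratios. Return indices of spots kept."""
    products = [math.log2(a * b) for a, b in zip(t1, t2)]
    mean = statistics.mean(products)
    sd = statistics.stdev(products)
    return [i for i, p in enumerate(products) if abs(p - mean) <= n_sd * sd]


# Nine consistent spots (product of ratios = 1) and one inconsistent spot.
t1 = [2.0] * 9 + [8.0]
t2 = [0.5] * 9 + [8.0]
kept = dye_swap_filter(t1, t2)
```

In this toy data only the last spot, whose ratios do not cancel, is removed.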
Replicate Filtering
•Plot of the log
ratios of 2 replicates
•Remove the data
in red based on
deviation of 2
st dev
Taken from Quackenbush (2002) Nat Genet Supp 32
Z-scores
• The uncertainty in measurements increases as
intensity decreases
• Measurements close to the detection limit are
the most uncertain
• Can calculate an intensity-dependent Z-score
that measures the ratio relative to the standard
deviation in the data:
Z = (log2(R/G) - μ)/σ
where μ is the mean and σ the standard deviation of log2(R/G) for features of similar intensity
Intensity-dependent Z-score
|Z| > 2 is at the 95.5% confidence level
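One way to sketch an intensity-dependent Z-score in Python, assuming Z = (log2(R/G) - μ)/σ with μ and σ taken from a sliding window of features with similar average intensity. The function name and the `window` parameter are assumptions; the window must cover at least two features.

```python
import statistics


def intensity_dependent_z(a_vals, m_vals, window=50):
    """Z-score of each log ratio M against neighbours of similar intensity A."""
    n = len(a_vals)
    order = sorted(range(n), key=lambda i: a_vals[i])  # features sorted by A
    half = window // 2
    z = [0.0] * n
    for pos, idx in enumerate(order):
        lo = max(0, pos - half)
        hi = min(n, pos + half + 1)
        local = [m_vals[order[j]] for j in range(lo, hi)]
        mu = statistics.mean(local)      # local mean of log ratios
        sigma = statistics.stdev(local)  # local standard deviation
        z[idx] = (m_vals[idx] - mu) / sigma if sigma > 0 else 0.0
    return z


z = intensity_dependent_z([3.0, 1.0, 2.0], [1.0, 2.0, 3.0], window=10)
```

Because σ is estimated locally, a given log ratio earns a smaller Z at low intensity, where the spread is wider, than at high intensity.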
Approaches to using filtering algorithms
q_size: small spots with high intensity penalized; large spots may be print defects
q_signal to noise: signal to noise ratio to define confidence
q_local background: degree of local background
q_background variability: variation from average background
q_saturated: defined as a threshold, not a continuous function
q_com: composite quality score based on the continuous and
discrete functions listed above
Taken from Wang et al (2001) NAR 29: e75
q_com in relation to log ratio plot
Taken from Wang et al (2001) NAR 29: e75
Measuring and Quantifying
Variability
• Variability may be measured:
– Between replicate features on an array
– Between two replicates of a sample on an array
– Between two replicates of a sample on different
arrays
– Between different samples in a population
Quantifying variables in
microarray data
• Measured value for each feature is a
combination of the true gene expression,
and the sources of variation listed
• Each component of variation will have its
own distribution with a standard deviation
which can be measured
Variability between replicate
features
• Requires that features are printed multiple
times on a chip
• Optimal if the features are not printed side
by side
• Need to calculate this variability separately
for the 2 channels
• Calculate mean of each replicate
• Calculate the deviation from the mean for each replicate
• Produce MA plots; if needed, can normalize
• Calculate std dev of errors
• If the error distribution is ~ normal, you can calculate v
[Figures: MA plot of Ch1 difference, Diff (Rep1), against Ch1 ave log intensity; histogram (frequency) of Ch1 difference]
Variability between channels
• Perform a self to self hybridization
• Perform all the normalization procedures
discussed earlier
• The variation that is left is going to be due
to random variability in measurement
between the 2 channels
Variability between arrays
• Same samples on different arrays (or just
use the common reference sample in a
larger experiment)
• Now are calculating both the variability due
to the manufacturing of different arrays, and
the variability of different hybridizations; these are confounded variables
Why calculate these values?
• Gives an estimate of comparability in
quality between experiments
• Gives an estimate of noise in the data
relative to population variation
• Can be used to track optimization of
experiment
Variability between individuals
• This is the population variability number
that is used in the power calculation
• Generally will find that this is the largest
source of variation and this is the one that
will not be decreased by improving the
experimental system
How to calculate population
variability
• Calculate log ratio of each gene relative to
the reference sample
• Calculate the average log ratio for each
gene across all samples
• For each gene in each sample, subtract the
log ratio from the average log ratio
• Plot the distribution of deviations and
calculate the standard deviation (and v)
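The steps above can be sketched in Python, assuming the data are arranged as one list of log ratios per gene, with one value per sample. The function names are hypothetical, and the pooled sample standard deviation used here is one reasonable choice, not necessarily the one intended in the slides.

```python
import statistics


def population_deviations(log_ratios):
    """log_ratios[g][s]: log ratio of gene g in sample s vs the reference."""
    deviations = []
    for gene_values in log_ratios:
        gene_mean = statistics.mean(gene_values)  # average across samples
        deviations.extend(gene_mean - v for v in gene_values)
    return deviations


def population_sd(log_ratios):
    """Standard deviation of the pooled deviations across all genes."""
    return statistics.stdev(population_deviations(log_ratios))


data = [[1.0, 3.0], [0.0, 0.0]]  # two genes, two samples
deviations = population_deviations(data)
sd = population_sd(data)
```

A histogram of `deviations` gives the distribution described in the last step.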
http://genome-www5.stanford.edu/mged/normalization.html
Part 3-Data Analysis
• How to choose the interesting genes in your
experiment
• How to study relationships between groups
of genes identified as interesting
• Classification of samples