Microarray Technology

Introduction to Microarrays
Kellie J. Archer, Ph.D.
Assistant Professor
Department of Biostatistics
Microarrays
A snapshot that captures the activity
pattern of thousands of genes at once.
Custom spotted arrays
Affymetrix GeneChip
Spotted Microarray
Process
CTRL
TEST
Affymetrix GeneChip®
Probe Arrays
GeneChip Probe Array
Hybridized Probe Cell
Single stranded, fluorescently
labeled DNA target
Oligonucleotide probe
[Figure dimensions: probe array 1.28 cm; probe cell 24 µm]
Each probe cell or feature contains
millions of copies of a specific
oligonucleotide probe
Over 250,000 different probes
complementary to genetic
information of interest
Image of Hybridized Probe Array
BGT108_DukeUniv
Applications of microarrays
• Cancer research: Molecular characterization
of tumors on a genomic scale; more reliable
diagnosis and effective treatment of cancer
• Immunology: Study of host genomic responses
to bacterial infections
• Model organisms: Multifactorial experiments
monitoring expression response to different
treatments and doses, over time or in
different cell types
• etc.
Applications of Microarrays
• Compare mRNA transcript levels in
different types of cells, i.e., vary
– Tissue (liver vs. brain);
– Treatment (Drugs A, B, and C);
– State (tumor vs. normal);
– Organism (yeast, different strains);
– Timepoint;
– etc.
Affymetrix Design
11 – 20 Probe Pairs interrogate each gene
PM (perfect match): GCGCCGGCTGCAGGAGCAGGAGGAG
MM (mismatch): GCGCCGGCTGCACGAGCAGGAGGAG
Image Analysis: Pixel Level Data
6 x 6 matrix of pixels for each PM and MM probe
HG-U133A GeneChip
Expression Quantification
PM and MM intensities are combined to form an
expression measure for the probe set (gene)
PM: GCGCCGGCTGCAGGAGCAGGAGGAG
MM: GCGCCGGCTGCACGAGCAGGAGGAG
Expression Quantification
• Initially, the Affymetrix signal was calculated as
$$\mathrm{AvgDiff} = \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j)$$
where j indexes the probe pairs in probe
set A. This is known as the “Average
Difference” method.
• Problems:
– Large variability in PM-MM
– MM probes may be measuring signal for another
gene/EST
– PM-MM calculations are sometimes negative
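As a rough illustration of the Average Difference method and the negative-value problem listed above, the sketch below averages PM - MM over the probe pairs of one probe set. The function name `average_difference` and all intensity values are hypothetical, chosen only for illustration; they are not taken from any real array.

```python
def average_difference(pm, mm):
    """Mean of PM_j - MM_j over the probe pairs j of one probe set."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and the same length")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)


# Hypothetical intensities for one probe set with 4 probe pairs (made-up numbers).
pm = [1200.0, 950.0, 400.0, 310.0]
mm = [300.0, 1000.0, 450.0, 900.0]

# Individual PM - MM differences: three of the four are negative here.
diffs = [p - m for p, m in zip(pm, mm)]
avg_diff = average_difference(pm, mm)

print(diffs)
print(avg_diff)
```

With these made-up values a single large positive difference dominates the average, which hints at why high variability in PM - MM makes the measure unstable.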
Expression Quantification
• The mean of a random variable X, μ = E(X), is a
measure of central location of the
density of X.
• The variance of a random variable X is a
measure of spread or dispersion of the
density of X.
• $\mathrm{Var}(X) = E[(X - \mu)^2] = E(X^2) - \mu^2$
• Standard deviation: $\sigma = \sqrt{\mathrm{Var}(X)}$
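A small numerical check of these definitions may help. The sketch below treats a short list of made-up values as equally likely outcomes of X, so it uses the population form of the variance (dividing by n, not n - 1). The helper names `mean` and `variance` are illustrative.

```python
import math


def mean(xs):
    """E(X) when every value in xs is equally likely."""
    return sum(xs) / len(xs)


def variance(xs):
    """Var(X) = E[(X - mu)^2], population form (divide by n)."""
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


# Made-up values, for illustration only.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu = mean(xs)
var = variance(xs)
# Shortcut form of the same quantity: E(X^2) - mu^2.
var_shortcut = mean([x * x for x in xs]) - mu ** 2
sd = math.sqrt(var)

print(mu, var, var_shortcut, sd)
```

Both forms of the variance agree, and the standard deviation is on the same scale as the data.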
Expression Quantification
Illustration:
Average Difference.xls
Sources of Obscuring Variation
in Microarray Measurements
• Sample handling (degree of physical
manipulation, time from extirpation to
freezing)
• Microarray manufacture
• Sample processing (extraction procedure,
RNA integrity & purity, RNA labeling)
• Processing differences (hybridization
chambers, washing modules, scanners)
• Personnel differences
• Random differences in signal intensity in a
data set that covary with the biological
process
Normalization
• The purpose of normalization is to remove
experimental artifacts of no direct interest,
that is, the removal of systematic effects
other than differential expression.
Normalization procedures often include
– background subtraction,
– detection of outliers,
– and removal of variation due to
• differences in sample preparation,
• array differences,
• differences in dye labeling efficiencies,
• and scanning differences.
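The slides do not name a specific normalization method. As one simple illustration, the sketch below assumes global median scaling: each array is rescaled so that its median intensity matches a common target. This is a swapped-in technique that removes only overall brightness differences between arrays, and the intensities are made up.

```python
from statistics import median


def median_scale(arrays):
    """Rescale each array so its median equals the mean of all array medians."""
    medians = [median(a) for a in arrays]
    target = sum(medians) / len(medians)
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]


# Hypothetical intensities: the second array is a uniformly brighter
# copy of the first, as if it had been scanned at a higher gain.
arrays = [
    [100.0, 200.0, 300.0, 400.0, 500.0],
    [150.0, 300.0, 450.0, 600.0, 750.0],
]

normalized = median_scale(arrays)
for row in normalized:
    print(row)
```

After scaling, the two hypothetical replicate arrays agree probe by probe, which is the behaviour one hopes for in technical replicates.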
16 Replicate HG-U133A GeneChips, Before normalization
16 Replicate HG-U133A GeneChips, After normalization
Taxonomy of Microarray Data
Analysis Methods
• Unsupervised Learning: The statistical
analysis seeks to find structure in the
data without knowledge of class labels.
• Supervised Learning: Class or group
labels are known a priori and the goal of
the statistical analysis pertains to
identifying differentially expressed
genes (AKA feature selection) or
identifying combinations of genes that
are predictive of class or group
membership.
Unsupervised Learning
• Unsupervised learning or clustering involves the
aggregation of samples into groups based on similarity
of their respective expression patterns without
knowledge of class labels.
• Examples of Unsupervised Learning methods include
– Hierarchical clustering
– k-means
– k-medoids
– Self Organizing Maps
– Principal Components
– Multidimensional Scaling
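To make one of the listed methods concrete, here is a minimal k-means sketch in plain Python. It groups samples by the similarity of their expression profiles without using class labels. The data, the function names, and the choice of starting centroids are all illustrative assumptions; real analyses typically use many random starts and thousands of genes.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points]


def kmeans(points, init_centroids, n_iter=20):
    centroids = [list(c) for c in init_centroids]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centroids), centroids


# Hypothetical data: 6 samples, each measured on 3 genes.
samples = [
    [1.0, 1.2, 0.9], [1.1, 0.8, 1.0], [0.9, 1.0, 1.1],
    [5.0, 5.2, 4.9], [5.1, 4.8, 5.0], [4.9, 5.0, 5.1],
]

# Start from two of the samples (an arbitrary choice for this sketch).
labels, centroids = kmeans(samples, [samples[0], samples[3]])
print(labels)
print(centroids)
```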
Supervised Learning
• Example methods for Class Comparison / Feature
Selection include
– t-test / Wilcoxon rank sum test
– F-test / Kruskal-Wallis test
– etc.
• Example methods for Class Prediction include
– Weighted voting
– K nearest neighbors
– Compound Covariate Predictors
– Classification trees
– Support vector machines
– etc.
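As a sketch of class comparison, the code below computes a two-sample t statistic for each gene and ranks genes by its absolute value. It assumes the unequal-variance (Welch) form, which the slides do not specify, and it stops at the statistic without computing p-values. Gene names and expression values are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Two-sample t statistic with unequal variances (Welch form).

    Uses the sample variance (n - 1 denominator); undefined if both
    groups have zero variance.
    """
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


# Hypothetical log-scale expression values: (normal group, tumor group).
expr = {
    "gene_A": ([5.0, 5.5, 6.0, 5.5], [9.0, 9.5, 10.0, 9.5]),
    "gene_B": ([7.0, 8.0, 6.0, 7.0], [7.5, 6.5, 7.0, 8.0]),
}

t_stats = {gene: welch_t(x, y) for gene, (x, y) in expr.items()}

# Genes with the largest |t| are the strongest candidates for
# differential expression between the two groups.
ranked = sorted(t_stats, key=lambda g: abs(t_stats[g]), reverse=True)
print(ranked)
```

In a real experiment with thousands of genes, the resulting tests would also need a multiple-comparison adjustment.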
Supervised Learning:
Class Prediction
• Risk of over-fitting the data: may have a
perfect discriminator for the data set at
hand but the same model may perform poorly
on independent data sets.
• Most prediction methods are intended for
large ‘n’ (samples) small ‘p’ (covariates)
datasets, whereas microarray datasets typically
have few samples and thousands of genes.
• Process is to
– Fit model
– Check model adequacy
– Make an inference
Class Prediction: Checking Model
Adequacy
• Regardless of the algorithm used, once the
prediction rule has been defined it is
essential to calculate an unbiased estimate
of the true error rate.
Class Prediction: Checking Model
Adequacy
• In a data rich situation,
– randomly divide the dataset into two parts,
representing a training and test dataset.
– Build the prediction algorithm using the
training dataset
– Once a final model has been developed, the
prediction rule is applied to the test
dataset to estimate the misclassification
error
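The three steps above can be sketched as follows. The classifier is a deliberately simple, hypothetical one-gene rule (threshold at the midpoint of the two class means), standing in for whatever prediction algorithm is actually used. The data, the 30% test fraction, and the seed are assumptions for illustration.

```python
import random


def train_test_split(samples, test_fraction=0.3, seed=0):
    """Randomly divide samples into a training part and a test part."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_test = int(round(len(samples) * test_fraction))
    test = [samples[i] for i in idx[:n_test]]
    train = [samples[i] for i in idx[n_test:]]
    return train, test


def fit_midpoint(train):
    """Hypothetical rule: predict class 1 when x exceeds the midpoint
    of the two class means estimated from the training data."""
    m0 = sum(x for x, y in train if y == 0) / sum(1 for _, y in train if y == 0)
    m1 = sum(x for x, y in train if y == 1) / sum(1 for _, y in train if y == 1)
    threshold = (m0 + m1) / 2
    return lambda x: 1 if x > threshold else 0


# Hypothetical data: (expression of one gene, class label), 10 per class.
data = [(1.0 + 0.1 * i, 0) for i in range(10)] + \
       [(5.0 + 0.1 * i, 1) for i in range(10)]

train, test = train_test_split(data)
predict = fit_midpoint(train)  # built on the training data only
test_error = sum(predict(x) != y for x, y in test) / len(test)
print(len(train), len(test), test_error)
```

The key design point is that the test samples play no part in fitting the rule, so the resulting error is not biased downward by over-fitting.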
Class Prediction: Checking Model
Adequacy
• For small sample sizes, withholding a
large portion of the data for validation
purposes may limit the ability to
develop a prediction rule. Therefore,
use cross-validation techniques to
assess the error.
Class Prediction: Checking Model
Adequacy
• K-fold cross-validation requires one to randomly split
the dataset into K equally sized groups.
• Thereafter, the model is fit to K-1 parts of the data
and the generalization error is calculated using the
Kth remaining part of the data.
• This procedure is repeated so that the generalization
error is estimated for each of the K parts of the
data, providing an overall estimate of the
generalization error and its associated standard
error.
Class Prediction: Checking Model
Adequacy
[Figure: dataset partitioned into 10 groups, numbered 1 through 10]
• Leave out data in group 3
• Fit the model to the data in groups 1 – 2, 4 – 10 (learning
dataset)
• Calculate the error using observations in group 3 as the
test dataset
• Do this for each of the 10 partitions
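The 10-fold procedure above can be sketched in a few lines. The one-gene midpoint classifier, the data (including one deliberately atypical class-1 sample), and the function names are hypothetical; any model-fitting routine that returns a prediction function could be substituted for `fit_midpoint`.

```python
import math
import random


def kfold_indices(n, k, seed=0):
    """Randomly split indices 0..n-1 into k groups of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(data, fit, k=10, seed=0):
    """Return the mean held-out error over k folds and its standard error."""
    errors = []
    for held_out in kfold_indices(len(data), k, seed):
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        test = [data[i] for i in held_out]
        predict = fit(train)  # fit on the other k - 1 groups only
        errors.append(sum(predict(x) != y for x, y in test) / len(test))
    mean_err = sum(errors) / k
    std_err = math.sqrt(sum((e - mean_err) ** 2 for e in errors) / (k - 1) / k)
    return mean_err, std_err


def fit_midpoint(train):
    """Hypothetical rule: threshold at the midpoint of the class means."""
    m0 = sum(x for x, y in train if y == 0) / sum(1 for _, y in train if y == 0)
    m1 = sum(x for x, y in train if y == 1) / sum(1 for _, y in train if y == 1)
    threshold = (m0 + m1) / 2
    return lambda x: 1 if x > threshold else 0


# Hypothetical data: 20 samples; the last class-1 sample looks like class 0.
data = [(1.0 + 0.1 * i, 0) for i in range(10)] + \
       [(5.0 + 0.1 * i, 1) for i in range(9)] + [(1.5, 1)]

mean_err, std_err = cross_validate(data, fit_midpoint, k=10)
print(mean_err, std_err)
```

Every sample is used for testing exactly once, so the atypical sample is misclassified in whichever fold holds it out, and that single error shows up in the overall estimate.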