Transcript microarrays
Microarray Technology and Data Analysis
(November 28, 2007)
slides assembled by Dong-Guk Shin and J Peter Gogarten
Introduction to Microarray Technology
Two color microarrays:
two conditions
two labels for cDNA
develop slide with mRNAs
hybridize mixture of both probes to printed glass slides
make images, one for each probe
fuse in computer
An alternative is to synthesize the DNA directly onto the
matrix (slides from Affymetrix)
created through photolithography
on cell in array
hybridization to labeled RNA from sample
result of hybridization to array
Experimental design and sources of variation
Effect Size
Biological Variation
Array & Environment Variation
Sample Processing Variation
Technical Variation
E.g.: Two mice in two different cages
Rules of thumb:
• Biological replicates are a must!
• As many biological replicates as you can afford!
• Cell population as homogeneous as possible!
“One characteristic common to all biological material is that it varies.”
Finney, 1953
Control of Experiment Variance
“If I had to replicate my experiments, I could only do half as much.”
Botstein, 1999
Technical Replication
• Technical variance 0
• High-precision experiment
• Technical replication: estimation of technical variation
• Biological effect inaccurate
Biological Replication
• Biological variance 0
• High-accuracy experiment
• Biological and technical variation are confounded
• Measurement precision decreased
Degree of Replication
• Robustness of the method: spot replication
• Dye swap: array replication
• Robustness of the biological assay
• Absolute transcript frequency/signal intensity: sample replication
• Relative transcript frequency associated with the biological effect: sample replication
• Cellular sample composition: sample replication
Statistical Analysis and Design
The number of independent data points is a function of the comparison design:
Single Color
• Post hoc comparison
Two Color
• Direct comparison
– Post Hoc Design
– Loop Design (Balanced)
• Indirect comparison
2 data points/gene/condition
biological and technical variation not confounded
8 data points/gene/condition
Reference Design (Unbalanced)
biological and technical variation not confounded
Reference overrepresented
4 data points/gene/condition
Pooling
A reference design: the red and green arrows represent chips.
from http://discover.nci.nih.gov/microarrayAnalysis/Experimental.Design.jsp
A loop design: arrows represent chips
with samples labeled as indicated.
A saturated design w/o dye swap
A design for a comparative study of the effect of a treatment
on two biological strains with replicates and a few dye swaps
from http://discover.nci.nih.gov/microarrayAnalysis/Experimental.Design.jsp
Topic 2
Data Preprocessing
• Background Correction
• Normalization
Background Correction
• None
– DNA vs Substrate
– No Imputation/Offset
• Local
– Negative Signal Intensities likely
– Imputation/Offset required
• Global
– Negative Signal Intensities likely
– Imputation/Offset required
• Moving Minimum
– 3x3 spot average background
– Negative Signal Intensities likely
– Imputation/Offset required
• Edwards
– log-linear interpolation of background intensities
– Background Intensity insensitive
– Test for Imputation
• Norm-Exp
– regression based background estimation using Signal to Noise ratios
– Background Intensity sensitive
– No Imputation
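A minimal sketch of local background subtraction, showing why negative signal intensities arise and why some imputation or offset is then needed before taking logs. The function name and the `floor` parameter are assumptions made for illustration; replacing non-positive values with a small constant is only one simple choice, not the method of any particular package.

```python
def local_background_correct(foreground, background, floor=1.0):
    """Subtract each spot's local background from its foreground intensity.

    Spots whose corrected value is not positive are replaced by `floor`
    (a hypothetical imputation value) so that log2 stays defined.
    """
    corrected = []
    for fg, bg in zip(foreground, background):
        value = fg - bg
        corrected.append(value if value > 0 else floor)
    return corrected


# The third spot is dimmer than its surroundings, so plain subtraction
# would give a negative intensity.
spots_fg = [1200.0, 340.0, 95.0]
spots_bg = [100.0, 120.0, 110.0]
print(local_background_correct(spots_fg, spots_bg))
```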
Normalization
Background correction
Expression ratio: Ti= Ri/Gi
log2(ratio)
log2(1) = 0, log2(2) = 1, log2(1/2) = -1, log2(4) = 2, log2(1/4) = -2
Total intensity normalization: if one has a large random sample of genes, most of which remain unchanged, one can normalize so that the mean ratio (T) for all spots is 1.
(For log2(Ti), this correction corresponds to subtracting a constant;
see http://www.nature.com/cgi-taf/DynaPage.taf?file=/ng/journal/v32/n4s/full/ng1032.html&filetype=pdf )
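A short sketch of total intensity normalization on log2 ratios. It assumes the scaling factor is the ratio of summed red to summed green intensities, which is one common formulation; the function name is hypothetical.

```python
import math


def total_intensity_normalize(red, green):
    """Return log2(R/G) per spot after total intensity normalization."""
    # Assumed scaling factor: total red signal over total green signal.
    n_total = sum(red) / sum(green)
    shift = math.log2(n_total)
    # Dividing every ratio by n_total is a constant subtraction in log2 space.
    return [math.log2(r / g) - shift for r, g in zip(red, green)]


# Red channel uniformly twice as bright as green: after normalization
# every log2 ratio is 0, i.e. "unchanged".
print(total_intensity_normalize([200.0, 400.0, 800.0], [100.0, 200.0, 400.0]))
```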
Two Color Analytical Plots
[Figure: for each comparison (Cond. 1a vs. 2a, Cond. 1b vs. 2b, Cond. 1c vs. 2c): synthetic image, scatterplot, and ratio histogram]
Typical depiction: ratio versus intensity (log R + log G)
From: http://www.nature.com/cgi-taf/DynaPage.taf?file=/ng/journal/v32/n4s/full/ng1032.html&filetype=pdf
after locally weighted linear regression analysis
From: http://www.nature.com/cgi-taf/DynaPage.taf?file=/ng/journal/v32/n4s/full/ng1032.html&filetype=pdf
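A simplified sketch of locally weighted linear regression used for normalization: fit a local line of log ratio against intensity at every spot, then subtract the fitted trend. It uses tricube weights and omits the robustness iterations of full lowess implementations; the function names and the `frac` parameter are assumptions for illustration.

```python
def lowess_fit(x, y, frac=0.5):
    """Return a locally weighted linear fit of y on x, evaluated at each x[i]."""
    n = len(x)
    k = min(n, max(2, round(frac * n)))  # points used to set each local bandwidth
    fitted = []
    for x0 in x:
        # Bandwidth h: distance from x0 to its k-th nearest point (x0 included).
        h = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12
        # Tricube weights: 1 at x0, falling to 0 at distance h and beyond.
        w = [max(0.0, 1.0 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def lowess_normalize(intensity, log_ratio, frac=0.5):
    """Subtract the intensity-dependent trend from each log ratio."""
    trend = lowess_fit(intensity, log_ratio, frac)
    return [m - t for m, t in zip(log_ratio, trend)]
```

With a purely intensity-dependent dye bias (log ratio drifting linearly with intensity), the normalized values come out close to zero.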
Beware
No data adjustment, however sophisticated or painstaking, can convert low-quality data into high-quality data.
Data adjustment always removes a part of the biology.
!!Use it as sparingly as possible!!
Filtering Data
Outliers in the original data (in red) are excluded from the remainder of the data (blue) selected on the basis of a two-standard-deviation cut on the replicates.
From: http://www.nature.com/cgi-taf/DynaPage.taf?file=/ng/journal/v32/n4s/full/ng1032.html&filetype=pdf
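A rough sketch of a two-standard-deviation cut on replicates. It assumes each gene has two replicate log2 ratios and flags genes whose replicate disagreement lies more than `n_sd` standard deviations from the mean disagreement; the function name and the exact statistic are illustrative assumptions, not the procedure of the cited figure.

```python
import statistics


def filter_replicate_outliers(rep1, rep2, n_sd=2.0):
    """Return indices of genes whose two replicates agree within n_sd SDs."""
    diffs = [a - b for a, b in zip(rep1, rep2)]
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return [i for i, d in enumerate(diffs) if abs(d - mean) <= n_sd * sd]


# Ten genes; the last one disagrees strongly between replicates.
rep_a = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, 0.0, -0.3, 0.1, 5.0]
rep_b = [0.0, -0.1, 0.1, 0.2, 0.0, 0.1, -0.1, -0.2, 0.0, 0.0]
print(filter_replicate_outliers(rep_a, rep_b))
```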
Statistical Methods for Identifying Differentially Expressed Genes in Replicated Microarray Experiments
Gene Expression Data represented as N x M Matrix
[Figure: N x M matrix with rows Gene 1 ... Gene N and columns Sample 1 ... Sample M, annotated "Expression Signature" and "Expression Profile"]
N rows correspond to the N genes.
M columns correspond to the M samples (microarray experiments).
Each column = a sample or a replicate
Example: Four replicate spots per array produce four columns of R/G ratios. If four replicate arrays are used, this produces a 16-column matrix, or 32 columns if R and G values are stored separately.
Student’s t-Test Statistics
H0: The groups are not different
[Figure: distribution with regions covering 68%, 95%, and 99% of all samples]
Naïve solution: do a t-test for each gene.
Multiplicity problem: the probability of at least one false positive increases with the number of tests.
(Bonferroni correction is too conservative!)
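A small sketch of the per-gene t statistic and of the multiplicity arithmetic. It assumes the pooled-variance (equal variance) form of Student's t; the function names are hypothetical, and P-values are left out because they need the t distribution.

```python
import math
import statistics


def student_t(group_a, group_b):
    """Pooled-variance two-sample t statistic for one gene."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * statistics.variance(group_a)
                  + (nb - 1) * statistics.variance(group_b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se


def expected_false_positives(n_genes, alpha):
    """Expected number of genes called significant if H0 holds for all of them."""
    return n_genes * alpha


def bonferroni_threshold(n_genes, alpha):
    """Per-gene P-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_genes


print(student_t([2.0, 2.2, 1.8], [1.0, 1.2, 0.8]))
print(expected_false_positives(10000, 0.05))
print(bonferroni_threshold(10000, 0.05))
```

Testing 10,000 genes at 0.05 each is expected to give hundreds of false calls even when nothing changes, while the Bonferroni cutoff becomes so small that few real changes pass with a handful of replicates.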
• Significance Analysis of Microarrays
• Linear Models for Microarray Data: package to analyze MA data; good plot capabilities.
• Semi-parametric hierarchical (SPH) mixture model
Significance Analysis of Microarrays (SAM) uses balanced permutations (“re-labeling” of sample versus control intensities) to generate an expectation for the comparison.
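The re-labeling idea can be illustrated with a plain exhaustive permutation test on one gene. This is not the SAM statistic or its balanced-permutation scheme; it is a simplified sketch with a hypothetical function name, using the difference of group means as the test statistic.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(sample, control):
    """Fraction of re-labelings whose mean difference is at least as extreme."""
    pooled = list(sample) + list(control)
    n = len(sample)
    observed = abs(mean(sample) - mean(control))
    count = total = 0
    # Every way of choosing which n values carry the "sample" label.
    for idx in combinations(range(len(pooled)), n):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            count += 1
    return count / total


print(permutation_p_value([5.0, 6.0, 7.0], [1.0, 2.0, 3.0]))
```

With three replicates per group there are only 20 re-labelings, so the smallest attainable two-sided P-value is 2/20; this is one reason more replicates help.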
Volcano plots compare significance (y-axis) against effect (x-axis)
• The plot compares significance determinations obtained with MAANOVA (MicroArray ANalysis Of VAriance)
• On the plot, the y-axis value is -log10(P-value) for the F1 test. The x-axis value is proportional to the fold change.
• A horizontal line represents the significance threshold of the F1 test.
• Blue dots: EE genes
• Green dots: F3
• Orange dots: Fs
• F2 (In the example graph, F2 tests were not run.)
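A tiny sketch of how one gene becomes a point on a volcano plot. Using log2 of the fold change for the x-axis is an assumption (the slide only says the x value is proportional to the fold change); the function names are hypothetical.

```python
import math


def volcano_point(fold_change, p_value):
    """Map one gene to (x, y) = (log2 fold change, -log10 P-value)."""
    return math.log2(fold_change), -math.log10(p_value)


def threshold_line(alpha):
    """Height of the horizontal significance line for a P-value cutoff alpha."""
    return -math.log10(alpha)


# A 4-fold induced gene with P = 0.001, plotted against a cutoff of 0.01.
print(volcano_point(4.0, 0.001), threshold_line(0.01))
```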
Microarray Data: Clustering
Clustering
Assign n similar objects to groups
Example: green/ red data points were generated from two different normal distributions
Why cluster genes?
• Identify groups of possibly co-regulated genes
• Identify temporal or spatial gene expression patterns
Why cluster experiments/samples?
• Detect experimental artifacts/bad hybridizations
• Identify new classes of biological samples (e.g. tumor
subtypes)
To Do Clustering You Need …
Distance measure
(Example: Intra-Cluster Distances for hierarchical clustering)
gene expression #1: $x = (x_1, \ldots, x_n)$, gene expression #2: $y = (y_1, \ldots, y_n)$
• Euclidean: $d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• Manhattan: $d_M(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
• Correlation: $d_C(x, y) = 1 - \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
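The three distance measures can be written out directly. A minimal sketch, assuming each expression vector is a plain list of numbers of equal length; the function names are chosen here for illustration.

```python
import math


def euclidean(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def correlation_distance(x, y):
    """One minus the Pearson correlation of x and y."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return 1 - num / den


# Two genes with the same shape but different magnitude: far apart in
# Euclidean terms, yet at correlation distance (close to) zero.
g1, g2 = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(euclidean(g1, g2), manhattan(g1, g2), correlation_distance(g1, g2))
```

The choice matters for clustering: correlation distance groups genes by the shape of their profile, while Euclidean and Manhattan distances also respond to overall expression level.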
To Do Clustering You Also Need …
Cluster Algorithm/Method
(1) Hierarchical
(2) Parametric (Partitioning)
Basic Idea
• small within-cluster distances
• large between-cluster distances
Hierarchical Clustering
[Figure: dendrogram over objects 1 to 5. Agglomerative clustering merges bottom-up ({1,5}, {3,4}, {1,2,5}, {1,2,3,4,5}); divisive clustering splits top-down from {1,2,3,4,5}.]
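A compact sketch of the agglomerative (bottom-up) direction: start with every object in its own cluster and repeatedly merge the two closest clusters. Single linkage (closest pair of members) is assumed here as the between-cluster distance; the function name and the example data are hypothetical.

```python
def agglomerative(points, distance):
    """Return the merge history of single-linkage agglomerative clustering.

    Each entry is the sorted list of point indices in the newly formed cluster.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest single-linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(distance(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


values = [1.0, 1.2, 5.0, 5.1, 9.0]
print(agglomerative(values, lambda p, q: abs(p - q)))
```

Reading the merge history backwards gives the divisive (top-down) view of the same tree. Note that every object ends up joined eventually, so clusters have to be defined by cutting the tree at some height.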
Hierarchical clustering
Clustered display of data from time course of serum
stimulation of primary human fibroblasts (grown in
culture and deprived of serum for 48 hr, serum was
added back and samples taken at time 0, 15 min, 30
min, 1 hr, 2 hr, 3 hr, 4 hr, 8 hr, 12 hr, 16 hr, 20 hr, 24
hr). All measurements are relative to time 0. Genes
were selected for this analysis if their expression
level deviated from time 0 by at least a factor of 3.0
in at least 2 time points. Each gene is represented by
a single row of colored boxes; each time point is
represented by a single column.
Labeled clusters contain multiple genes involved in
(A) cholesterol biosynthesis, (B) the cell cycle, (C)
the immediate-early response, (D) signaling and
angiogenesis, and (E) wound healing and tissue
remodeling. These clusters also contain named genes
not involved in these processes and numerous
uncharacterized genes.
Eisen, Michael B. et al. (1998) Proc. Natl. Acad.
Sci. USA 95, 14863-14868
Copyright ©1998 by the National Academy of Sciences
Volcano Plot and Heatmaps
Parametric Clustering (partitioning)
• K-Means
• K-Medoids (PAM)
• SOM
• Fuzzy-C Means
Partitioning Algorithms: Basic Concept
• Partitioning method: Construct a partition of a database D
of n objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means (MacQueen’67): Each cluster is represented by the
center of the cluster
– k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in 4 steps:
– Step 1: Partition objects into k nonempty subsets.
– Step 2: Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
– Step 3: Assign each object to the cluster with the nearest seed point.
– Step 4: Go back to Step 2; stop when no assignments change.
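The four steps can be sketched as follows. This is a minimal illustration, assuming points are tuples of floats, squared Euclidean distance, at least k points, and a round-robin initial partition (any nonempty initial partition would do); the function name is hypothetical.

```python
def k_means(points, k, max_iter=100):
    """Cluster points into k groups; return (labels, centroids)."""
    # Step 1: arbitrary initial partition; round-robin keeps all k subsets nonempty.
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: centroid (mean point) of each cluster in the current partition.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Step 4: stop when no assignment changes, otherwise repeat from Step 2.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = k_means(data, k=2)
print(labels, centroids)
```

The result depends on the initial partition, so in practice the algorithm is often run from several starting partitions and the best result kept.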
Parametric or Hierarchical (Non-Parametric)?
Parametric:
Advantages
• Optimal for certain criteria.
• Genes automatically assigned to clusters.
Disadvantages
• Need initial k.
• Often require long computation times.
• Every gene is assigned to a cluster.
Hierarchical:
Advantages
• Faster computation.
• Visual representation.
Disadvantages
• Unrelated genes are eventually joined.
• Rigid; cannot correct later for erroneous decisions made earlier.
• Hard to define clusters.
Meta Analyses of MA data:
GO Analysis
Pathway Analysis
To do:
For Friday
–Read chapter 18
–Browse through http://jura.wi.mit.edu/bio/education/bioinfo/lecture10-color.pdf and
–http://www.nature.com/cgi-taf/DynaPage.taf?file=/ng/journal/v32/n4s/full/ng1032.html&filetype=pdf
For Monday:
–Refresh your memory on McRobot and Bayesian analyses
–Go through quiz 8 (will be posted Friday/Saturday, due following Wednesday)