Transcript 91.510_ch7

September 24, 2003
Microarray data analysis
Copyright notice
Many of the images in this powerpoint presentation
are from Bioinformatics and Functional Genomics
by Jonathan Pevsner (ISBN 0-471-21004-8).
Copyright © 2003 by John Wiley & Sons, Inc.
These images and materials may not be used
without permission from the publisher. We welcome
instructors to use these powerpoints for educational
purposes, but please acknowledge the source.
The book has a homepage at http://www.bioinfbook.org
that includes hyperlinks to the book chapters.
Microarray data analysis
• begin with a data matrix (gene expression values
versus samples)
Typically, there are
many genes
(>> 10,000) and
few samples (~ 10)
Page 190
Microarray data analysis
• begin with a data matrix (gene expression values
versus samples)
• then apply:
  -- preprocessing
  -- inferential statistics
  -- descriptive statistics
Page 190
Microarray data analysis: preprocessing
Observed differences in gene expression could be
due to transcriptional changes, or they could be
caused by artifacts such as:
• different labeling efficiencies of Cy3, Cy5
• uneven spotting of DNA onto an array surface
• variations in RNA purity or quantity
• variations in washing efficiency
• variations in scanning efficiency
Page 191
Microarray data analysis: preprocessing
The main goal of data preprocessing is to remove
the systematic bias in the data as completely as
possible, while preserving the variation in gene
expression that occurs because of biologically
relevant changes in transcription.
A basic assumption of most normalization procedures
is that the average gene expression level does not
change in an experiment.
Page 191
Data analysis: global normalization
Global normalization is used to correct two or more
data sets. In one common scenario, samples are
labeled with Cy3 (green dye) or Cy5 (red dye) and
hybridized to DNA elements on a microarray. After
washing, probes are excited with a laser and detected
with a scanning confocal microscope.
Page 192
Data analysis: global normalization
Example: total fluorescence in
Cy3 channel = 4 million units
Cy5 channel = 2 million units
Then the uncorrected ratio for a gene could show
2,000 units versus 1,000 units. This would artifactually
appear to show 2-fold regulation.
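The arithmetic of this example can be sketched in a few lines of Python. The function name and the choice to rescale by total channel fluorescence are illustrative assumptions, not the method of any particular software package.

```python
def normalize_ratio(cy3_signal, cy5_signal, cy3_total, cy5_total):
    """Cy3/Cy5 ratio for one gene after equalizing total channel fluorescence."""
    # Scale the Cy5 channel so that both channels have the same total signal.
    scale = cy3_total / cy5_total
    return cy3_signal / (cy5_signal * scale)

# Values from the slide: 2,000 versus 1,000 units for one gene,
# with channel totals of 4 million (Cy3) and 2 million (Cy5) units.
raw_ratio = 2000 / 1000
corrected = normalize_ratio(2000, 1000, 4_000_000, 2_000_000)
print(raw_ratio, corrected)  # 2.0 1.0
```

After correction the apparent 2-fold difference disappears, because the Cy3 channel was twice as bright overall.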
Page 192
Data analysis: global normalization
Global normalization procedure
Step 1: subtract background intensity values
(use a blank region of the array)
Step 2: globally normalize so that the average ratio = 1
(apply this to 1-channel or 2-channel data sets)
Page 192
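The two steps above can be sketched as follows. This is a minimal illustration that assumes a single background value per channel and that every spot is brighter than the background; the function and variable names are hypothetical.

```python
from statistics import mean

def global_normalize(cy5, cy3, cy5_background, cy3_background):
    """Return Cy5/Cy3 ratios rescaled so that their average is 1."""
    # Step 1: subtract the background intensity from every spot.
    cy5_bg = [v - cy5_background for v in cy5]
    cy3_bg = [v - cy3_background for v in cy3]
    ratios = [r / g for r, g in zip(cy5_bg, cy3_bg)]
    # Step 2: divide by the mean ratio so that the average ratio becomes 1.
    average = mean(ratios)
    return [r / average for r in ratios]

cy5 = [1100.0, 2100.0, 4100.0]   # made-up intensities for three genes
cy3 = [600.0, 2100.0, 1100.0]
normalized = global_normalize(cy5, cy3, 100.0, 100.0)
print(normalized)
```

Some pipelines center the mean of the log ratios at 0 instead; the sketch follows the slide and uses the average ratio.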
Microarray data preprocessing
Some researchers use housekeeping genes for
global normalization
Visit the Human Gene Expression (HuGE) Index:
www.HugeIndex.org
Page 192
Scatter plots
Useful to represent gene expression values from
two microarray experiments (e.g. control, experimental)
Each dot corresponds to a gene expression value
Most dots fall along a line
Outliers represent up-regulated or down-regulated genes
Page 193
Scatter plot analysis of microarray data
Page 193
[Figure: Differential Gene Expression in Different Tissue and Cell Types (fibroblast, brain, astrocyte). Scatter plots of expression level (sample 1) versus expression level (sample 2), each axis running from low to high.]
Page 193
Log-log transformation
Page 195
Scatter plots
Typically, data are plotted on log-log coordinates
Visually, this spreads out the data and offers symmetry
time    behavior      raw ratio value   log2 ratio value
t=0     basal         1.0               0.0
t=1h    no change     1.0               0.0
t=2h    2-fold up     2.0               1.0
t=3h    2-fold down   0.5               -1.0
Page 194, 197
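The log2 values in the time course above can be checked directly. The dictionary below simply restates the raw ratios from the slide.

```python
import math

# Raw expression ratios from the time course on the slide.
raw_ratios = {"t=0": 1.0, "t=1h": 1.0, "t=2h": 2.0, "t=3h": 0.5}

# A 2-fold increase and a 2-fold decrease become +1 and -1: symmetric about 0.
log2_ratios = {time: math.log2(ratio) for time, ratio in raw_ratios.items()}
print(log2_ratios)
# {'t=0': 0.0, 't=1h': 0.0, 't=2h': 1.0, 't=3h': -1.0}
```

On the raw scale, up-regulation spans 1 to infinity while down-regulation is squeezed between 0 and 1; the log transform removes that asymmetry.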
[Figure: log ratio (up-regulated versus down-regulated) plotted against mean log intensity (expression level, low to high).]
Page 196
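The coordinates of such a plot can be computed per gene as sketched below. The use of base-10 logarithms is an assumption (it matches the axis labels on the SNOMAD slides), and the function name is hypothetical.

```python
import math

def ratio_intensity(exp, con):
    """Return (mean log10 intensity, log10 ratio) for one gene."""
    log_ratio = math.log10(exp / con)                         # y-axis: up or down
    mean_log_intensity = (math.log10(exp) + math.log10(con)) / 2  # x-axis: low to high
    return mean_log_intensity, log_ratio

# A gene measured at 1,000 units (experimental) and 100 units (control).
x, y = ratio_intensity(1000.0, 100.0)
print(x, y)
```

Plotting the log ratio against the mean log intensity makes intensity-dependent bias visible as a curved or tilted cloud of points.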
SNOMAD converts array data to scatter plots
http://snomad.org
[Figure: scatter plots of EXP versus CON intensities (linear-linear plot and log-log plot), and a plot of Log10(Ratio) versus Mean(Log10(Intensity)) with 2-fold lines separating EXP > CON from EXP < CON.]
Page 196-197
SNOMAD corrects local variance artifacts
[Figure: Log10(Ratio) versus Mean(Log10(Intensity)) with a robust local regression fit; the residuals from the fit give the Corrected Log10(Ratio), which is plotted against Mean(Log10(Intensity)). 2-fold lines mark EXP > CON and EXP < CON.]
Page 196-197
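The idea of removing an intensity-dependent trend and keeping the residuals can be illustrated with a much cruder stand-in for robust local regression: a sliding-window median. This is an assumption made to keep the sketch short and is not the fitting procedure SNOMAD uses.

```python
from statistics import median

def local_median_correct(mean_log_intensity, log_ratio, window=5):
    """Subtract a sliding-window median of the log ratios (the residuals).

    Genes are ordered by mean log intensity; each window holds up to
    `window` neighboring genes in that ordering (fewer at the edges).
    """
    order = sorted(range(len(log_ratio)), key=lambda i: mean_log_intensity[i])
    half = window // 2
    corrected = [0.0] * len(log_ratio)
    for pos, i in enumerate(order):
        lo = max(0, pos - half)
        hi = min(len(order), pos + half + 1)
        local_trend = median(log_ratio[j] for j in order[lo:hi])
        corrected[i] = log_ratio[i] - local_trend
    return corrected

# Made-up data: every gene carries a bias of 0.3, one gene is truly up-regulated.
intensity = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ratios = [0.3, 0.3, 1.3, 0.3, 0.3, 0.3]
corrected = local_median_correct(intensity, ratios, window=3)
print(corrected)
```

The shared bias is removed while the one outlying gene keeps its excess log ratio of about 1.0.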
SNOMAD describes regulated genes in Z-scores
[Figure: Corrected Log10(Ratio) versus Mean(Log10(Intensity)), shown with the locally estimated standard deviation of positive ratios and of negative ratios. The resulting Local Log10(Ratio) Z-score plot shows contours at Z = ±1, ±2, and ±5 alongside the 2-fold lines.]
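A local Z-score divides each corrected log ratio by an estimate of the spread among genes of similar intensity. The sketch below is a simplification: it uses one sliding-window root-mean-square spread, whereas the slide describes separate local estimates for positive and negative ratios. All names are hypothetical.

```python
import math

def local_z_scores(mean_log_intensity, corrected_log_ratio, window=5):
    """Z = corrected log ratio / local spread among intensity neighbors."""
    order = sorted(range(len(corrected_log_ratio)),
                   key=lambda i: mean_log_intensity[i])
    half = window // 2
    z = [0.0] * len(corrected_log_ratio)
    for pos, i in enumerate(order):
        lo = max(0, pos - half)
        hi = min(len(order), pos + half + 1)
        local = [corrected_log_ratio[j] for j in order[lo:hi]]
        # Spread about zero, since corrected ratios are already centered.
        spread = math.sqrt(sum(v * v for v in local) / len(local))
        z[i] = corrected_log_ratio[i] / spread if spread > 0 else 0.0
    return z

intensity = [1.0, 2.0, 3.0, 4.0]
corrected = [0.1, -0.1, 0.1, -0.1]   # made-up residuals
z = local_z_scores(intensity, corrected, window=3)
print(z)
```

Expressing changes as Z-scores means a 2-fold change counts for more where measurements are tight and for less where they are noisy.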
Inferential statistics
Inferential statistics are used to make inferences
about a population from a sample.
Hypothesis testing is a common form of inferential
statistics. A null hypothesis is stated, such as:
“There is no difference in signal intensity for the gene
expression measurements in normal and diseased
samples.” The alternative hypothesis is that there
is a difference.
We use a test statistic to decide whether to accept or
reject the null hypothesis. For many applications,
we set the significance level α to 0.05 and reject the
null hypothesis when p < 0.05.
Page 199
Inferential statistics
A t-test is commonly used to assess the difference
in mean values between two groups. Its test statistic is:
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s} = \frac{\text{difference between mean values}}{\text{variability (noise)}} \]
Questions
Is the sample size (n) adequate?
Are the data normally distributed?
Is the variance of the data known?
Is the variance the same in the two groups?
Is it appropriate to set the significance level to p < 0.05?
Page 199
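As a concrete case of the t statistic, the sketch below computes an unpaired, pooled-variance version. It assumes equal variance in the two groups (one of the questions raised above), and the expression values are made up. Converting t to a p value requires the t distribution, which is left out here.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(group1, group2):
    """Return (t, degrees of freedom) for an unpaired, equal-variance t-test."""
    n1, n2 = len(group1), len(group2)
    # Pooled sample variance: the "variability (noise)" term.
    pooled = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled * (1 / n1 + 1 / n2))
    # Difference between mean values divided by the noise.
    t = (mean(group1) - mean(group2)) / standard_error
    return t, n1 + n2 - 2

normal = [1.0, 2.0, 3.0]      # hypothetical expression values for one gene
diseased = [4.0, 5.0, 6.0]
t, df = pooled_t_statistic(normal, diseased)
print(round(t, 3), df)  # -3.674 4
```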
Inferential statistics
Paradigm                      Parametric test    Nonparametric
Compare two unpaired groups   Unpaired t-test    Mann-Whitney test
Compare two paired groups     Paired t-test      Wilcoxon test
Compare 3 or more groups      ANOVA
Page 198-200
Inferential statistics
Is it appropriate to set the significance level to p < 0.05?
If you hypothesize that a specific gene is up-regulated,
you can set the probability value to 0.05.
You might measure the expression of 10,000 genes and
hope that any of them are up- or down-regulated. But
you can expect to see 5% (500 genes) regulated at the
p < 0.05 level by chance alone. To account for the
thousands of repeated measurements you are making,
some researchers apply a Bonferroni correction.
The level for statistical significance is divided by the
number of measurements, e.g. the criterion becomes:
p < 0.05/10,000, i.e. p < 5 × 10^-6
Page 199
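The numbers on this slide can be reproduced in a short sketch. The gene names and p values are invented for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

n_genes = 10_000
expected_by_chance = 0.05 * n_genes            # about 500 genes at p < 0.05
threshold = bonferroni_threshold(0.05, n_genes)  # about 5e-06

# Hypothetical p values for three genes.
p_values = {"geneA": 1e-7, "geneB": 0.01, "geneC": 4e-6}
significant = [gene for gene, p in p_values.items() if p < threshold]
print(expected_by_chance, significant)
```

geneB would pass an uncorrected p < 0.05 cutoff but fails the corrected threshold.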
Significance analysis of microarrays (SAM)
SAM -- an Excel plug-in (URL: page 202)
-- modified t-test
-- adjustable false discovery rate
Page 200
[Figure: SAM plot of observed versus expected values, with up-regulated and down-regulated genes indicated.]
Page 202
Descriptive statistics
Microarray data are high-dimensional: many thousands
of measurements are made from a small number of
samples.
Descriptive (exploratory) statistics help you to find
meaningful patterns in the data.
A first step is to arrange the data in a matrix.
Next, use a distance metric to define the relatedness
of the different data points. Two commonly used
distance metrics are:
-- Euclidean distance
-- Pearson coefficient of correlation
Page 203
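The two metrics can behave very differently on the same pair of genes, as this sketch with made-up expression profiles shows. The function names are illustrative.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Pearson coefficient of correlation between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

gene1 = [1.0, 2.0, 3.0]   # same shape as gene2, different magnitude
gene2 = [2.0, 4.0, 6.0]
print(euclidean_distance(gene1, gene2))   # sqrt(14), roughly 3.74
print(pearson_correlation(gene1, gene2))  # very close to 1
```

Euclidean distance treats these two genes as far apart, while the Pearson coefficient treats them as nearly identical because it responds to the shape of the profile rather than its magnitude.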
Data matrix (20 genes and 3 time points from Chu et al.)
Page 205
[Figure: 3D plot (using S-PLUS software) with axes t=0, t=0.5, and t=2.0]
Page 205
Descriptive statistics: clustering
Clustering algorithms offer useful visual descriptions
of microarray data.
Genes may be clustered, or samples, or both.
We will next describe hierarchical clustering.
This may be agglomerative (building up the branches
of a tree, beginning with the two most closely related
objects) or divisive (building the tree by finding the
most dissimilar objects first).
In each case, we end up with a tree having branches
and nodes.
Page 204
Algorithmic Techniques
• Hierarchical
• Partitioning methods (K-Means, K-Median)
• Neural Networks
• Self-Organizing Maps
• Principal Component Analysis
Agglomerative clustering
[Figure, built up over steps 0 to 4: a and b merge into (a,b); d and e merge into (d,e); c joins (d,e) to form (c,d,e); finally (a,b) and (c,d,e) merge into (a,b,c,d,e).]
…tree is constructed
Page 206
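The agglomerative procedure can be sketched with five objects placed on a line. The coordinates are hypothetical, chosen so that the merge order matches the a to e example, and single linkage (distance between the closest members of two clusters) is an assumed choice; the slides do not name a linkage rule.

```python
from itertools import combinations

def agglomerative(points):
    """Single-linkage agglomerative clustering; returns the merges in order."""
    clusters = [frozenset([name]) for name in points]
    merges = []
    while len(clusters) > 1:
        # Pick the two clusters whose closest members are nearest to each other.
        first, second = min(
            combinations(clusters, 2),
            key=lambda pair: min(abs(points[p] - points[q])
                                 for p in pair[0] for q in pair[1]),
        )
        clusters.remove(first)
        clusters.remove(second)
        clusters.append(first | second)
        merges.append("".join(sorted(first | second)))
    return merges

# Hypothetical one-dimensional positions for the five objects.
points = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 8.0, "e": 9.5}
print(agglomerative(points))  # ['ab', 'de', 'cde', 'abcde']
```

Each pass merges the two most closely related clusters, so the tree is built from the leaves toward the root.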
Divisive clustering
[Figure, built up over steps 4 to 0: starting from (a,b,c,d,e), the set is split into (a,b) and (c,d,e); (c,d,e) is split into c and (d,e); the remaining pairs are then split into the single objects a, b, d, and e.]
…tree is constructed
Page 206
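[Figure: the same tree of a, b, c, d, e read in both directions. Agglomerative: step axis 0 to 4, from single objects up to (a,b,c,d,e). Divisive: step axis 4 to 0, from (a,b,c,d,e) down to single objects.]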
Page 206
Agglomerative and divisive clustering sometimes give conflicting results, as shown here
[Figure: two clustering trees, each labeled 1 to 12]
Page 207
Cluster and TreeView
Page 208
Cluster and TreeView
[Screenshot: program tabs for clustering, K-means, SOM, and PCA]
Page 208
Two-way clustering of genes (y-axis) and cell lines (x-axis)
(Alizadeh et al., 2000)
Page 209
Self-organizing maps (SOM)
To download GeneCluster:
http://www.genome.wi.mit.edu/MPR/software.html
Self-organizing maps (SOM)
One chooses a geometry of 'nodes', for example a 3x2 grid
Page 210
http://www.genome.wi.mit.edu/MPR/SOM.html
Self-organizing maps (SOM)
The nodes are mapped into k-dimensional space,
initially at random and then successively adjusted.
Page 210
Self-organizing maps (SOM)
Page 211
Unlike k-means clustering, which is unstructured, SOMs allow one to impose
partial structure on the clusters. The principle of SOMs is as follows.
One chooses an initial geometry of “nodes” such as a 3 x 2 rectangular grid
(indicated by solid lines in the figure connecting the nodes). The figure shows
hypothetical trajectories of the nodes as they migrate to fit the data during
successive iterations of the SOM algorithm. Data points are represented by
black dots, the six nodes of the SOM by large circles, and trajectories by arrows.
Self-organizing maps (SOM)
Neighboring nodes tend to define 'related' clusters.
An SOM based on a rectangular grid thus is analogous
to an entomologist's specimen drawer in which
adjacent compartments hold similar insects.
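The SOM procedure described on these slides (nodes on a 3 x 2 grid, placed at random in k-dimensional space and then successively adjusted) can be sketched as below. The learning-rate and neighborhood schedules are assumptions picked for brevity; this is not the GeneCluster implementation.

```python
import math
import random

def train_som(data, rows=3, cols=2, iterations=200, seed=0):
    """Fit a rows x cols self-organizing map to a list of k-dimensional points."""
    rng = random.Random(seed)
    k = len(data[0])
    # One position in k-dimensional space per grid node, initially at random.
    nodes = {(r, c): [rng.random() for _ in range(k)]
             for r in range(rows) for c in range(cols)}
    for step in range(iterations):
        remaining = 1 - step / iterations
        rate = 0.5 * remaining            # learning rate shrinks over time
        radius = 0.5 + 1.5 * remaining    # grid neighborhood shrinks over time
        point = rng.choice(data)
        # The node closest to the data point in k-dimensional space.
        best = min(nodes, key=lambda n: math.dist(nodes[n], point))
        for grid_pos, position in nodes.items():
            grid_distance = math.dist(grid_pos, best)
            if grid_distance <= radius:
                # Grid neighbors of the best node move too, but less strongly.
                pull = rate * math.exp(-grid_distance ** 2 / (2 * radius ** 2))
                for d in range(k):
                    position[d] += pull * (point[d] - position[d])
    return nodes

# Made-up two-dimensional data forming two groups.
data = [(0.1, 0.1), (0.15, 0.2), (0.9, 0.8), (0.85, 0.9)]
nodes = train_som(data)
for grid_pos, position in sorted(nodes.items()):
    print(grid_pos, [round(v, 2) for v in position])
```

Because neighbors on the grid are dragged along with the best-matching node, adjacent nodes end up describing related clusters, which is the partial structure that k-means lacks.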
Two preprocessing steps are essential before applying SOMs:
1. Variation Filtering:
Data were passed through a variation filter to eliminate
those genes showing no significant change in
expression across the k samples. This step is needed
to prevent nodes from being attracted to large sets
of invariant genes.
2. Normalization:
The expression level of each gene was normalized
across experiments. This focuses attention on the
'shape' of expression patterns rather than absolute
levels of expression.
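Both preprocessing steps can be sketched briefly. The filter criterion (range across samples) and the rescaling to mean 0 and standard deviation 1 are assumptions; the slides do not state the exact rules used.

```python
from statistics import mean, pstdev

def variation_filter(genes, min_range):
    """Keep genes whose expression range across samples is at least min_range."""
    return {name: values for name, values in genes.items()
            if max(values) - min(values) >= min_range}

def normalize_gene(values):
    """Rescale one gene's profile to mean 0 and standard deviation 1."""
    center, spread = mean(values), pstdev(values)
    return [(v - center) / spread for v in values]

# Made-up expression values across three samples.
genes = {
    "flat": [5.0, 5.1, 5.0],        # essentially invariant: filtered out
    "low":  [1.0, 2.0, 3.0],
    "high": [100.0, 200.0, 300.0],  # same shape as "low", 100 times larger
}
kept = variation_filter(genes, min_range=0.5)
normalized = {name: normalize_gene(values) for name, values in kept.items()}
print(normalized)
```

After normalization the "low" and "high" genes have the same profile, so a clustering method groups them by the shape of their expression pattern.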
Principal components analysis (PCA),
an exploratory technique that reduces data dimensionality,
distinguishes lead-exposed from control cell lines
[Figure: samples plotted on principal component axis #1 (87%) versus principal component axis #2 (10%). Legend: Lead (P), Sodium (N), Control (C). Plotted samples: P1, P2, P3, P4, N2, N3, N4, C1, C2, C3, C4.]
Principal components analysis (PCA)
An exploratory technique used to reduce the
dimensionality of the data set to 2D or 3D
For a matrix of m genes x n samples, create a new
covariance matrix of size n x n
Thus transform some large number of variables into
a smaller number of uncorrelated variables called
principal components (PCs).
Page 211
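The steps above (build an n x n covariance matrix from an m x n data matrix, then extract principal components) can be sketched using only the standard library. Power iteration is an assumed shortcut that finds just the first component, and it can fail if the starting vector happens to be orthogonal to that component; real analyses would use a linear algebra library.

```python
import math

def covariance_matrix(data):
    """data: m rows (genes) x n columns (samples). Returns the n x n covariance."""
    m, n = len(data), len(data[0])
    means = [sum(row[j] for row in data) / m for j in range(n)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (m - 1)
             for j in range(n)] for i in range(n)]

def first_principal_component(cov, iterations=100):
    """Power iteration: (unit vector, variance) of the first principal component."""
    n = len(cov)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(n)) for i in range(n))
    return v, variance

# Made-up matrix: 4 genes x 2 samples, with sample 2 always twice sample 1.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
cov = covariance_matrix(data)
pc1, pc1_variance = first_principal_component(cov)
explained = pc1_variance / (cov[0][0] + cov[1][1])
print([round(x, 3) for x in pc1], round(explained, 3))
```

Here the two samples are perfectly correlated, so a single principal component accounts for essentially all of the variance, analogous to the percentages shown on PCA plot axes.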
Principal components analysis (PCA): objectives
• to reduce dimensionality
• to determine the linear combinations of variables that account for the most variance
• to choose the most useful variables (features)
• to visualize multidimensional data
• to identify groups of objects (e.g. genes/samples)
• to identify outliers
Page 211
http://www.okstate.edu/artsci/botany/ordinate/PCA.htm
Page 212
Use of PCA to demonstrate increased levels of gene
expression from Down syndrome (trisomy 21) brain
[Figure: PCA plot; label: Chr 21]